Linux Kernel & Device Driver Programming

Spin Locks & Other Forms of Mutual Exclusion

This file uses the W3C HTML Slidy format. The "a" key toggles between one-slide-at-a-time and single-page mode, and the "c" key toggles on and off the table of contents. The ← and → keys can be used to page forward and backward. For more help on controls see the "help?" link at the bottom.

Basic Mutual Exclusion Mechanisms

  1. To exclude scheduled threads: disable preemption, in the thread scheduler
  2. To exclude interrupt handlers: mask or disable interrupts, in the hardware
  3. To exclude other processors: use spinlock (busy-wait), using a bus-atomic polling operation

This is not inherently a hierarchy; that is, none of these logical mechanisms completely subsumes the others. For example, disabling interrupts partially subsumes disabling preemption, since it prevents the scheduler from being invoked through an interrupt. However, care must still be taken not to call the scheduler indirectly through another kernel function.

On the other hand, they are usually implemented and are most sanely used in this hierarchical order, to prevent deadlocks.
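
To see how these layers combine in practice, here is a minimal sketch of a driver-style critical section (the device structure and field names are hypothetical, not from any real driver). spin_lock_irqsave() masks interrupts on the local CPU, disables preemption, and busy-waits for the lock to exclude other processors; spin_unlock_irqrestore() undoes all of this in reverse order.

#include <linux/spinlock.h>

struct my_device {                     /* hypothetical driver state */
        spinlock_t lock;
        int shared_count;
};

static void my_device_update(struct my_device *dev)
{
        unsigned long flags;

        /* excludes scheduled threads, local interrupt handlers,
           and other processors (mechanisms 1-3 above) */
        spin_lock_irqsave(&dev->lock, flags);
        dev->shared_count++;           /* critical section */
        spin_unlock_irqrestore(&dev->lock, flags);
}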

General spinlock abstraction

to claim a lock:

loop
   --  the following "test-and-set" pseudo-code
   --  must be implemented atomically at memory level 
   if not spinlock.locked then
      spinlock.locked := true;
      claim_succeeded := true;
   else
      claim_succeeded := false;
   end if;
   --  end of atomic section
   if claim_succeeded then exit; end if;
end loop;

to release lock:

spinlock.locked := false;

Instead of the test-and-set idiom, the above can also be stated in terms of decrement-and-test, test-and-decrement, increment-and-test, test-and-increment, compare-and-swap, etc.
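
As a concrete (user-space) rendering of the test-and-set form above, here is a minimal sketch using the GCC __sync_lock_test_and_set() builtin for the atomic step. It illustrates the abstraction only; it is not how the Linux kernel implements its spinlocks, and the type and function names are made up.

typedef struct { volatile int locked; } my_spinlock_t;   /* 0 = free, 1 = held */

static void my_spin_lock(my_spinlock_t *s)
{
        /* __sync_lock_test_and_set() atomically stores 1 and returns the
           previous value; the lock is ours once that previous value was 0 */
        while (__sync_lock_test_and_set(&s->locked, 1))
                ;                       /* busy-wait */
}

static void my_spin_unlock(my_spinlock_t *s)
{
        __sync_lock_release(&s->locked);        /* atomically store 0 */
}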

General properties of spinlocks

Linux Spinlocks

Abbreviated History of Locks (Mutual Exclusion) in Linux

I regret having to say that the spinlock code has exploded in complexity over the years. Much of this complexity comes from the addition of system tuning and debugging features, including tracing and consistency checking. The desire to add consistency checks is understandable, as errors in lock usage are potentially very difficult to detect and correct. It is just a shame to see how complicated and difficult to read the code has become.

The following notes are based on the Linux spinlock implementation code for the i386 architecture, as of Linux kernel version 2.6.25.3. The spinlock implementation code is a moving target, so the current version is likely to differ in details. However, the underlying principles and techniques should be fairly invariant.

The code in the notes below does not all exactly match the code in that kernel version, mainly because the actual code is broken up into a set of macros and the code below has been combined into a single block. There may also have been changes to the Linux distribution code since the notes below were extracted from it. However, any such changes do not reduce the validity of the examples below for illustrating the principles and techniques.

Macros

The implementation uses macros, like the following:

#ifdef CONFIG_DEBUG_SPINLOCK
  extern void __spin_lock_init(spinlock_t *lock, const char *name,
                               struct lock_class_key *key);
#define spin_lock_init(lock)                         \
do {                                                 \
        static struct lock_class_key __key;          \
                                                     \
        __spin_lock_init((lock), #lock, &__key);     \
} while (0)

#else
#define spin_lock_init(lock)                         \
        do { *(lock) = SPIN_LOCK_UNLOCKED; } while (0)
#endif

If we ignore the debugging case and error-checking code, and work our way through the layers of macro calls, the above functionally reduces to

lock->raw_lock.slock = 0

The "do{ ... }while()" is a syntactic wrapper that prevents spin_lock_init from being used as an expression, i.e., makes sure it belongs to the syntactic category statement. In general, it is important for the execution order of synchronization operations to be uniquely defined, as is the case for statement execution order. In contrast, the compiler is free to reorder the evaluation of subexpressions, subject to the limits of parenthesization.

This is a conventional way of enclosing a block of code in a macro. To see why simple braces "{ ... }" would not do as well, consider the following:

   if (foo)
      spin_lock_init(lock);
   else
      do_something_else();

This looks (and should be) perfectly legal, but if you protected the macro only with '{ }' you would get the following:

   if (foo)
      { lock->raw_lock.slock = 0; };
   else
      do_something_else();

The {} block will be interpreted as the if-block (OK so far). The semicolon will then be taken as a null statement; since it is not an "else", the compiler concludes that the "if" has no matching "else". When it then sees the "else" on the next line, it reports a syntax error. A "do {} while(0)" is the correct way of making sure that the semicolon following the macro always gets eaten by the parser. A decent compiler will not generate any extra machine instructions for this degenerate form of loop, so it does not add execution-time overhead.

Conditional Compilation

If the kernel is compiled without the SMP configuration option (SMP stands for Symmetric Multi-Processor) there is no need for spin-locks, except for the side-effect of preventing preemption, so in that case reduced bodies are provided for all the operations on spinlocks.

In an earlier version of Linux this was once fairly clear. For example:

#ifndef __SMP__
...
#define spin_lock_init(lock)    do { } while(0)
...
#else   /* __SMP__ */
...
#define spin_lock_init(x)       do { (x)->lock = 0; } while(0)
...
#endif /* __SMP__ */

However, with the addition of options for checking spinlock usage, this has been obscured in the present spinlock_types.h:

#if defined(CONFIG_SMP)
#include <asm/spinlock_types.h>
#else
#include <linux/spinlock_types_up.h>
#endif

Still, if you work your way through the conditional code and macro definitions for the uniprocessor (UP) case with no debugging and checking options, the above effectively suppresses code generation for calls to spin_lock_init(), by declaring raw_spinlock_t to be an empty struct, and __RAW_SPIN_LOCK_UNLOCKED to be empty braces ("{ }"). Similarly, if CONFIG_SMP is not defined, everything except the preemption-control side effect of the other spinlock operations is suppressed.
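
For reference, the non-debug uniprocessor definitions in linux/spinlock_types_up.h look roughly like the following (paraphrased from memory; check the actual header for the exact text):

/* UP case, CONFIG_DEBUG_SPINLOCK not set (paraphrased) */
typedef struct { } raw_spinlock_t;     /* empty struct: no lock data at all */

#define __RAW_SPIN_LOCK_UNLOCKED { }   /* empty initializer */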

Representation of Spin Lock Object

typedef struct {
        raw_spinlock_t raw_lock;
#ifdef CONFIG_GENERIC_LOCKBREAK
        unsigned int break_lock;
#endif
#ifdef CONFIG_DEBUG_SPINLOCK
        unsigned int magic, owner_cpu;
        void *owner;
#endif
#ifdef CONFIG_DEBUG_LOCK_ALLOC
        struct lockdep_map dep_map;
#endif
} spinlock_t;

If you strip off the debugging extras, and get down to the raw spinlock object for the SMP case for the i386, it looks like this:

typedef struct raw_spinlock{
        unsigned int slock;
} raw_spinlock_t;

The value of lock.raw_lock.slock is an integer. The value is initialised to zero (__RAW_SPIN_LOCK_UNLOCKED), indicating the lock is in the unlocked state.

The spinlock value actually consists of two 8-bit fields. Any pair of matching fields indicates the lock is free. Sixteen bits of zero (high=0, low=0) is just one such case.

Earlier versions of Linux had the raw lock value field slock declared as "volatile". Do you know what that means? Why is that not done in the 2.6.25 kernel?

The spin_lock Operation

void __lockfunc _spin_lock(spinlock_t *lock)
{
        preempt_disable();
        spin_acquire(&lock->dep_map, 0, 0, _RET_IP_);
        LOCK_CONTENDED(lock, _raw_spin_trylock, _raw_spin_lock);
}

The call to spin_acquire appears to be entirely for debugging and tracing.

The real locking is done in the macro LOCK_CONTENDED, which eventually calls the function _raw_spin_lock(lock), which is passed in as its third parameter.

Note that __lockfunc is a macro that specifies some gcc compilation and linkage options to optimize performance.
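
If lock statistics (CONFIG_LOCK_STAT) are not configured, LOCK_CONTENDED is essentially a pass-through. The definition below is paraphrased and may differ in detail from the actual header, but the net effect is that _spin_lock boils down to disabling preemption and then calling _raw_spin_lock:

/* non-CONFIG_LOCK_STAT case, paraphrased */
#define LOCK_CONTENDED(_lock, try, lock) \
        lock(_lock)

/* so _spin_lock(lock) effectively becomes: */
preempt_disable();
_raw_spin_lock(lock);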

"Ticket Lock" Principles of Operation

Quoted from the Linux spinlock code comments:

Ticket locks are conceptually two parts, one indicating the current head of the queue, and the other indicating the current tail. The lock is acquired by atomically noting the tail and incrementing it by one (thus adding ourself to the queue and noting our position), then waiting until the head becomes equal to the initial value of the tail.

We use an xadd covering *both* parts of the lock, to increment the tail and also load the position of the head, which takes care of memory ordering issues and should be optimal for the uncontended case. Note the tail must be in the high part, because a wide xadd increment of the low part would carry up and contaminate the high part.
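
To make the head/tail idea concrete before looking at the machine code, here is a minimal, portable sketch of a ticket lock using GCC atomic builtins. It is purely illustrative: the names are made up, and the kernel packs both counters into the two bytes of slock, as shown below.

typedef struct {
        volatile unsigned int head;   /* ticket currently being served */
        volatile unsigned int tail;   /* next ticket to hand out */
} ticket_lock_t;                      /* illustrative type, not the kernel's */

static void ticket_lock(ticket_lock_t *l)
{
        /* atomically note the tail and increment it: take a ticket */
        unsigned int my_ticket = __sync_fetch_and_add(&l->tail, 1);

        /* spin until the head reaches our ticket number */
        while (l->head != my_ticket)
                ;                     /* busy-wait */
}

static void ticket_unlock(ticket_lock_t *l)
{
        __sync_fetch_and_add(&l->head, 1);   /* serve the next waiter */
}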

The Spin Lock Machine Code for x86

Delving through the macro and function calls for __raw_spin_lock, and expanding the various macros, we eventually obtain the equivalent of the following machine code, for the case where the number of CPUs is less than 256.

       short inc = 0x0100;
       ...
       lock;
       xaddw %bx, lock->slock
1:
       cmpb %bh, %bl
       je 2f
       rep; nop  /* encodes the PAUSE instruction: a hint to the CPU that this is a spin-wait loop */
       movb lock->slock, %bl
       /* don't need lfence here, because loads are in-order */
       jmp 1b
2:

I used the following facts about gcc in-line assembly code to derive the above from the in-line gcc assembly code:

It is not reasonable for anyone to remember these details unless they work on a daily basis with x86 machine code and gcc. If you understand the basic principles of assembly code, you can look them up whenever you need them, as I did.

The actual code also includes the following:

      .section .smp_locks,"a"
       .balign 4 /* word alignment */
       .long 661f /* address */
       .previous
661:

The ".section" pseudo-operation causes the assembler to put the following block of code, up to the next ".previous" pseudo-operation, out-of-line, in a different text (i.e., code) section. In this case, it is appending an entry to a table of entry point addresses for the spinlock routines.

You can ignore this, for the purpose of understanding how the spinlocks work.

Step-By-Step Explanation

  1. lock; xaddw %bx, lock->slock
    • The prefix "lock" insures that the memory bus is locked for full completion of the following instruction, so that it appears atomic to other processors that share access to the lock.
    • The instruction "xaddw %bx, lock->slock" does an "exchange-and-add-word"; that is it does the equivalent of the following:
          temp = lock->slock;
           lock-> = temp + %bx;
           %bx = temp
      
    • The effect is to increment the high byte of the lock by one, and return the pre-increment value of the lock in register %bx
  2. 1: cmpb %bh, %bl; je 2f
    • The operation "cmpb %bh %bl" compares the low and high bytes of the register %bx.
    • The comparision will succeed if and only iff the two parts of the old (pre-increment) value of the lock were identical, which by convention means the lock is free.
    • If the lock is free, we jump to label 2, and return immediately.
  3. movb lock->slock, %bl; jmp 1b
    • The operation "movb lock->slock, %bl" sets the low byte of %bx to the low byte of lock->slock, which may still be the value it was when we just tried to seize the lock, or or it may have been incremented by an unlock operation.
    • The operation "jmpb 1b" takes us back to the label "1:", where we retry the comparison.
    • Eventually, an unlock operation will increment the low-order byte of "lock->slock" to match the high byte of our %bx and the comparison will succeed, meaning we have the lock.

Weaknesses of Spin Locks

  1. No useful work is done while a CPU is waiting for the lock
  2. It may be indeterminate which CPU will get the lock; if there is a high degree of contention for the lock, there may be unbounded blocking. However, the "ticket lock" implementation above prevents this.
  3. Priority is not taken into account. That is, if jobs on several CPUs are contending for the lock, which CPU gets the lock is not determined by the priority of the job that is running on the CPU. With priority-based scheduling algorithms, such priority inversion is undesirable, because it reduces the schedulable CPU utilization.

However, with a multiprocessor system, spinlocks are generally a necessity.

The effects of problem (1) can be bounded if the lengths of the critical sections (blocks of code over which the lock is held) for each lock are kept very short, and the frequencies of execution of critical sections are kept low.

Problem (2) can be addressed as in the x86 example.

Problem (3) can be addressed by using a more sophisticated locking algorithm that takes priorities into account. It is an interesting question how much additional execution time overhead is incurred by using such an algorithm (which is necessarily somewhat slower), and how the processor utilization lost due to this overhead compares against the gain in utilizable processing capacity due to reduced priority inversion.

Special Kinds of Spinlocks

Think hard before deciding to use one of these specialized locks. Have you done a performance study to determine whether it is helping or hurting?

Other Synchronization Mechanisms Used in Linux

Sorry, I have not had time to write up anything more detailed on these, which do deserve attention.

RCU References

RCU and Unloadable Modules
Hierarchical RCU
Scaling dcache with RCU
Kernel Korner - Using RCU in the Linux 2.5 Kernel
Wikipedia: Read-copy-update