This file uses the W3C HTML Slidy format. The "a" key toggles between one-slide-at-a-time and single-page mode, and the "c" key toggles the table of contents on and off. The ← and → keys can be used to page forward and backward. For more help on controls see the "help?" link at the bottom.
Image from http://rogerpielkejr.blogspot.com/2009/12/spinning-literature.html
This is not inherently a hierarchy; that is, none of these logical mechanisms completely subsumes the others. For example, disabling interrupts partially subsumes disabling preemption, since it prevents the scheduler from being invoked through an interrupt; however, care must still be taken not to invoke the scheduler indirectly through some other kernel function call.
On the other hand, these mechanisms are usually implemented, and are most sanely used, in this hierarchical order, to prevent deadlocks.
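As a concrete illustration (mine, not from the original slides), the standard Linux interface spin_lock_irqsave()/spin_unlock_irqrestore() combines the mechanisms in exactly this order: it disables local interrupts and then takes the spinlock (which, as we will see, also disables preemption). The lock and function names here are hypothetical:

static spinlock_t my_lock;                         /* hypothetical lock protecting some shared data */

static void update_shared_data(void)               /* hypothetical caller */
{
        unsigned long flags;

        spin_lock_irqsave(&my_lock, flags);        /* disable local interrupts, saving their prior state, then take the lock */
        /* ... short critical section, no sleeping ... */
        spin_unlock_irqrestore(&my_lock, flags);   /* release the lock, then restore the saved interrupt state */
}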
to claim a lock:
    loop
        -- the following "test-and-set" pseudo-code
        -- must be implemented atomically at the memory level
        if not spinlock.locked then
            spinlock.locked := true;
            claim_succeeded := true;
        else
            claim_succeeded := false;
        end if;
        -- end of atomic section
        if claim_succeeded then exit; end if;
    end loop;
to release the lock: spinlock.locked := false;
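The same idiom can be written in C using the GCC atomic built-ins. The following is my own illustrative sketch, not kernel code; the type and function names are invented for the example:

typedef struct { unsigned char locked; } my_spinlock_t;      /* hypothetical type: 0 = free, 1 = held */

static void my_spin_lock(my_spinlock_t *lk)
{
        /* __atomic_test_and_set atomically sets the byte and returns its previous value */
        while (__atomic_test_and_set(&lk->locked, __ATOMIC_ACQUIRE))
                ;                                             /* spin until the previous value was 0 (lock was free) */
}

static void my_spin_unlock(my_spinlock_t *lk)
{
        __atomic_clear(&lk->locked, __ATOMIC_RELEASE);        /* store 0, releasing the lock */
}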
Instead of the test-and-set idiom, the above can also be stated in terms of decrement-and-test, test-and-decrement, increment-and-test, test-and-increment, compare-and-swap, etc.
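For example, a compare-and-swap version of the claim operation might look like this (again my own sketch with the GCC built-ins, using the hypothetical my_spinlock_t type above):

static void my_spin_lock_cas(my_spinlock_t *lk)
{
        unsigned char expected;

        do {
                expected = 0;         /* the swap can only succeed if the lock currently reads 0 (free) */
        } while (!__atomic_compare_exchange_n(&lk->locked, &expected, 1, 0,
                                              __ATOMIC_ACQUIRE, __ATOMIC_RELAXED));
}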
Image from http://www.soc.hawaii.edu/leonj/leonj/leonpsy16/g16reports-instructions.html
This slide has additional "handout" notes. You need to use the 'a' key to toggle into handout mode, in order to see them.
I regret having to say that the spinlock code has exploded in complexity over the years. Much of this complexity comes from the addition of system tuning and debugging features, including tracing and consistency checking. The desire to add consistency checks is understandable, as errors in lock usage are potentially very difficult to detect and correct. It is just a shame to see how complicated and difficult to read the code has become.
The following notes are based on the Linux spinlock implementation code for the i386 architecture, as of Linux kernel version 2.6.25.3. The spinlock implementation code is a moving target, so the current version is likely to differ in details. However, the underlying principles and techniques should be fairly invariant.
The code in the notes below does not all exactly match the code in that kernel version, mainly because the actual code is broken up into a set of macros and the code below has been combined into a single block. There may also have been changes to the Linux distribution code since the notes below were extracted from it. However, any such changes do not reduce the validity of the examples below for illustrating the principles and techniques.
The implementation uses macros, like the following:
#ifdef CONFIG_DEBUG_SPINLOCK
extern void __spin_lock_init(spinlock_t *lock, const char *name,
                             struct lock_class_key *key);
#define spin_lock_init(lock)                             \
do {                                                     \
        static struct lock_class_key __key;              \
                                                         \
        __spin_lock_init((lock), #lock, &__key);         \
} while (0)
#else
#define spin_lock_init(lock)                             \
        do { *(lock) = SPIN_LOCK_UNLOCKED; } while (0)
#endif
If we ignore the debugging case and error-checking code, and work our way through the layers of macro calls, the above functionally reduces to
lock->raw_lock.slock = 0
where zero is the unlocked value, as discussed below.
This slide has additional "handout" notes. You need to use the 'a' key to toggle into handout mode, in order to see them.
The "do{ ... }while()" is a syntactic wrapper that prevents spin_lock_init from being used as an expression, i.e., makes sure it belongs to the syntactic category statement. In general, it is important for the execution order of synchronization operations to be uniquely defined, as is the case for statement execution order. In contrast, the compiler is free to reorder the evaluation of subexpressions, subject to the limits of parenthesization.
This is a conventional way of enclosing a block of code in a macro. To see why simple braces "{ ... }" would not do as well, consider the following:
if (foo)
        spin_lock_init(lock);
else
        do_something_else();
This looks (and should be) perfectly legal, but if you protected the macro only with '{ }' you would get the following:
if (foo)
        { lock->raw_lock.slock = 0; };
else
        do_something_else();
The {} block will be interpreted as the if-block (ok so far). The semicolon will be taken as a null statement. That is, since it is not an "else" the compiler concludes that the "if" doesn't have a matching "else". When it then sees the "else" on the next line it will treat that as a syntax error. A "do {} while(0)" is the correct way of making sure that the semicolon following the macro always gets eaten by the parser. A decent compiler will not generate any extra machine instructions for this degenerate form of loop, so it does not add execution time overhead.
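Putting the pieces together, here is a minimal self-contained example (mine, not kernel code) that compiles cleanly with the do/while form, but would produce exactly the syntax error described above if the macro expanded to a bare brace block:

void do_something_else(void);

#define INIT_LOCK(l) do { *(l) = 0; } while (0)      /* simplified stand-in for spin_lock_init */

void example(int foo, int *lock)
{
        if (foo)
                INIT_LOCK(lock);                     /* the trailing semicolon completes the do/while statement */
        else
                do_something_else();                 /* so this else still binds to the if above */
}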
If the kernel is compiled without the SMP configuration option (SMP stands for Symmetric Multi-Processor), there is no need for spinlocks except for their side-effect of preventing preemption, so in that case reduced bodies are provided for all of the spinlock operations.
In an earlier version of Linux this was once fairly clear. For example:
#ifndef __SMP__
...
#define spin_lock_init(lock) do { } while(0)
...
#else /* __SMP__ */
...
#define spin_lock_init(x) do { (x)->lock = 0; } while(0)
...
#endif /* __SMP__ */
However, with the addition of options for checking spinlock usage, this has been obscured in the present spinlock_types.h:
#if defined(CONFIG_SMP)
# include <asm/spinlock_types.h>
#else
# include <linux/spinlock_types_up.h>
#endif
Still, if you work your way through the conditional code and macro definitions for the uniprocessor (UP) case with no debugging and checking options, the above has the effect of suppressing code generation for calls to spin_lock_init(), by declaring raw_spinlock_t to be an empty struct and __RAW_SPIN_LOCK_UNLOCKED to be empty braces ("{ }"). Similarly, if CONFIG_SMP is not defined, everything except the preemption-disabling side-effect of the other spinlock operations is suppressed.
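A rough sketch of how this works out in the UP case, simplified from the description above (this is my paraphrase, not the literal kernel source):

typedef struct { } raw_spinlock_t;                   /* empty struct (a gcc extension): no lock state at all */
#define __RAW_SPIN_LOCK_UNLOCKED (raw_spinlock_t) { }

static inline void up_spin_lock_init(raw_spinlock_t *lock)   /* hypothetical name, for illustration */
{
        *lock = __RAW_SPIN_LOCK_UNLOCKED;            /* there is nothing to copy, so the compiler emits no code */
}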
The full spinlock_t, with all of the optional debugging fields, is declared as follows:
typedef struct {
        raw_spinlock_t raw_lock;
#ifdef CONFIG_GENERIC_LOCKBREAK
        unsigned int break_lock;
#endif
#ifdef CONFIG_DEBUG_SPINLOCK
        unsigned int magic, owner_cpu;
        void *owner;
#endif
#ifdef CONFIG_DEBUG_LOCK_ALLOC
        struct lockdep_map dep_map;
#endif
} spinlock_t;
If you strip off the debugging extras, and get down to the raw spinlock object for the SMP case for the i386, it looks like this:
typedef struct raw_spinlock { unsigned int slock; } raw_spinlock_t;
The value of lock.raw_lock.slock is an integer. It is initialised to zero (__RAW_SPIN_LOCK_UNLOCKED), indicating that the lock is in the unlocked state.
This slide has additional "handout" notes. You need to use the 'a' key to toggle into handout mode, in order to see them.
The spinlock value actually consists of two 8-bit fields. Any pair of matching fields indicates the lock is free. Sixteen bits of zero (high=0, low=0) is just one such case.
Earlier versions of Linux had the raw lock value field slock declared as "volatile". Do you know what that means? Why is that not done in the 2.6.25 kernel?
void __lockfunc _spin_lock(spinlock_t *lock)
{
        preempt_disable();
        spin_acquire(&lock->dep_map, 0, 0, _RET_IP_);
        LOCK_CONTENDED(lock, _raw_spin_trylock, _raw_spin_lock);
}
The call spin_acquire appears to be entirely for debugging and tracing.
The real locking is done in the macro LOCK_CONTENDED, which eventually calls the function _raw_spin_lock(lock), which is passed in as its third parameter.
Note that __lockfunc is a macro that specifies some gcc compilation and linkage options to optimize performance.
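In other words, with the debugging, tracing, and lock-statistics options configured out, _spin_lock is functionally equivalent to something like the following (my simplification, not the literal kernel code):

void _spin_lock(spinlock_t *lock)
{
        preempt_disable();        /* a held spinlock must also keep its holder from being preempted */
        _raw_spin_lock(lock);     /* the architecture-specific spin loop, examined below */
}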
Quoted from the Linux spinlock code comments:
Ticket locks are conceptually two parts, one indicating the current head of the queue, and the other indicating the current tail. The lock is acquired by atomically noting the tail and incrementing it by one (thus adding ourself to the queue and noting our position), then waiting until the head becomes equal to the initial value of the tail.
We use an xadd covering *both* parts of the lock, to increment the tail and also load the position of the head, which takes care of memory ordering issues and should be optimal for the uncontended case. Note the tail must be in the high part, because a wide xadd increment of the low part would carry up and contaminate the high part.
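The algorithm described in that comment can be sketched in C with the GCC atomic built-ins. This is my own illustration of the ticket-lock idea, with invented names, assuming a little-endian byte order as on the i386; the kernel itself uses the xadd assembly shown below:

typedef union {
        unsigned short slock;               /* both parts together, so one xadd-style operation covers them */
        struct {
                unsigned char head;         /* low byte: ticket currently being served */
                unsigned char tail;         /* high byte: next ticket to hand out */
        } t;
} ticket_lock_t;

static void ticket_lock(ticket_lock_t *lk)
{
        /* atomically add 1 to the tail and fetch the old value of both bytes */
        unsigned short old = __atomic_fetch_add(&lk->slock, 0x0100, __ATOMIC_ACQUIRE);
        unsigned char my_ticket = (unsigned char)(old >> 8);    /* the tail value we claimed */

        /* spin until the head reaches our ticket number */
        while (__atomic_load_n(&lk->t.head, __ATOMIC_ACQUIRE) != my_ticket)
                ;                           /* the real code adds a pause ("rep; nop") hint here */
}

static void ticket_unlock(ticket_lock_t *lk)
{
        /* only the holder advances the head, handing the lock to the next waiter in FIFO order */
        __atomic_store_n(&lk->t.head, (unsigned char)(lk->t.head + 1u), __ATOMIC_RELEASE);
}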
Working through the macro and function calls for __raw_spin_lock, and expanding them, we eventually obtain the equivalent of the following code, for the case where the number of CPUs is less than 256.
short inc = 0x0100;
...
lock; xaddw %bx, lock->slock
1:
cmpb %bh, %bl
je 2f
rep; nop   /* "rep; nop" encodes the PAUSE instruction, a hint to the processor that this is a spin-wait loop */
movb lock->slock, %bl
/* don't need lfence here, because loads are in-order */
jmp 1b
2:
This slide has additional "handout" notes. You need to use the 'a' key to toggle into handout mode, in order to see them.
I used the following facts about gcc in-line assembly code to derive the above from the in-line gcc assembly code:
It is not reasonable for anyone to remember these details, unless they work on a daily basis with x86 machine code and gcc. If you understand the basic principles of assembly code, you can look them up whenever you need them, as I did.
The actual code also includes the following:
.section .smp_locks,"a"
.balign 4 /* word alignment */
.long 661f /* address */
.previous
661:
The ".section" pseudo-operation causes the assembler to put the following block of code, up to the next ".previous" pseudo-operation, out-of-line, in a different text (i.e., code) section. In this case, it is appending an entry to a table of entry point addresses for the spinlock routines.
You can ignore this, for the purpose of understanding how the spinlocks work.
lock; xaddw %bx, lock->slock
        /* performed atomically, this is equivalent to: */
        temp = lock->slock;
        lock->slock = temp + %bx;
        %bx = temp
1: cmpb %bh, %bl; je 2f
        /* compare our ticket (%bh, the old tail) with the current head (%bl); if they are equal, the lock is ours */
movb lock->slock, %bl; jmp 1b
        /* otherwise, reload the head byte from memory and test again */
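A short worked trace (my own numbers, assuming the release path increments the low, head byte) may make this concrete. Suppose the lock starts out free with slock = 0x0000:

CPU A locks:   xaddw returns 0x0000 and slock becomes 0x0100; %bh = 0x00 equals %bl = 0x00, so A holds the lock.
CPU B locks:   xaddw returns 0x0100 and slock becomes 0x0200; %bh = 0x01 does not equal %bl = 0x00, so B spins, reloading %bl.
CPU A unlocks: the head byte is incremented and slock becomes 0x0201.
CPU B:         reloads %bl = 0x01, which now equals %bh = 0x01, so B acquires the lock.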
However, with a multiprocessor system, spinlocks are generally a necessity.
This slide has additional "handout" notes. You need to use the 'a' key to toggle into handout mode, in order to see them.
The effects of problem (1) can be bounded if the lengths of the critical sections (blocks of code over which the lock is held) for each lock are kept very short, and the frequencies of execution of critical sections are kept low.
Problem (2) can be addressed as in the x86 example.
Problem (3) can be addressed by using a more sophisticated locking algorithm that takes priorities into account. It is an interesting question how much additional execution time overhead is incurred by using such an algorithm (which is necessarily somewhat slower), and how the processor utilization lost to this overhead compares against the gain in utilizable processing capacity due to reduced priority inversion.
Think hard before deciding to use one of these specialized locks. Have you done a performance study to determine whether it is helping or hurting?
How long do we need to wait until the old copy can be deleted?
What happens with concurrent updates?
Sorry, I have not had time to write up anything more detailed on these, which do deserve attention.