Linux Kernel & Device Driver Programming

Ch 5 - Concurrency and Race Conditions

 

Sources of Concurrency


Race Conditions


Context of Locking


Linux Kernel Semaphores

down_interruptible

Example


Reader/Writer Semaphores

down_read

Completions

see linux/completion.h and example from scullp

Linux Kernel Spinlocks

spin_lock_init, spin_unlock

Example

see scull

Spinlock Warnings

While a spinlock is held, preemption is disabled; code holding a spinlock must never sleep, and critical sections should be kept short.

Reader/Writer Spinlocks


General Locking Pitfalls

These apply to all forms of locks:

  • document assumptions about holding a lock on entry to and exit from a function
  • avoid writing functions for which lock holding is not invariant over the call
  • try to avoid holding more than one lock at a time
  • when holding more than one lock is unavoidable, use lock ordering to prevent deadlock; document the ordering convention, and follow it
  • always lock a local lock (used by just one module, or one device within a module) before locking a global lock (used by more than one module or device)

Alternatives to Locking


Lock-Free Algorithms

circular single-producer single-consumer buffer

Atomic Variables

Beware: a sequence of atomic operations is not itself atomic.

seqlocks


Read-Copy-Update

http://www.rdrop.com/users/paulmck/rclock/intro/rclock_intro.html


Spin Locks

See examples of use in access.c:

See also the separate explanation of implementation for i386 architecture.


Implementing a Blocking open

See example code in access.c:


Solving Problems with Race Conditions

Start with the simplest solution. Fall back to more complex solutions only if necessary.

  1. try to avoid concurrent access entirely
  2. use circular buffers
  3. use spinlocks
  4. use lock variables to "roll your own" solution

Circular Bounded Buffers (FIFO)

  • head
    - next location to put data in
  • tail
    - next location to take data out
  • head == tail
    - buffer is empty
  • (head+1) % BUFFER_SIZE == tail
    - buffer is full

See example in short.c.


Notes below this point not yet updated to 3rd edition of book and Kernel 2.6. However, most of this material overlaps with material from the new chapter on locks, or other material we have already covered. When I update this material I will probably move most of it to the new notes on Chapter 5 Concurrency and Race Conditions.

This is a very important programming solution, for a frequently occurring problem. I learned it first from the CDC NOS operating system, where it is the standard method of doing I/O. It was needed there because I/O was done by multiple parallel processors (so-called Peripheral Processors). I have seen it used over and over since then. This is something worth learning and remembering.

When you use this buffering technique you don't need any special locking mechanisms. The key is that the IN and OUT variables each have only one writer. These variables divide the buffer into two regions, each of which also has only one writer. You should always think first of this solution when you have a producer-consumer situation, such as between an interrupt handler and the other functions of a device driver. (At the same time, you need to recognize that it is limited to the producer-consumer problem.)


Using Spinlocks




Deadlock Hazards


Reader-Writer Spinlocks


Atomic Operations on Variables

while (test_and_set_bit(nr, addr) != 0) wait_a_while();
...critical section...
if (test_and_clear_bit(nr, addr) == 0) something_went_wrong();

Here the textbook's terminology is potentially confusing. The term "lock variable" is used here for any variable on which atomic operations are used directly, to avoid using the standard spinlock operations. That is, these are distinguished from objects of the types spinlock_t and rwlock_t.

The term "atomic" here is used in the strongest sense, meaning that the operations are indivisible and non-interleavable at the hardware level, even by interrupt handlers, and even on an SMP.

Beware: Do-it-yourself spinlocks like the one above are less efficient than the optimized in-line machine code of the standard spinlock operations. If you read the separate notes on the spinlock implementation, you saw how care was taken to use the data cache to keep a spinning CPU from tying up the (shared) memory bus. You also saw how pains were taken to minimize the length of the most frequently taken (no-contention) path through the lock operation.

One of the best uses of bit-level atomic operations on variables is to claim responsibility for doing something. In this application, the bit is set and never cleared again. There is no blocking or spinning. It goes something like this:

if (test_and_set_bit (0, &(object->claim_bit)) == 0)
      perform_operation_on (object);
/* else we can rely on some other process having already taken
   responsibility for performing the operation on this object */

The atomic increment operations are also useful for keeping counts in certain restricted situations. One situation is where the count is monotonic (and you can show that it will never wrap around). Another is where the count is a reference count, that starts out nonzero and is never incremented again after it goes to zero. This is an extension of the "claiming" idea above. Whatever process decrements the count to zero claims the responsibility for freeing the object.

if (atomic_dec_and_test (&(object->refcount)))
      destroy (object);

In general, if you find yourself writing code (in different places) to both set and clear the same bit, or to both increment a variable and decrement it or set it back to zero, you are probably in danger of a race between the setting and the clearing, or the incrementing and resetting to zero, and should probably be using one of the standard spinlock or wait_event macros.


Going to Sleep Without Races

"...things can happen in the time between when your driver decides to (put the calling process to) sleep and when the sleep_on call is actually performed..."


This is a very important programming solution, for a very general problem. I had to discover it for myself, years ago, since no one ever taught it to me. I learned later that it is essentially the only way to solve this problem, and that it is used in virtually every operating system, and in concurrent application code. It is worth really understanding this, and remembering it.

See also the note on problems with sleep_on in the Chapter 5 notes.


Full code of correct solution

wait_queue_t wait;
init_waitqueue_entry (&wait, current);
add_wait_queue (&short_queue, &wait);
while (1) {
   set_current_state (TASK_INTERRUPTIBLE);
   if (short_head != short_tail) /* the wakeup condition */ break;
   schedule ();
}
set_current_state (TASK_RUNNING);
remove_wait_queue (&short_queue, &wait);

Understanding wait_event and wait_event_interruptible


If you follow the above links you will find the most relevant parts of the code that implements these functions.

These notes explain the Linux spinlock implementation code for the i386 architecture, which is found in include/asm-i386/spinlock.h.

The code in the notes below may not exactly match the code accessible via the link above, mainly because the actual code is broken up into a set of macros and the code below has been combined into a single block. There may also have been changes to the Linux distribution code since the notes below were extracted from it. However, any such changes do not reduce the validity of the examples below for illustrating the principles and techniques.


Macros

The implementation uses macros, like the following:

#define spin_lock_init(x)       do { (x)->lock = 0; } while(0)

The above is semantically the same as "x->lock = 0;".


The "do { ... } while(0)" is a syntactic wrapper that prevents spin_lock_init from being used as an expression, i.e., makes sure it belongs to the syntactic category statement. In general, it is important for the execution order of synchronization operations to be uniquely defined, as is the case for statement execution order. In contrast, the compiler is free to reorder the evaluation of subexpressions, subject to the limits of parenthesization.

This is a conventional way of enclosing a block of code in a macro. To see why simple braces "{ ... }" would not do as well, consider the following:

        if (foo)
                spin_lock_init(x);
        else
                do_something_else();

This looks (and should be) perfectly legal, but if you protected the macro only with '{ }' you would get the following:

        if (foo)
                { (x)->lock = 0; };
        else
                do_something_else();

The {} block will be interpreted as the if-block (ok so far). The semicolon will be taken as a null statement. That is, since it is not an "else" the compiler concludes that the "if" doesn't have a matching "else". When it then sees the "else" on the next line it will treat that as a syntax error. A "do {} while(0)" is the correct way of making sure that the semicolon following the macro always gets eaten by the parser. A decent compiler will not generate any extra machine instructions for this degenerate form of loop, so it does not add execution time overhead.


Conditional Compilation

#ifndef __SMP__
...
#define spin_lock_init(lock)	do { } while(0)
...
#else	/* __SMP__ */
...
#define spin_lock_init(x)	do { (x)->lock = 0; } while(0)
...
#endif /* __SMP__ */

If the system is compiled without the symbol __SMP__ defined (the symbol selects support for Symmetric Multi-Processor, i.e., shared-memory multiprocessor, operation), there is no need for spin-locks, so in that case null bodies are provided for all the operations on spinlocks.


Representation of Spin Lock Object

typedef struct {
	volatile unsigned int lock;
} spinlock_t;

The value of x.lock is an integer. The value is zero if-and-only-if the lock is in the unlocked state.


What does the word "volatile" mean here? Why is this essential?


The Spin Lock Operation in Machine Code

#define spin_lock_string \
	"\n1:\t" \
	"lock ; btsl $0,%0\n\t" \
	"jc 2f\n" \
	".section .text.lock,\"ax\"\n" \
	"2:\t" \
	"testb $1,%0\n\t" \
	"jne 2b\n\t" \
	"jmp 1b\n" \
	".previous"

The Spin Lock Operation in Machine Code - Expanded

If the new-line and tab characters are expanded, the above in-line assembly language code appears as follows:

1:                        // a label
   lock                   // makes the next instruction atomic
   btsl $0,%0             // bit test and set long
   jc 2f                  // jump to label 2 if the bit was already set (lock held)
.section .text.lock,"ax"
2:
  testb $1,%0
  jne 2b
  jmp 1b
.previous

The Bit Test and Set Long Instruction

The effect of the instruction is explained in terms of register transfers as follows:

CF <- Bit(BitBase, BitOffset)
Bit(BitBase, BitOffset) <- 1

That is, the Carry Flag (CF) is set if the value of the 0th bit of operand %0 (the lock word) before the instruction executes is 1, and in any case the instruction sets the 0th bit of %0 to 1. Here, $0 denotes an immediate operand with value zero, giving the bit offset.

Thus, the net effect of the first three instructions is to jump to label 2 if the lock is not free, and otherwise seize the lock.


Trick to Reorder Code

.section .text.lock,"ax"

The above pseudo-operation causes the assembler to put the following block of code, up to the next ".previous" pseudo-operation, out-of-line, in a different text (i.e., code) section. (The "ax" has nothing to do with the EAX register. The flag "x" means executable, and the "a" means allocatable. ) We expect that usually the lock will be free and execution will fall through the "jc 2f" instruction. When it does it will fall through to the block of code following the ".previous" pseudo-operation. It is because the target is in a different text segment that the conditional jump uses a long address, indicated by the letter "l".


The Spin Loop

2:
  testb $1,%0    // test bit 0 of the lock's low-order byte
  jne 2b         // still set: lock still held, keep spinning
  jmp 1b         // bit cleared: go back to label 1 and retry the btsl

The above code does the "spinning", i.e., it repeatedly checks the value of the lock until the holder frees it, which the holder will do by setting the value of the lock to zero. This could have been done using a simple loop on the btsl, but each time that instruction executes it forces a "locked cycle" on the memory bus, which hurts performance for all CPUs. It is much more efficient to just do a normal fetch of the value of the lock until it is updated, since the fetch will come from the local cache of the CPU until the value in memory is changed by the lock holder. The CPU has bus-snooping logic that will notice the write, update (or invalidate) the local cache, and cause the new value of the lock to be fetched when it has changed. Of course, another CPU might also be trying to grab the lock, so it is necessary to go back to the btsl in order to repeat the attempt at atomically seizing the lock.


Weaknesses of Spin Locks

This kind of lock is not perfect. In particular, it has the following problems:

  1. While a CPU is waiting for the lock, it is doing no useful work.
  2. Without more information about the hardware, it is indeterminate which CPU will get the lock. In principle, if there is a high degree of contention for the lock, this permits unbounded blocking.
  3. Priority is not taken into account. That is, if jobs on several CPUs are contending for the lock, which CPU gets the lock is not determined by the priority of the job that is running on the CPU. With priority-based scheduling algorithms, such priority inversion is undesirable, because it reduces the schedulable CPU utilization.

The effects of problem (1) can be bounded if the lengths of the critical sections (blocks of code over which the lock is held) for each lock are kept very short, and the frequencies of execution of critical sections are kept low.

Problems (2) and (3) can be addressed by using a more sophisticated locking algorithm that takes priorities into account. It is an interesting question how much additional execution-time overhead is incurred by using such an algorithm (which is necessarily somewhat slower), and how the processor utilization lost to this overhead compares against the gain in utilizable processing capacity due to reduced priority inversion.


The Unlock Operation

#define spin_unlock_string \
	"lock ; btrl $0,%0"

The above Bit Test and Reset (btrl) instruction stores the value of the zeroth bit of the operand into the CF flag, and clears the bit in the memory operand.

© 1999, 2001, 2003, 2004 T. P. Baker. No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means without written permission. (Last updated by $Author: baker $ on $Date: 2010/06/07 14:29:15 $.)