Lecture 15: Fault Tolerance, 2-Phase Commit

These topics are from Chapter 13 (Fault Tolerance) in Advanced Concepts in OS.

Topics for Today

What is Fault Tolerance?

Issues in Fault Tolerance

Atomic Actions

appear to other processes as if they were

Transaction

Transaction: Example

Process A                |  Process B
---------                |  ---------
...                      |  ...
Lock (X); Lock (Y);      |  Lock (X); Lock (Y);
Tmp := X;                |  Tmp := X;
X := Y;                  |  X := Y;
Y := Tmp;                |  Y := Tmp;
Unlock (Y); Unlock (X);  |  Unlock (Y); Unlock (X);
...                      |  ...
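
The swap above can be tried out with a minimal Python sketch. It assumes two threads stand in for Process A and Process B, and that X and Y are each guarded by a separate lock; the names `run_two_processes`, `lock_x`, and `lock_y` are illustrative and not part of the original example.

```python
import threading

def run_two_processes(x, y):
    """Run the swap transaction in two threads and return the final (X, Y)."""
    state = {"X": x, "Y": y}
    lock_x = threading.Lock()   # guards X
    lock_y = threading.Lock()   # guards Y

    def swap():
        # Both threads take the locks in the same order (X, then Y),
        # so neither can hold Y while waiting for X.
        with lock_x:
            with lock_y:
                tmp = state["X"]
                state["X"] = state["Y"]
                state["Y"] = tmp
        # Leaving the with-blocks releases Y first, then X,
        # matching the Unlock order in the example.

    a = threading.Thread(target=swap)
    b = threading.Thread(target=swap)
    a.start()
    b.start()
    a.join()
    b.join()
    return state["X"], state["Y"]
```

Because each swap runs entirely under both locks, the two swaps are serialized and the values end up back where they started, whichever thread goes first.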

2-Phase Commit

General's Paradox

This is an abstraction of a core problem in the design of commit protocols.

There is no adequate protocol which sends the messengers a fixed number of times.

Why?


Nonexistence Argument

An adequate protocol may require an unbounded number of messages,
in the presence of an unbounded number of lost messages.
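
The difficulty with a fixed number of messages can be illustrated with a small sketch. This is not a proof, only a simulation of one hypothetical family of protocols, under these assumptions: two generals (0 and 1) alternate messages, a general sends its next message only after receiving the previous one, and a general attacks only if every message addressed to it arrived. The function name and the decision rule are inventions for the illustration.

```python
def run_fixed_protocol(num_messages, lost):
    """Return (general 0 attacks, general 1 attacks) for one run.

    Message k (0-based) is sent by general k % 2 to the other general.
    `lost` is the set of message indices that never arrive.
    """
    expected = [0, 0]
    for k in range(num_messages):
        expected[(k + 1) % 2] += 1      # receiver of message k

    received = [0, 0]
    for k in range(num_messages):
        if k in lost:
            break                       # later messages are never sent
        received[(k + 1) % 2] += 1

    # Assumed rule: attack only if nothing addressed to you went missing.
    return tuple(received[g] == expected[g] for g in (0, 1))
```

In this model, losing any message but the last leaves both generals short of a message, so neither attacks. Losing the last message is the troublesome case: its sender has everything it expected and attacks, while its receiver does not. Dropping that message from the protocol only moves the problem to the new last message.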

2-Phase Commit Protocol

2-Phase Commit Coordinator, with States

The "state" of the protocol is an abstraction of the location in the code where it is executing and the values of local variables. The following coding of the algorithm assigns a value to a variable State to make the notion of state explicit. Transitions are triggered by receipt of (sets of) messages, and result in the transmission of messages.

   State := q_1;
   send COMMIT_REQUEST message to every cohort;
   State := w_1;
   wait for replies from all cohorts;
   if some cohort replied ABORT then
      send ABORT to all cohorts;
      State := a_1;
   else
      write COMMIT record to log;
      send COMMIT message to all cohorts;
      State := c_1;
   end if;
   loop
      wait for replies, with timeout;
      exit when all cohorts have replied;
      if State = a_1 then 
         resend ABORT message to cohorts that have not yet replied;
      else
         resend COMMIT message to cohorts that have not yet replied;
      end if;
   end loop;
   write COMPLETE record to log;
   State := f_1;
   

2-Phase Commit Cohorts, with States

   State := q_i;
   await COMMIT_REQUEST message;
   if transaction successful then
      write UNDO and REDO records to log;
      send AGREED message to coordinator;
      State := w_i;
   else
      send ABORT message to coordinator;
      State := a_i;
   end if;
   await COMMIT/ABORT message;
   if message = ABORT then
      undo transaction, using UNDO log record;
      release all resources and locks for this transaction;
      send ACK to coordinator;
      State := b_i;
   else
      release all resources and locks for this transaction;
      send ACK to coordinator;
      State := c_i;
   end if;
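
The two roles can be exercised together in a single-process Python sketch. It is a simplification under stated assumptions, not the protocol as deployed: messages are modelled as direct method calls, a lost ACK is simulated by the hypothetical `lose_first_ack` parameter, and the cohort's abort outcomes are collapsed into a single state `a`. The class and function names are illustrative.

```python
class Cohort:
    """One cohort site; the state letters q, w, a, c follow the notes."""

    def __init__(self, name, will_succeed=True):
        self.name = name
        self.will_succeed = will_succeed  # hypothetical: local transaction outcome
        self.state = "q"
        self.log = []

    def on_commit_request(self):
        if self.will_succeed:
            self.log += ["UNDO", "REDO"]  # log records written before agreeing
            self.state = "w"
            return "AGREED"
        self.state = "a"
        return "ABORT"

    def on_decision(self, message):
        # Idempotent on purpose: the coordinator may resend the decision.
        self.state = "a" if message == "ABORT" else "c"
        return "ACK"


def run_coordinator(cohorts, lose_first_ack=()):
    """Run both phases; return the decision and the coordinator's log."""
    log = []
    # Phase 1: send COMMIT_REQUEST and collect every reply.
    replies = [c.on_commit_request() for c in cohorts]
    if "ABORT" in replies:
        decision = "ABORT"
    else:
        log.append("COMMIT")
        decision = "COMMIT"

    # Phase 2: send the decision, resending until every cohort has ACKed.
    pending = list(cohorts)
    first_round = True
    while pending:
        unacked = []
        for cohort in pending:
            cohort.on_decision(decision)
            if first_round and cohort.name in lose_first_ack:
                unacked.append(cohort)    # ACK lost; resend after "timeout"
        pending = unacked
        first_round = False
    log.append("COMPLETE")
    return decision, log
```

Calling `run_coordinator([Cohort("A"), Cohort("B", will_succeed=False)])` yields an ABORT decision and leaves both cohorts in state `a`; with all cohorts succeeding, the coordinator logs COMMIT before COMPLETE even if some first-round ACKs are lost.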

Why is there provision for timeout and resending of messages in just one place? What about the other places where a site is awaiting a message? For example, what happens if a cohort failure causes a delay in replying to a COMMIT_REQUEST message? Should the coordinator time out? How should it recover?

Site failures:

2-Phase Commit Protocol

What is the difference between what we are accomplishing with a commit protocol in the current context and what we were accomplishing with the Byzantine Agreement protocols?

State Diagrams

The above code can be further abstracted to the following state transition diagrams.

The diagrams use the following notation:

The state transitions correspond to message send and receive events.

What is a finite automaton?

Synchronous Property

One site never gets more than one state transition ahead of the rest of the participating sites.

The 2-Phase Commit Protocol has this synchronous property. In each state a site waits for replies from all the sites to which it sent messages in the previous state transition, before it makes the transition to its next state.

This property makes it easier to analyze the effects of the protocol, because it reduces the number of global states (combinations of local states) we need to consider.
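
One way to make the property concrete is to check it against an execution trace. The sketch below assumes a trace is simply the sequence of site identifiers in the order the sites made their state transitions; the function name and the trace encoding are illustrative assumptions, not notation from the notes.

```python
def is_synchronous(trace, sites):
    """Check that no site ever leads another by more than one transition.

    `trace` lists, in order, the site that made each state transition.
    `sites` lists every participating site, including ones that have
    not yet made any transition.
    """
    counts = {site: 0 for site in sites}
    for site in trace:
        counts[site] += 1
        # Compare the most advanced site with the least advanced one.
        if max(counts.values()) - min(counts.values()) > 1:
            return False
    return True
```

For a coordinator 1 with cohorts 2 and 3, the trace `[1, 2, 3, 1, 2, 3]` (coordinator moves, both cohorts follow, repeat) satisfies the check, while `[1, 1, 2, 3]` fails it because the coordinator makes two transitions before any cohort makes one.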

Local Failures and Timeouts

We can augment the state diagram to include transitions for timeouts (T) and recovery after a local site failure (F), as follows.