Lecture #14: Recovery

Recovery in Distributed/Concurrent Systems

A Simple Method for taking a Consistent Set of Checkpoints

Synchronous Checkpointing Algorithm (Koo and Toueg)

Synchronous Checkpointing: Phase 1

Synchronous Checkpointing: Phase 2

Between taking a tentative checkpoint and the commit/abort of that checkpoint, a process must hold back its messages.

Does this guarantee we have a strongly consistent state? Can you construct an example that shows we can still have lost messages?
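The hold-back rule can be illustrated with a minimal sketch of one participant. This is only a toy model under stated assumptions: the class Process and its fields (tentative, permanent, held, sent) are hypothetical names, the coordinator and the network are left out, and "sending" just appends to a list.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Process:
    """Toy participant in a two-phase checkpoint (all names are hypothetical)."""
    name: str
    state: int = 0
    permanent: int = 0                 # last committed checkpoint
    tentative: Optional[int] = None    # set between phase 1 and phase 2
    held: list = field(default_factory=list)   # messages held back
    sent: list = field(default_factory=list)   # messages actually sent

    def take_tentative(self):
        # Phase 1: record the state, but do not make it permanent yet.
        self.tentative = self.state

    def send(self, msg):
        # While a tentative checkpoint is pending, hold the message back.
        if self.tentative is not None:
            self.held.append(msg)
        else:
            self.sent.append(msg)

    def decide(self, commit):
        # Phase 2: commit makes the tentative checkpoint permanent,
        # abort simply discards it.
        if commit:
            self.permanent = self.tentative
        self.tentative = None
        # Either way, the held messages are now released in order.
        self.sent.extend(self.held)
        self.held.clear()
```

A message passed to send() after take_tentative() stays in held until decide() is called, so no message sent after the tentative checkpoint can be received before the checkpoint is resolved.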

Synchronous Checkpointing: Properties

Checkpoints may be taken unnecessarily. Give an example.

Can these unnecessary checkpoints be avoided? A scheme is described in the book. Main idea

Rollback Recovery: Phase 1

Rollback Recovery: Phase 2

Between the request to roll back and the decision, no process sends any other messages.

Rollback Recovery: Properties

Unnecessary rollbacks can occur. A technique similar to the one used when taking checkpoints can eliminate them. Discuss.

Disadvantages of Synchronous Approach

These costs may seem high if failures between checkpoints are unlikely.

Asynchronous Approach

Why is the second approach called optimistic?

What are the advantages and disadvantages of each approach?

Juang & Venkatesan Asynchronous Checkpointing Algorithm

make some simplifying assumptions

basic idea:

Algorithm: all nodes run the same recovery algorithm (how can this be arranged?). At processor i:

In each iteration, at least one processor will roll back to its final recovery point, unless the current recovery points are already consistent.
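The iteration can be sketched as a small centralized simulation. It assumes, following the usual textbook presentation of this algorithm rather than anything stated in these notes, that every checkpoint records the number of messages sent to and received from each other processor, and that a processor must roll back whenever it has received more messages from a peer than that peer's recovery point remembers sending. The names ck and recovery_points are hypothetical, and the real algorithm exchanges rollback messages instead of reading all logs in one place.

```python
def ck(sent=None, rcvd=None):
    """One checkpoint record: cumulative message counts per peer."""
    return {"sent": sent or {}, "rcvd": rcvd or {}}


def recovery_points(logs):
    """logs[i] lists the checkpoints of process i, oldest first.

    Assumes logs[i][0] is the initial state with all counts zero.
    Returns {process: index of the checkpoint it ends up at}.
    """
    current = {i: len(cps) - 1 for i, cps in logs.items()}
    changed = True
    while changed:                      # repeat until nobody moves
        changed = False
        for i in logs:
            for j in logs:
                if i == j:
                    continue
                sent_by_j = logs[j][current[j]]["sent"].get(i, 0)
                # i holds orphan messages if it has received more from j
                # than j's recovery point remembers sending.
                while logs[i][current[i]]["rcvd"].get(j, 0) > sent_by_j:
                    current[i] -= 1
                    changed = True
    return current


# Process 0 lost the record of its second message to process 1.
logs = {
    0: [ck(), ck(sent={1: 1})],
    1: [ck(), ck(rcvd={0: 1}), ck(rcvd={0: 2}, sent={2: 1})],
    2: [ck(), ck(rcvd={1: 1})],
}
print(recovery_points(logs))
```

In this example process 1 rolls back one checkpoint because of the orphan message from process 0, which in turn orphans the message it sent to process 2, so the result is {0: 1, 1: 1, 2: 0}. The sketch loops until no processor moves instead of running a fixed number of iterations.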

UNIX file system and file system error recovery (fsck)

Unix Filesystem Structure

Information from M. J. Bach's The Design of the UNIX Operating System.

For fault tolerance, redundant copies of the superblock are stored on different cylinders and different platters of the disk drive. This reduces the chance that a disk media failure will corrupt the entire file system.

Contents of Superblock

Contents of an Inode

Relationship of inodes, Direct Blocks, and Indirect Blocks
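As a rough illustration of how a byte offset maps onto direct and indirect blocks, the sketch below uses the layout from Bach's example: 10 direct pointers followed by one single, one double, and one triple indirect pointer, with 1 KB blocks holding 256 block numbers each. These parameters are assumptions and differ between real file systems; the function name locate_block is hypothetical.

```python
def locate_block(offset, block_size=1024, n_direct=10, ptrs_per_block=256):
    """Return (level, indices) describing where the block for a byte offset
    is found, starting from the inode. Parameters are illustrative defaults."""
    lbn = offset // block_size          # logical block number within the file
    if lbn < n_direct:
        return ("direct", [lbn])
    lbn -= n_direct
    if lbn < ptrs_per_block:
        return ("single indirect", [lbn])
    lbn -= ptrs_per_block
    if lbn < ptrs_per_block ** 2:
        # index into the double indirect block, then into a single indirect block
        return ("double indirect", [lbn // ptrs_per_block, lbn % ptrs_per_block])
    lbn -= ptrs_per_block ** 2
    if lbn < ptrs_per_block ** 3:
        return ("triple indirect", [
            lbn // ptrs_per_block ** 2,
            (lbn // ptrs_per_block) % ptrs_per_block,
            lbn % ptrs_per_block,
        ])
    raise ValueError("offset beyond maximum file size")


print(locate_block(9000))      # ('direct', [8])
print(locate_block(350000))    # ('double indirect', [0, 75])
```

With these parameters, byte offset 350,000 falls in logical block 341, which is reached through entry 0 of the double indirect block and entry 75 of the single indirect block it points to.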

Disk Error Recovery: fsck

The program fsck checks a filesystem for inconsistencies, and then attempts to repair them.

  1. block belongs to more than one inode
  2. block belongs to an inode and is also on the free list of blocks
  3. block is not on free list and not in a file
  4. non-zero link count but not in any directories
  5. free inode found in directory
  6. in general: more/fewer directory links than link count
  7. format of inode incorrect
  8. count of free blocks in super block does not match the number on disk
  9. count of free inodes in super block does not match the number on disk
How could each of these situations arise?

How might each of these situations be repaired?
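Several of the block and link-count checks in the list above can be sketched over a toy in-memory model. This is not how fsck is implemented: the data layout (a dict of inodes, a flat list of directory entries, a free-block list) and the function name check are hypothetical, and every block is assumed to be a data block.

```python
from collections import Counter


def check(inodes, dir_entries, free_blocks, n_blocks):
    """Report inconsistencies in a toy file system model.

    inodes:      {inode number: {"links": int, "blocks": [block numbers]}}
    dir_entries: list of (name, inode number) over all directories
    free_blocks: block numbers on the free list
    """
    problems = []
    owners = Counter(b for ino in inodes.values() for b in ino["blocks"])
    free = set(free_blocks)
    for b in range(n_blocks):
        if owners[b] > 1:
            problems.append(f"block {b} is claimed {owners[b]} times")
        if owners[b] >= 1 and b in free:
            problems.append(f"block {b} is in a file and on the free list")
        if owners[b] == 0 and b not in free:
            problems.append(f"block {b} is neither in a file nor on the free list")
    refs = Counter(num for _, num in dir_entries)
    for num, ino in sorted(inodes.items()):
        if ino["links"] != refs[num]:
            problems.append(
                f"inode {num}: link count {ino['links']}, "
                f"but {refs[num]} directory entries"
            )
    for num in sorted(set(refs) - set(inodes)):
        problems.append(f"directory entry refers to free inode {num}")
    return problems


inodes = {1: {"links": 1, "blocks": [0, 1]}, 2: {"links": 2, "blocks": [1, 2]}}
dir_entries = [("a", 1), ("b", 2), ("c", 3)]
for line in check(inodes, dir_entries, free_blocks=[2, 3], n_blocks=5):
    print(line)
```

The example deliberately contains one instance each of a doubly claimed block, a block that is both allocated and free, a lost block, a link-count mismatch, and a directory entry pointing at a free inode. Detecting a problem is the easy part; choosing a repair is the subject of the second question above.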