Lecture #13: Recovery

These topics are from Chapter 12 (Recovery) in Advanced Concepts in OS.


Topics in this Chapter

Terminology

Recovery

Failure Classification

What are some causes for each?

Tolerating Process Failures

What are some situations where each is appropriate?

Recovering from System Failures

Tolerating Secondary Storage Failures

Tolerating Communication Medium Failures

Backward versus Forward Error Recovery

Backward Error Recovery

System Model

Stable Storage

Two Approaches to Fault Tolerance

Practical systems employ a combination of the two approaches, e.g., logging with periodic full-DB snapshots for archive.

Fundamental Issues in Crash Recovery

The textbook jumps right into the problem of supporting crash recovery, without first reviewing any basic transaction models. The following two models are more basic than those mentioned in the text.

Basic Deferred-Update Model

Basic Update-In-Place Model

What provides for disk crash recovery?

Extended Update-In-Place Model

Where does the stable storage fit in?

Crash Recovery with Update-In-Place

We now have a way to reconstruct the DB system in the event of a crash, starting from an archived snapshot and the subsequent log:

If we are starting with a snapshot, why do we need to worry about active, uncommitted, and aborted transactions?

Problem: DB write before log write

There is a defect in the above scheme.

How to solve?

Solution: Write-Ahead-Log

Before a block is written to DB disk, make sure the corresponding undo record is completely written to the log disk.

The log must be forced to disk as part of committing a transaction.
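
The two rules above can be illustrated with a minimal sketch. It is not taken from the textbook: the class names Log and Database, the in-memory lists standing in for the log disk and the DB disk, and the record layout are all assumptions made for illustration only.

```python
from dataclasses import dataclass, field


@dataclass
class Log:
    """Hypothetical log: records stay in a buffer until forced to the log disk."""
    buffer: list = field(default_factory=list)  # records not yet on the log disk
    disk: list = field(default_factory=list)    # records safely on the log disk

    def append(self, record):
        self.buffer.append(record)

    def force(self):
        # Move every buffered record to the log disk, preserving order.
        self.disk.extend(self.buffer)
        self.buffer.clear()


@dataclass
class Database:
    """Hypothetical update-in-place DB with a block cache (an assumption)."""
    log: Log
    cache: dict = field(default_factory=dict)   # in-memory copies of blocks
    disk: dict = field(default_factory=dict)    # blocks on the DB disk

    def update(self, txn, block, new_value):
        old_value = self.cache.get(block, self.disk.get(block))
        # Log the old (undo) and new (redo) values before changing the block.
        self.log.append(("update", txn, block, old_value, new_value))
        self.cache[block] = new_value

    def write_block(self, block):
        # Write-ahead rule: the undo record reaches the log disk first.
        self.log.force()
        self.disk[block] = self.cache[block]

    def commit(self, txn):
        self.log.append(("commit", txn))
        # The log is forced to disk as part of committing.
        self.log.force()
```

In this sketch write_block forces the whole log buffer, which is more than the rule strictly requires; a real system would typically force only up to the record that covers the block being written. Note also that commit forces the log but does not write any DB block, so a committed update may exist only in the log until its block is written later.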

Crash Recovery with Write-Ahead-Log

State Based Approach

Problems in Distributed/Concurrent Systems

Orphan Messages

Note the domino effect if Z is rolled back.

Lost Messages

What is the difference between a lost message and an orphan message?

Livelock

These all motivate the need for coordinating checkpoints and recovery.

Strongly Consistent Set of Checkpoints

There is no information flow between any processes in the set during the time interval spanned by the checkpoints, i.e., no messages in transit.

Consistent Set of Checkpoints

There may be information flow between the processes, but each message recorded as received should be recorded as sent. That is, there are no orphan messages.
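
The two definitions can be compared with a small sketch. It assumes a simplified model that is not from the textbook: each process's checkpoint records the set of message ids it has sent and the set it has received, message ids are globally unique, and the function names are hypothetical.

```python
def orphan_messages(sent, received):
    """Ids recorded as received by some checkpoint but not recorded as sent."""
    all_sent = set().union(*sent.values())
    all_received = set().union(*received.values())
    return all_received - all_sent


def in_transit_messages(sent, received):
    """Ids recorded as sent by some checkpoint but not recorded as received."""
    all_sent = set().union(*sent.values())
    all_received = set().union(*received.values())
    return all_sent - all_received


def is_consistent(sent, received):
    # Consistent: every message recorded as received is also recorded as sent.
    return not orphan_messages(sent, received)


def is_strongly_consistent(sent, received):
    # Strongly consistent (in this model): no orphans and nothing in transit.
    return is_consistent(sent, received) and not in_transit_messages(sent, received)


# Checkpoints of X and Y: m1 was sent and received; m2 was sent by Y
# but X's checkpoint was taken before m2 arrived.
sent = {"X": {"m1"}, "Y": {"m2"}}
received = {"X": set(), "Y": {"m1"}}
```

With these example checkpoints, the set is consistent but not strongly consistent, because m2 is recorded as sent and not as received. If Y's checkpoint instead recorded receiving a message that no checkpoint records as sent, that message would be an orphan and the set would not be consistent.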

What is the remaining problem here?


Observations