Lecture #3: Theoretical Foundations --
Clocks in a Distributed Environment
Topics for today
- Some inherent limitations of a distributed system and their implication.
- Lamport logical clocks
- Vector clocks
These topics are from Chapter 5-5.4 in Advanced Concepts in
OS.
Distributed systems
- A collection of computers that share neither a common clock nor a common
memory
- Processes in a distributed system exchange information over communication channels, and the message delay is unpredictable.
Inherent limitations of a distributed system
Absence of a global clock
Distributed processes cannot rely on having an accurate
view of global state, due to transmission delays.
Effectively, we cannot talk meaningfully about global state.
The traditional notions of "time" and "state" do not work in distributed
systems. We need to develop concepts that correspond to "time"
and "state" in a uniprocessor system.
Lamport's logical clocks
- the "time" concept in distributed systems -- used to order events in a distributed system.
- assumption:
- the execution of a process is characterized by a sequence of events.
An event can be the execution of one instruction or of one procedure.
- sending a message is one event, receiving a message is one event.
- The events in a distributed system are not totally unordered. Under some
conditions, it is possible to ascertain the order of two events. Lamport's
logical clocks try to capture this.
Lamport's "happened before" relation
The "happened before" relation (→) is defined as follows:
- A → B if A and B are within the same process
(same sequential thread of control) and A occurred before B.
- A → B if A is the event of sending a message M
in one process and B is the event of receiving M by another
process
- if A → B and B → C then A → C
Event A can causally affect event B iff A → B.
Distinct events A and B are concurrent (A || B) if we have neither
A → B nor B → A.
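The three rules above can be checked mechanically on a small trace. The following is a minimal sketch, assuming a hypothetical two-process trace; the event names (a1, a2, b1, b2) and the helpers happened_before and concurrent are illustrative and do not come from the text.

```python
from itertools import product

# Hypothetical trace: P1 executes a1 then a2 (a2 sends M);
# P2 executes b1 then b2 (b2 receives M).
process_order = [("a1", "a2"), ("b1", "b2")]  # rule 1: order within a process
message_edges = [("a2", "b2")]                # rule 2: send -> receive


def happened_before(edges):
    """Return the transitive closure (rule 3) of the direct edges."""
    closure = set(edges)
    changed = True
    while changed:
        changed = False
        # Iterate over a snapshot so the set can grow inside the loop.
        for (a, b), (c, d) in product(list(closure), repeat=2):
            if b == c and (a, d) not in closure:
                closure.add((a, d))
                changed = True
    return closure


hb = happened_before(process_order + message_edges)


def concurrent(x, y):
    """Distinct events are concurrent if neither happened before the other."""
    return x != y and (x, y) not in hb and (y, x) not in hb


print(("a1", "b2") in hb)      # a1 -> a2 -> b2 by transitivity
print(concurrent("a1", "b1"))  # no chain connects a1 and b1
```

The closure is computed by brute force, which is fine for a handful of events but would not scale to a long trace.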
Lamport Logical Clocks
- are local to each process (processor?)
- do not measure real time
- only measure "events"
- are consistent with the happened-before relation
- are useful for totally ordering transactions,
by using logical clock values as timestamps
Logical Clock Conditions
C_{i} is the local clock for process P_{i}
- if a and b are two successive events in P_{i}, then
C_{i}(b) = C_{i}(a) + d_{1}, where d_{1} > 0
- if a is the sending of message m by P_{i}, then
m is assigned timestamp t_{m} = C_{i}(a)
- if b is the receipt of m by P_{j},
then
C_{j}(b) = max{C_{j}(b), t_{m} + d_{2}}, where d_{2} > 0
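The conditions above can be sketched as a small class. This is one possible reading, not a definitive implementation: the class name LamportClock and its method names are assumptions, and the receive rule is interpreted as "advance by d_{1} for the receive event, then take the max with t_{m} + d_{2}".

```python
class LamportClock:
    """Sketch of a Lamport logical clock for one process (names are illustrative)."""

    def __init__(self, d1=1, d2=1):
        assert d1 > 0 and d2 > 0
        self.time = 0
        self.d1 = d1
        self.d2 = d2

    def tick(self):
        """Local event: C(b) = C(a) + d1."""
        self.time += self.d1
        return self.time

    def send(self):
        """Sending is an event; its clock value is the timestamp t_m."""
        return self.tick()

    def receive(self, t_m):
        """Receipt is an event; then C(b) = max(C(b), t_m + d2)."""
        self.tick()
        self.time = max(self.time, t_m + self.d2)
        return self.time


p, q = LamportClock(), LamportClock()
p.tick()              # p.time == 1
t_m = p.send()        # p.time == 2, message carries 2
q.receive(t_m)        # q.time == max(0 + 1, 2 + 1) == 3
reply = q.send()      # q.time == 4
p.receive(reply)      # p.time == max(2 + 1, 4 + 1) == 5
```

With this rule the receive event always gets a larger clock value than the matching send event, which is what consistency with the happened-before relation requires.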
Logical Clock Conditions
The values of d_{1} and d_{2} could be 1, or they could approximate
the elapsed real time. For example, we could take d_{1} to
be the elapsed local time, and d_{2} to be the estimated
message transmission time. The latter solves the problem of
waiting forever for a virtual time instant to pass.
Total Ordering
We can extend the partial ordering of the happened-before
relation to a total ordering on events, by using the logical
clocks and resolving any ties by an arbitrary rule based on the
processor/process ID.
If a is an event in P_{i} and b is in P_{j}, a ⇒ b iff
- C_{i}(a) < C_{j}(b) or
- C_{i}(a) = C_{j}(b) and P_{i} < P_{j}
where < is an arbitrary total ordering
of the processes
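The tie-breaking rule above is a lexicographic comparison on (clock value, process ID) pairs. A minimal sketch, assuming integer process IDs ordered numerically; the event names and clock values are made up for illustration.

```python
# Each event is tagged with (Lamport clock value, process id).
# Python compares tuples lexicographically, which matches the rule:
# smaller clock first, ties broken by process id.
events = {
    "a": (2, 1),
    "b": (1, 2),
    "c": (2, 0),
    "d": (1, 1),
}


def precedes(x, y):
    """True if event x comes before event y in the total order."""
    return events[x] < events[y]


total_order = sorted(events, key=lambda name: events[name])
print(total_order)  # ['d', 'b', 'c', 'a']
```

Any two distinct events are comparable under this order, but the order between concurrent events is arbitrary.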
How useful is this? How close to real time?
Example of Lamport Logical Clocks
C(a) < C(b) does not imply a → b
That is, the ordering given by Lamport's clocks does not
guarantee that two events ordered by their clock values
are also causally related.
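A tiny worked case makes this concrete. The sketch below assumes two hypothetical processes that never exchange messages and that both use an increment of 1; the variable names are illustrative.

```python
# P1 and P2 never communicate, so no event on P1 happened before
# any event on P2 (and vice versa): all cross-process pairs are concurrent.
p1_clock = 0
p2_clock = 0

p1_clock += 1      # event a on P1
C_a = p1_clock     # C(a) = 1

p2_clock += 1      # an earlier, unrelated event on P2
p2_clock += 1      # event b on P2
C_b = p2_clock     # C(b) = 2

# The clock values are ordered even though a and b are concurrent.
clocks_say_a_first = C_a < C_b
print(clocks_say_a_first)  # True
```

The clock comparison alone cannot distinguish this case from one where a really did happen before b.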
The following Vector Clock scheme is intended to improve on this.
Vector Clocks
- Clock values are vectors
- Vector length is n, the number of processes
- C_{i}[i](a) = local time of P_{i} at event a
- C_{i}[j](a) = time C_{j}[j](b) of last event b at P_{j}
that is known to happen before local event a
Vector Clock Algorithm
- if a and b are successive events in P_{i}, then
C_{i}[i](b) = C_{i}[i](a) + d_{1}
- if a is the sending of m by P_{i} with vector timestamp t_{m}, and
b is the receipt of m by P_{j}, then for all k
C_{j}[k](b) = max{C_{j}[k](b), t_{m}[k]}
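The two rules can be sketched as follows. This is an illustrative reading, not the textbook's code: the class name VectorClock and its methods are assumptions, and the receive event is taken to advance the local component by d_{1} before the component-wise max.

```python
class VectorClock:
    """Sketch of a vector clock for process i out of n (names are illustrative)."""

    def __init__(self, n, i, d1=1):
        self.v = [0] * n
        self.i = i
        self.d1 = d1

    def tick(self):
        """Local event: only the process's own component advances."""
        self.v[self.i] += self.d1
        return list(self.v)

    def send(self):
        """Sending is an event; the returned copy is the timestamp t_m."""
        return self.tick()

    def receive(self, t_m):
        """Receipt is an event; then take the component-wise maximum."""
        self.tick()
        self.v = [max(own, seen) for own, seen in zip(self.v, t_m)]
        return list(self.v)


p0, p1, p2 = (VectorClock(3, i) for i in range(3))
m1 = p0.send()         # [1, 0, 0]
p1.receive(m1)         # [1, 1, 0]
m2 = p1.send()         # [1, 2, 0]
p2.receive(m2)         # [1, 2, 1]
```

After the last step P2 knows about one event at P0 without ever having heard from P0 directly, which is the point of carrying the whole vector on each message.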
Vector Clock Ordering Relation
- t = t' ⇔ ∀i t[i] = t'[i]
- t ≠ t' ⇔ ∃i t[i] ≠ t'[i]
- t ≤ t' ⇔ ∀i t[i] ≤ t'[i]
- t < t' ⇔ (t ≤ t' and t ≠ t')
- t || t' ⇔ not (t < t' or t' < t)
The relation ≤ defined above is a partial ordering.
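These comparisons translate directly into code. A minimal sketch, assuming timestamps are equal-length Python lists; the function names vc_le, vc_lt and vc_concurrent are illustrative.

```python
def vc_le(t, u):
    """t <= u: every component of t is at most the matching component of u."""
    return all(a <= b for a, b in zip(t, u))


def vc_lt(t, u):
    """t < u: t <= u and the vectors differ in at least one component."""
    return vc_le(t, u) and t != u


def vc_concurrent(t, u):
    """t || u: neither vector is strictly less than the other."""
    return not (vc_lt(t, u) or vc_lt(u, t))


print(vc_lt([1, 0, 0], [1, 1, 0]))          # True
print(vc_concurrent([1, 0, 0], [0, 1, 0]))  # True
print(vc_lt([1, 2, 0], [1, 1, 3]))          # False: 2 > 1 in the second slot
```

Each comparison inspects all n components, so its cost grows linearly with the number of processes.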
Vector Clocks
- a → b if t^{a} < t^{b}
- b → a if t^{b} < t^{a}
- otherwise a and b are concurrent
This is not a total ordering, but it is sufficient
to guarantee a causal relationship, i.e.,
a → b iff t^{a} < t^{b}
How scalable is this?
Figure 5.5 in the book.
Non-causal Ordering of Messages
Message delivery is said to be causal if the order in which
messages are received is consistent with the order in which they
are sent. That is, if Send(M_{1}) → Send(M_{2}) then
every recipient of both messages receives M_{1} before
M_{2}.
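One way to test a recorded trace against this definition is to compare the send timestamps of the messages one recipient received. The sketch below assumes each message carries the vector timestamp of its send event, so that a strictly smaller timestamp means an earlier send; the function names and the sample trace are hypothetical.

```python
def vc_lt(t, u):
    """Strict vector-timestamp order: t <= u component-wise and t != u."""
    return all(a <= b for a, b in zip(t, u)) and t != u


def causal_violations(delivery_order):
    """Return (earlier, later) name pairs that one recipient got out of causal order.

    delivery_order: list of (name, send_timestamp) in order of receipt.
    """
    bad = []
    for idx, (name1, t1) in enumerate(delivery_order):
        for name2, t2 in delivery_order[idx + 1:]:
            if vc_lt(t2, t1):  # Send(name2) -> Send(name1), yet name1 came first
                bad.append((name1, name2))
    return bad


in_order = [("M1", [1, 0, 0]), ("M2", [1, 1, 0])]
out_of_order = [("M2", [1, 1, 0]), ("M1", [1, 0, 0])]

print(causal_violations(in_order))      # []
print(causal_violations(out_of_order))  # [('M2', 'M1')]
```

Messages with concurrent send timestamps are never reported, since causal delivery places no constraint on their relative order.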
Enforcing Causal Ordering of Messages
Basic idea: Buffer each message until the message that
immediately precedes it is delivered.
The text describes two protocols for implementing this idea:
- Birman-Schiper-Stephenson: assumes all messages are broadcast
- Schiper-Eggli-Sandoz: does not have this restriction
Note: These methods serialize the actions of the
system. That makes the behavior more predictable, but also
may mean loss of performance, due to idle time.
That, plus scaling problems, means these algorithms
are not likely to be of much use for high-performance
computing.
Birman-Schiper-Stephenson Causal Message Ordering
- Before P_{i} broadcasts m, it increments VT_{Pi}[i]
and timestamps m. Thus VT_{Pi}[i]-1 is the number of
messages from P_{i} preceding m.
- When P_{j} (j ≠ i) receives message
m with timestamp VT_{m} from P_{i}, delivery is delayed locally
until both of the following are satisfied:
- VT_{Pj}[i] = VT_{m}[i] - 1
- VT_{Pj}[k] ≥ VT_{m}[k] for all k ≠ i
Delayed messages are queued at each process, sorted by their
vector timestamps, with concurrent messages ordered by time of
receipt.
- When m is delivered to P_{j},
VT_{Pj} is updated as usual for vector clocks.
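The delivery rule can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the protocol as published: the class BSSProcess and its methods are hypothetical names, the network is replaced by direct method calls, and the buffer is rescanned after every delivery instead of being kept sorted by timestamp.

```python
class BSSProcess:
    """Sketch of one process in a broadcast-only causal delivery scheme."""

    def __init__(self, pid, n):
        self.pid = pid
        self.vt = [0] * n      # vt[i] counts broadcasts from P_i delivered here
        self.pending = []      # buffered (sender, vt_m, payload) messages
        self.delivered = []    # payloads in delivery order

    def broadcast(self, payload):
        """Increment own entry, then timestamp the message."""
        self.vt[self.pid] += 1
        return (self.pid, list(self.vt), payload)

    def _deliverable(self, sender, vt_m):
        next_from_sender = self.vt[sender] == vt_m[sender] - 1
        seen_its_past = all(self.vt[k] >= vt_m[k]
                            for k in range(len(self.vt)) if k != sender)
        return next_from_sender and seen_its_past

    def receive(self, msg):
        """Buffer msg, then deliver every buffered message whose conditions hold."""
        self.pending.append(msg)
        progress = True
        while progress:
            progress = False
            for m in list(self.pending):   # snapshot; pending shrinks below
                sender, vt_m, payload = m
                if self._deliverable(sender, vt_m):
                    self.pending.remove(m)
                    self.delivered.append(payload)
                    self.vt = [max(a, b) for a, b in zip(self.vt, vt_m)]
                    progress = True


p0, p1, p2 = (BSSProcess(i, 3) for i in range(3))
m1 = p0.broadcast("m1")   # timestamp [1, 0, 0]
p1.receive(m1)            # delivered at once
m2 = p1.broadcast("m2")   # timestamp [1, 1, 0], causally after m1
p2.receive(m2)            # buffered: P2 has not yet delivered m1
p2.receive(m1)            # m1 is delivered, which unblocks m2
print(p2.delivered)       # ['m1', 'm2']
```

The example delivers m2 to P2 before m1 on purpose, to show the buffer holding a message until its causal predecessor has been delivered.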
Schiper-Eggli-Sandoz Protocol
Generalizes the above, so that messages do not need to be
broadcast, but are just sent between pairs of processes, and
the communication channels do not need to be FIFO.
How would you implement and test the above algorithms?