ICS Theory Group

ICS 269, Winter 1999: Theory Seminar


12 February 1999:
A Survey of Rollback-Recovery Protocols in Message-Passing Systems, by E.N. Elnozahy, D.B. Johnson, and Y.M. Wang
Thuan Do, ICS, UC Irvine

Abstract: The problem of rollback-recovery in message-passing systems has undergone extensive study. In this survey, we review rollback-recovery techniques that do not require special language constructs, and classify them into two primary categories. Checkpoint-based rollback-recovery relies solely on checkpointed states for system state restoration. Depending on when checkpoints are taken, existing approaches can be divided into uncoordinated checkpointing, coordinated checkpointing and communication-induced checkpointing. Log-based rollback-recovery uses checkpointing and message logging. There are three different log-based approaches, namely, pessimistic logging, optimistic logging and causal logging. We identify a set of desirable properties of rollback-recovery protocols, and compare different approaches with respect to these properties. Log-based rollback-recovery protocols generally rely on the assumption of piecewise determinism and pay additional overhead to allow faster output commits and more localized recovery. We present research issues under each approach, and review existing solutions to address them. We also present implementation issues of checkpointing and message logging.