
Application checkpointing

'''Checkpointing''' is a technique for adding fault tolerance to computing systems. It consists of storing a snapshot of the current application state and later using it to restart execution in case of failure.

Contents
Checkpointing techniques properties
Checkpointing in distributed shared memory systems
Practical implementations of Checkpointing for Linux/UNIX
References

Checkpointing techniques properties


There are many different approaches to and techniques for application checkpointing. Depending on its specific implementation, a tool can be classified according to several properties:

★ ''Amount of state saved'': This property refers to the abstraction level used by the technique to analyze an application. It can range from treating each application as a black box, and hence storing all application data, to selecting only the relevant portions of the data in order to achieve more efficient and portable operation.

★ ''Automation level'': The effort needed to achieve fault tolerance through the use of a specific checkpointing solution.

★ ''Portability'': Whether or not the saved state can be used on different machines to restart the application.

★ ''System architecture'': How the checkpointing technique is implemented: inside a library, by the compiler, or at the operating system level.

Each design decision affects the properties and efficiency of the final product. For instance, storing the entire application state allows a more straightforward implementation, since no analysis of the application is needed, but it prevents the generated state files from being portable, because a number of non-portable structures (such as the application stack or heap) are stored along with the application data.
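
The trade-off between saving everything and saving only selected data can be illustrated with a minimal application-level sketch. It assumes the relevant state is a small dictionary that can be serialized as JSON, which keeps the checkpoint file portable across machines. The function names (`save_checkpoint`, `load_checkpoint`, `run_sum`) and the `fail_at` parameter are hypothetical and exist only for this illustration; they are not taken from any checkpointing package.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write `state` (a JSON-serializable dict) to `path` atomically."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(state, f)
            f.flush()
            os.fsync(f.fileno())
        # Replace the old checkpoint in one step, so a crash during the
        # write never leaves a half-written checkpoint at `path`.
        os.replace(tmp_path, path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def load_checkpoint(path, default):
    """Return the saved state, or `default` if no checkpoint exists yet."""
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return default


def run_sum(n, path, fail_at=None):
    """Sum 0..n-1, checkpointing after every step.

    `fail_at` is a hypothetical knob that simulates a crash at step `fail_at`.
    """
    state = load_checkpoint(path, {"i": 0, "total": 0})
    while state["i"] < n:
        if fail_at is not None and state["i"] == fail_at:
            raise RuntimeError("simulated failure")
        state["total"] += state["i"]
        state["i"] += 1
        save_checkpoint(path, state)
    return state["total"]
```

Calling `run_sum` again with the same `path` after a failure resumes from the last saved step instead of starting over. Only the loop counter and the running total are saved, so the programmer has to decide what counts as relevant state; a black-box approach would avoid that decision at the cost of portability.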

Checkpointing in distributed shared memory systems


In distributed shared memory systems, checkpointing helps tolerate failures that would otherwise destroy the work done by long-running applications. The main property that checkpointing techniques must provide in such systems is the preservation of system consistency in case of failure. There are two main approaches to checkpointing in such systems: coordinated checkpointing, in which all cooperating processes work together to establish a coherent checkpoint; and communication-induced (also called dependency-induced) independent checkpointing.
It must be stressed that simply forcing processes to checkpoint their state at fixed time intervals is not sufficient to ensure global consistency. Even if we postulate the existence of a global clock, the checkpoints made by different processes still may not form a consistent state. The need to establish a consistent state may force other processes to roll back to their checkpoints, which in turn may cause other processes to roll back to even earlier checkpoints; in the most extreme case, the only consistent state found is the initial state (the so-called ''domino effect'').
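
The domino effect can be reproduced with a small simulation. The sketch below makes several simplifying assumptions: each process numbers its own events 1, 2, 3, and so on; a checkpoint taken at local time t captures every event up to and including t; and after a failure all processes restart from checkpoints. A set of checkpoints is treated as inconsistent if it contains an orphan message, that is, a message recorded as received but not as sent. The names `Message`, `orphans` and `recovery_line` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Message:
    sender: str
    send_time: int   # sender's local event number when the message is sent
    receiver: str
    recv_time: int   # receiver's local event number when it is received


def orphans(cut, messages):
    """Messages that the cut records as received but not as sent."""
    return [m for m in messages
            if m.recv_time <= cut[m.receiver] and m.send_time > cut[m.sender]]


def recovery_line(checkpoints, messages):
    """Find the latest orphan-free set of checkpoints by rolling back.

    `checkpoints` maps each process to a sorted list of local checkpoint
    times; every list must start with 0 (the initial state).
    """
    index = {p: len(times) - 1 for p, times in checkpoints.items()}
    while True:
        cut = {p: checkpoints[p][i] for p, i in index.items()}
        bad = orphans(cut, messages)
        if not bad:
            return cut
        for m in bad:
            # The receiver must forget the orphan message, so it rolls
            # back to a checkpoint taken before the receive event.
            while checkpoints[m.receiver][index[m.receiver]] >= m.recv_time:
                index[m.receiver] -= 1


# Two processes exchange messages between staggered checkpoints.
messages = [
    Message("P", 1, "Q", 2),
    Message("Q", 3, "P", 2),
    Message("P", 3, "Q", 4),
    Message("Q", 5, "P", 4),
]
checkpoints = {"P": [0, 2, 4], "Q": [0, 2, 4]}
line = recovery_line(checkpoints, messages)
```

In this example every rollback by one process turns a message it had sent into an orphan at the other process, so `line` ends up at the initial state for both P and Q even though each process took two later checkpoints.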
In the coordinated checkpointing approach, processes must ensure that their checkpoints are consistent. This is usually achieved by some kind of two-phase commit algorithm. In communication-induced checkpointing, each process checkpoints its own state independently whenever this state is exposed to other processes (for example, whenever a remote process reads a page written by the local process).
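
The two-phase structure of coordinated checkpointing can be sketched as an in-memory simulation. This is only an outline under strong assumptions: there is a single coordinator, no messages are in flight while the protocol runs, and a process that cannot checkpoint simply votes no. The `Process` class, its `can_checkpoint` flag and the `coordinated_checkpoint` function are hypothetical names used for illustration.

```python
class Process:
    def __init__(self, name, state, can_checkpoint=True):
        self.name = name
        self.state = state
        self.can_checkpoint = can_checkpoint  # hypothetical failure switch
        self.tentative = None   # checkpoint taken in phase 1
        self.permanent = None   # last committed checkpoint

    def prepare(self):
        """Phase 1: take a tentative checkpoint and vote yes or no."""
        if not self.can_checkpoint:
            return False
        self.tentative = dict(self.state)
        return True

    def commit(self):
        """Phase 2 (success): make the tentative checkpoint permanent."""
        self.permanent, self.tentative = self.tentative, None

    def abort(self):
        """Phase 2 (failure): discard the tentative checkpoint."""
        self.tentative = None


def coordinated_checkpoint(processes):
    """Commit a new global checkpoint only if every process voted yes."""
    votes = [p.prepare() for p in processes]
    if all(votes):
        for p in processes:
            p.commit()
        return True
    for p in processes:
        p.abort()
    return False
```

Because a tentative checkpoint replaces the permanent one only after every process has voted yes, the set of permanent checkpoints always comes from the same round. A real protocol must also handle messages that are in transit between the two phases, which this sketch omits.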
The system state may be saved either locally, in stable storage, or in a distant node's memory.
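
The communication-induced approach described above can be sketched in the same spirit. The example assumes a node that tracks which of its pages have been written since its last checkpoint and that saves its state in local memory; `Node`, `write` and `remote_read` are hypothetical names, not part of any real distributed shared memory system.

```python
import copy


class Node:
    def __init__(self):
        self.pages = {}         # page number -> contents
        self.dirty = set()      # pages written since the last checkpoint
        self.checkpoints = []   # saved snapshots of self.pages

    def write(self, page, value):
        """Local write: the page now holds state not yet checkpointed."""
        self.pages[page] = value
        self.dirty.add(page)

    def checkpoint(self):
        self.checkpoints.append(copy.deepcopy(self.pages))
        self.dirty.clear()

    def remote_read(self, page):
        """Serve a read from another node.

        If the page exposes state that has not been saved yet, checkpoint
        first, so the reader never depends on state that could be lost.
        """
        if page in self.dirty:
            self.checkpoint()
        return self.pages[page]
```

Purely local writes never trigger a checkpoint here; only the first remote read of a modified page does. Replacing the in-memory list with stable storage or with a copy sent to another node would change where the state is saved without changing when checkpoints are taken.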

Practical implementations of Checkpointing for Linux/UNIX


A number of practical checkpointing packages have been developed for the Linux/UNIX family of operating systems. These packages may be divided into two classes: those which operate in user space and those which operate in the kernel. Examples of the former include the checkpointing package used by Condor and the portable checkpointing library developed by the University of Tennessee. User-space checkpointing packages are highly portable and can typically be compiled and run on any modern UNIX (e.g. Linux, FreeBSD, OpenBSD, Darwin). In contrast, kernel-based checkpointing packages such as Chpox and Cryopid, and the checkpointing algorithms developed for the MOSIX cluster computing environment, tend to be highly operating system dependent. Most kernel-based checkpointing packages developed to date run under either the 2.4 or 2.6 subfamilies of the Linux kernel on i686 architectures.
Modern checkpointing packages such as Cryopid are capable of checkpointing a '''process pod''', that is, a parent process and all its associated children, and of dealing with file system abstractions such as sockets and pipes (FIFOs) in addition to regular files. In the case of Cryopid, there is also provision to roll all dynamic libraries, open files, sockets and FIFOs associated with the process into the checkpoint. This is very useful when the checkpointed process is to be restarted in a heterogeneous environment (e.g. when the machine on which the checkpoint is restarted has libraries and a file system that differ from those of the host on which the process was checkpointed).

References



★ E.N. Elnozahy, L. Alvisi, Y-M. Wang, and D.B. Johnson, "A survey of rollback-recovery protocols in message-passing systems", ''ACM Comput. Surv.'', vol. 34, no. 3, pp. 375-408, 2002.

The Home of Checkpointing Packages

★ Y. Ling, J. Mi, and X. Lin, "A variational calculus approach to optimal checkpoint placement", ''IEEE Trans. Computers'', vol. 50, no. 7, pp. 699-708, 2001.

This article is provided by Wikipedia.
