This timely text presents a comprehensive overview of fault
tolerance techniques for high-performance computing (HPC). The text
opens with a detailed introduction to checkpoint protocols and
scheduling algorithms, prediction, replication, and silent error
detection and correction, together with application-specific
techniques such as algorithm-based fault tolerance (ABFT). Emphasis is
placed on analytical performance models. This is followed by a review of
general-purpose techniques, including several checkpoint and
rollback recovery protocols. Relevant execution scenarios are also
evaluated and compared through quantitative models. Features:
provides a survey of resilience methods and performance models;
examines the various sources of errors and faults in large-scale
systems; reviews the spectrum of techniques that can be applied to
design a fault-tolerant MPI; investigates different approaches to
replication; discusses the challenge posed by the energy consumption of
fault-tolerance methods in extreme-scale systems.