Fault tolerance in petascale/exascale systems: Current knowledge, challenges and research opportunities

F Cappello - The International Journal of High Performance …, 2009 - journals.sagepub.com
The emergence of petascale systems and the promise of future exascale systems have
reinvigorated the community interest in how to manage failures in such systems and ensure …

Addressing failures in exascale computing

M Snir, RW Wisniewski, JA Abraham… - … Journal of High …, 2014 - journals.sagepub.com
We present here a report produced by a workshop on 'Addressing failures in exascale
computing' held in Park City, Utah, 4–11 August 2012. The charter of this workshop was to …

Open MPI: Goals, concept, and design of a next generation MPI implementation

E Gabriel, GE Fagg, G Bosilca, T Angskun… - Recent Advances in …, 2004 - Springer
A large number of MPI implementations are currently available, each of which emphasizes
different aspects of high-performance computing or are intended to solve a specific research …

A survey of fault tolerance mechanisms and checkpoint/restart implementations for high performance computing systems

IP Egwutuoha, D Levy, B Selic, S Chen - The Journal of Supercomputing, 2013 - Springer
Abstract In recent years, High Performance Computing (HPC) systems have been shifting
from expensive massively parallel architectures to clusters of commodity PCs to take …

DMTCP: Transparent checkpointing for cluster computations and the desktop

J Ansel, K Arya, G Cooperman - 2009 IEEE international …, 2009 - ieeexplore.ieee.org
DMTCP (distributed multithreaded checkpointing) is a transparent user-level checkpointing
package for distributed applications. Checkpointing and restart is demonstrated for a wide …

Toward exascale resilience

F Cappello, A Geist, B Gropp, L Kale… - … Journal of High …, 2009 - journals.sagepub.com
Over the past few years resilience has become a major issue for high-performance
computing (HPC) systems, in particular in the perspective of large petascale systems and …

Proactive fault tolerance for HPC with Xen virtualization

AB Nagarajan, F Mueller, C Engelmann… - Proceedings of the 21st …, 2007 - dl.acm.org
Large-scale parallel computing is relying increasingly on clusters with thousands of
processors. At such large counts of compute nodes, faults are becoming commonplace …

Reducing costs of spot instances via checkpointing in the amazon elastic compute cloud

S Yi, D Kondo, A Andrzejak - 2010 IEEE 3rd International …, 2010 - ieeexplore.ieee.org
Recently introduced spot instances in the Amazon Elastic Compute Cloud (EC2) offer lower
resource costs in exchange for reduced reliability; these instances can be revoked abruptly …

Open MPI: A flexible high performance MPI

RL Graham, TS Woodall, JM Squyres - Parallel Processing and Applied …, 2006 - Springer
A large number of MPI implementations are currently available, each of which emphasizes
different aspects of high-performance computing or are intended to solve a specific research …

The trouble with memes: Inference versus imitation in cultural creation

S Atran - Human nature, 2001 - Springer
Memes are hypothetical cultural units passed on by imitation; although nonbiological, they
undergo Darwinian selection like genes. Cognitive study of multimodular human minds …