Toward exascale resilience: 2014 update

F Cappello, A Geist, W Gropp, S Kale… - Supercomputing …, 2014 - superfri.org
Resilience is a major roadblock for HPC executions on future exascale systems. These
systems will typically gather millions of CPU cores running up to a billion threads …

Design, modeling, and evaluation of a scalable multi-level checkpointing system

A Moody, G Bronevetsky, K Mohror… - SC'10: Proceedings …, 2010 - ieeexplore.ieee.org
High-performance computing (HPC) systems are growing more powerful by utilizing more
hardware components. As the system mean-time-before-failure correspondingly drops …

HOT SAX: Efficiently finding the most unusual time series subsequence

E Keogh, J Lin, A Fu - … Conference on Data Mining (ICDM'05), 2005 - ieeexplore.ieee.org
In this work, we introduce the new problem of finding time series discords. Time series
discords are subsequences of a longer time series that are maximally different to all the rest …

Online-ABFT: An online algorithm based fault tolerance scheme for soft error detection in iterative methods

Z Chen - ACM SIGPLAN Notices, 2013 - dl.acm.org
Soft errors are one-time events that corrupt the state of a computing system but not its overall
functionality. Large supercomputers are especially susceptible to soft errors because of their …

Condition numbers of Gaussian random matrices

Z Chen, JJ Dongarra - SIAM Journal on Matrix Analysis and Applications, 2005 - SIAM
Let $G_{m \times n}$ be an $m \times n$ real random matrix whose elements are independent and identically
distributed standard normal random variables, and let $\kappa_2(G_{m \times n})$ be the 2-norm …

Algorithm-based fault tolerance for fail-stop failures

Z Chen, J Dongarra - IEEE Transactions on Parallel and …, 2008 - ieeexplore.ieee.org
Fail-stop failures in distributed environments are often tolerated by checkpointing or
message logging. In this paper, we show that fail-stop process failures in ScaLAPACK matrix …

High performance linpack benchmark: a fault tolerant implementation without checkpointing

T Davies, C Karlsson, H Liu, C Ding… - Proceedings of the …, 2011 - dl.acm.org
The probability that a failure will occur before the end of the computation increases as the
number of processors used in a high performance computing application increases. For long …

Algorithm-based recovery for iterative methods without checkpointing

Z Chen - Proceedings of the 20th international symposium on …, 2011 - dl.acm.org
In today's high performance computing practice, fail-stop failures are often tolerated by
checkpointing. While checkpointing is a very general technique and can often be applied to …

The reliability wall for exascale supercomputing

X Yang, Z Wang, J Xue, Y Zhou - IEEE Transactions on …, 2011 - ieeexplore.ieee.org
Reliability is a key challenge to be understood to turn the vision of exascale supercomputing
into reality. Inevitably, large-scale supercomputing systems, especially those at the …

Algorithm-based checkpoint-free fault tolerance for parallel matrix computations on volatile resources

Z Chen, J Dongarra - Proceedings 20th IEEE International …, 2006 - ieeexplore.ieee.org
As the size of today's high performance computers increases from hundreds, to thousands,
and even tens of thousands of processors, node failures in these computers are becoming …