A survey on checkpointing strategies: Should we always checkpoint à la Young/Daly?

L Bautista-Gomez, A Benoit, S Di, T Herault… - Future Generation …, 2024 - Elsevier
Abstract The Young/Daly formula provides an approximation of the optimal checkpointing
period for a parallel application executing on a supercomputing platform. It was originally …

Addressing failures in exascale computing

M Snir, RW Wisniewski, JA Abraham… - … Journal of High …, 2014 - journals.sagepub.com
We present here a report produced by a workshop on 'Addressing failures in exascale
computing' held in Park City, Utah, 4–11 August 2012. The charter of this workshop was to …

[BOOK][B] Fault tolerance techniques for high-performance computing

J Dongarra, T Herault, Y Robert - 2015 - Springer
This chapter provides an introduction to resilience methods. The emphasis is on
checkpointing, the de-facto standard technique for resilience in High Performance …

Lessons learned from the analysis of system failures at petascale: The case of Blue Waters

C Di Martino, Z Kalbarczyk, RK Iyer… - 2014 44th Annual …, 2014 - ieeexplore.ieee.org
This paper provides an analysis of failures and their impact for Blue Waters, the Cray hybrid
(CPU/GPU) supercomputer at the University of Illinois at Urbana-Champaign. The analysis …

DARE: High-performance state machine replication on RDMA networks

M Poke, T Hoefler - Proceedings of the 24th International Symposium on …, 2015 - dl.acm.org
The increasing amount of data that needs to be collected and analyzed requires large-scale
datacenter architectures that are naturally more susceptible to faults of single components …

Diagnosing performance variations in HPC applications using machine learning

O Tuncer, E Ates, Y Zhang, A Turk, J Brandt… - … Conference, ISC High …, 2017 - Springer
With the growing complexity and scale of high performance computing (HPC) systems,
application performance variation has become a significant challenge in efficient and …

Failure prediction for HPC systems and applications: Current situation and open issues

A Gainaru, F Cappello, M Snir… - … International journal of …, 2013 - journals.sagepub.com
As large-scale systems evolve towards post-petascale computing, it is crucial to focus on
providing fault-tolerance strategies that aim to minimize fault's effects on applications. By far …

A shoulder surfing resistant graphical authentication system

HM Sun, ST Chen, JH Yeh… - IEEE Transactions on …, 2016 - ieeexplore.ieee.org
Authentication based on passwords is used largely in applications for computer security and
privacy. However, human actions such as choosing bad passwords and inputting passwords …

Fault prediction under the microscope: A closer look into HPC systems

A Gainaru, F Cappello, M Snir… - SC'12: Proceedings of …, 2012 - ieeexplore.ieee.org
A large percentage of computing capacity in today's large high-performance computing
systems is wasted because of failures. Consequently current research is focusing on …

Reading between the lines of failure logs: Understanding how HPC systems fail

N El-Sayed, B Schroeder - 2013 43rd annual IEEE/IFIP …, 2013 - ieeexplore.ieee.org
As the component count in supercomputing installations continues to increase, system
reliability is becoming one of the major issues in designing HPC systems. These issues will …