The landscape of exascale research: A data-driven literature analysis

S Heldens, P Hijma, BV Werkhoven… - ACM Computing …, 2020 - dl.acm.org
The next generation of supercomputers will break the exascale barrier. Soon we will have
systems capable of at least one quintillion (billion billion) floating-point operations per …

A checkpoint of research on parallel I/O for high-performance computing

FZ Boito, EC Inacio, JL Bez, POA Navaux… - ACM Computing …, 2018 - dl.acm.org
We present a comprehensive survey on parallel I/O in the high-performance computing
(HPC) context. This is an important field for HPC because of the historic gap between …

Performance optimality or reproducibility: that is the question

T Patki, JJ Thiagarajan, A Ayala, TZ Islam - Proceedings of the …, 2019 - dl.acm.org
The era of extremely heterogeneous supercomputing brings with itself the devil of increased
performance variation and reduced reproducibility. There is a lack of understanding in the …

A systematic survey on fault-tolerant solutions for distributed data analytics: Taxonomy, comparison, and future directions

S Isukapalli, SN Srirama - Computer Science Review, 2024 - Elsevier
Fault tolerance is becoming increasingly important for upcoming exascale systems,
supporting distributed data processing, due to the expected decrease in the Mean Time …

EReinit: Scalable and efficient fault‐tolerance for bulk‐synchronous MPI applications

S Chakraborty, I Laguna, M Emani… - Concurrency and …, 2020 - Wiley Online Library
Scientists from many different fields have been developing Bulk‐Synchronous MPI
applications to simulate and study a wide variety of scientific phenomena. Since failure rates …

Exploring energy saving opportunities in fault tolerant HPC systems

M Morán, J Balladini, D Rexachs, E Rucci - Journal of Parallel and …, 2024 - Elsevier
Nowadays, improving the energy efficiency of high-performance computing (HPC) systems
is one of the main drivers in scientific and technological research. As large-scale HPC …

Prediction of energy consumption by checkpoint/restart in HPC

M Morán, J Balladini, D Rexachs, E Luque - IEEE Access, 2019 - ieeexplore.ieee.org
The fault tolerance method most used today in high-performance computing (HPC) is
coordinated checkpointing. This, like any other fault tolerance method, adds additional …

Optimizing checkpoint intervals for reduced energy use in exascale systems

D Dauwe, R Jhaveri, S Pasricha… - 2017 Eighth …, 2017 - ieeexplore.ieee.org
In today's high performance computing (HPC) systems, the probability of applications
experiencing failures has increased significantly with the increase in the number of system …

Fault-tolerant regularity-based real-time virtual resources

AMK Cheng, G Dai, PK Paluri, M Ansari… - 2019 IEEE 25th …, 2019 - ieeexplore.ieee.org
Many safety-critical applications employ embedded real-time systems where both timing and
fault tolerance requirements must be continually satisfied. The Regularity-based Resource …

Exploiting Efficiency Opportunities Based on Workloads with Electron on Heterogeneous Clusters

R DelValle, P Kaushik, A Jain, J Hartog… - Proceedings of the 10th …, 2017 - dl.acm.org
Resource Management tools for large-scale clusters and data centers typically schedule
resources based on task requirements specified in terms of processor, memory, and disk …