Predictive reliability and fault management in exascale systems: State of the art and perspectives

R Canal, C Hernandez, R Tornero, A Cilardo… - ACM Computing …, 2020 - dl.acm.org
Performance and power constraints come together with Complementary Metal Oxide
Semiconductor technology scaling in future Exascale systems. Technology scaling makes …

A survey on malleability solutions for high-performance distributed computing

JI Aliaga, M Castillo, S Iserte, I Martín-Álvarez… - Applied Sciences, 2022 - mdpi.com
Maintaining a high rate of productivity, in terms of completed jobs per unit of time, in High-
Performance Computing (HPC) facilities is a cornerstone in the next generation of exascale …

Basis path coverage testing of MPI programs based on multi-task evolutionary optimization

B Sun, L Gong, Y Guo, D Gong - Expert Systems with Applications, 2024 - Elsevier
A Message-Passing Interface (MPI) program usually consists of several processes,
and a target path of this program is composed of a target sub-path selected in each process …

An efficient ant colony optimization framework for HPC environments

P González, RR Osorio, XC Pardo, JR Banga… - Applied Soft …, 2022 - Elsevier
Combinatorial optimization problems arise in many disciplines, both in the basic sciences
and in applied fields such as engineering and economics. One of the most popular …

Resilience and fault tolerance in high-performance computing for numerical weather and climate prediction

T Benacchio, L Bonaventura… - … Journal of High …, 2021 - journals.sagepub.com
Progress in numerical weather and climate prediction accuracy greatly depends on the
growth of the available computing power. As the number of cores in top computing facilities …

Strategies for fault-tolerant tightly-coupled HPC workloads running on low-budget spot cloud infrastructures

V Munhoz, M Castro… - 2022 IEEE 34th …, 2022 - ieeexplore.ieee.org
Cloud providers can rent their spare computing capacity at substantial discounts, reclaiming
it whenever there is a more profitable higher-priority request, a business model well known …

Frontier vs the Exascale report: Why so long? And are we really there yet?

PM Kogge, WJ Dally - 2022 IEEE/ACM International Workshop …, 2022 - ieeexplore.ieee.org
Now that the exascale Frontier is here, it is instructive to compare its properties to those
projected in the 2008 Exascale technology report, and ask what's different, why did it …

Taking the MPI standard and the Open MPI library to exascale

DE Bernholdt, G Bosilca, A Bouteiller… - … Journal of High …, 2024 - journals.sagepub.com
The Open MPI for Exascale (OMPI-X) project was one of two in the Exascale Computing
Project (ECP) focused on advancing the MPI ecosystem. The OMPI-X team worked with …

DGRO: Diameter-Guided Ring Optimization for Integrated Research Infrastructure Membership

S Wu, K Raghavan, S Di, Z Chen… - arXiv preprint arXiv …, 2024 - arxiv.org
Logical ring is a core component in membership protocol. However, the logic ring fails to
consider the underlying physical latency, resulting in a high diameter. To address this issue …

Legio: fault resiliency for embarrassingly parallel MPI applications

R Rocco, D Gadioli, G Palermo - The Journal of Supercomputing, 2022 - Springer
Due to the increasing size of HPC machines, dealing with faults is becoming mandatory due
to their high frequency. Natively, MPI cannot handle faults and it stops the execution …