- Academic Search

Processor design for soft errors: Challenges and state of the art

T Li, JA Ambrose, R Ragel… - ACM Computing Surveys …, 2016 - dl.acm.org

Today, soft errors are one of the major design technology challenges at and beyond the
22nm technology nodes. This article introduces the soft error problem from the perspective …

Cited by 40

Practical resource management in power-constrained, high performance computing

T Patki, DK Lowenthal, A Sasidharan… - Proceedings of the 24th …, 2015 - dl.acm.org

Power management is one of the key research challenges on the path to exascale.
Supercomputers today are designed to be worst-case power provisioned, leading to two …

Cited by 132

Fault tolerance of MPI applications in exascale systems: The ULFM solution

N Losada, P González, MJ Martín, G Bosilca… - Future Generation …, 2020 - Elsevier

The growth in the number of computational resources used by high-performance computing
(HPC) systems leads to an increase in failure rates. Fault-tolerant techniques will become …

Cited by 59

Collaboro: a collaborative (meta) modeling tool

JLC Izquierdo, J Cabot - PeerJ Computer Science, 2016 - peerj.com

Motivation Scientists increasingly rely on intelligent information systems to help them in their
daily tasks, in particular for managing research objects, like publications or datasets. The …

Cited by 39

Evaluating and extending user-level fault tolerance in MPI applications

I Laguna, DF Richards, T Gamblin… - … Journal of High …, 2016 - journals.sagepub.com

The user-level failure mitigation (ULFM) interface has been proposed to provide
fault-tolerant semantics in the Message Passing Interface (MPI). Previous work presented …

Cited by 61

Strategies for fault-tolerant tightly-coupled HPC workloads running on low-budget spot cloud infrastructures

V Munhoz, M Castro… - 2022 IEEE 34th …, 2022 - ieeexplore.ieee.org

Cloud providers can rent their spare computing capacity at substantial discounts, reclaiming
it whenever there is a more profitable higher-priority request, a business model well known …

Cited by 15

Reinit: Evaluating the performance of global-restart recovery methods for MPI fault tolerance

G Georgakoudis, L Guo, I Laguna - International Conference on High …, 2020 - Springer

Scaling supercomputers comes with an increase in failure rates due to the increasing
number of hardware components. In standard practice, applications are made resilient …

Cited by 33

Local rollback for resilient MPI applications with application-level checkpointing and message logging

N Losada, G Bosilca, A Bouteiller, P González… - Future Generation …, 2019 - Elsevier

The resilience approach generally used in high-performance computing (HPC) relies on
coordinated checkpoint/restart, a global rollback of all the processes that are running the …

Cited by 34

Exascale machines require new programming paradigms and runtimes

G Da Costa, T Fahringer, JAR Gallego… - Supercomputing …, 2015 - superfri.org

Extreme scale parallel computing systems will have tens of thousands of optionally
accelerator-equipped nodes with hundreds of cores each, as well as deep memory …

Cited by 50

Autocheck: Automatically identifying variables for checkpointing by data dependency analysis

X Fu, W Zhang, S Meng, X Huang, W Xu… - … Conference for High …, 2024 - ieeexplore.ieee.org

Checkpoint/Restart (C/R) has been widely deployed in numerous HPC systems, Clouds,
and industrial data centers, which are typically operated by system engineers. Nevertheless …

Cited by 1

Evaluating user-level fault tolerance for MPI applications

Processor design for soft errors: Challenges and state of the art
