Google Scholar

A survey on checkpointing strategies: Should we always checkpoint à la Young/Daly?

L Bautista-Gomez, A Benoit, S Di, T Herault… - Future Generation …, 2024 - Elsevier

Abstract The Young/Daly formula provides an approximation of the optimal checkpointing
period for a parallel application executing on a supercomputing platform. It was originally …

Cited by 2

Failures in large scale systems: long-term measurement, analysis, and implications

S Gupta, T Patel, C Engelmann, D Tiwari - Proceedings of the …, 2017 - dl.acm.org

Resilience is one of the key challenges in maintaining high efficiency of future extreme scale
supercomputers. Researchers and system practitioners rely on field-data studies to …

Cited by 184

Job characteristics on large-scale systems: long-term analysis, quantification, and implications

T Patel, Z Liu, R Kettimuthu, P Rich… - … conference for high …, 2020 - ieeexplore.ieee.org

HPC workload analysis and resource consumption characteristics are the key to driving
better operation practices, system procurement decisions, and designing effective resource …

Cited by 70

Doomsday: Predicting which node will fail when on supercomputers

A Das, F Mueller, P Hargrove… - … Conference for High …, 2018 - ieeexplore.ieee.org

Predicting which node will fail and how soon remains a challenge for HPC resilience, yet
may pave the way to exploiting proactive remedies before jobs fail. Not only for increasing …

Cited by 55

GPU lifetimes on Titan supercomputer: Survival analysis and reliability

G Ostrouchov, D Maxwell, RA Ashraf… - … Conference for High …, 2020 - ieeexplore.ieee.org

The Cray XK7 Titan was the top supercomputer system in the world for a long time and
remained critically important throughout its nearly seven year life. It was an interesting …

Cited by 39

What does power consumption behavior of hpc jobs reveal?: Demystifying, quantifying, and predicting power consumption characteristics

T Patel, A Wagenhäuser, C Eibel… - 2020 IEEE …, 2020 - ieeexplore.ieee.org

As we approach exascale computing, large-scale HPC systems are becoming increasingly
power-constrained, requiring them to run HPC workloads in an energy-efficient manner. The …

Cited by 39

Software approaches for resilience of high performance computing systems: a survey

J Jia, Y Liu, G Zhang, Y Gao, D Qian - Frontiers of Computer Science, 2023 - Springer

With the scaling up of high-performance computing systems in recent years, their reliability
has been descending continuously. Therefore, system resilience has been regarded as one …

Cited by 8

Power-cap aware checkpointing: On the interplay among power-cap, temperature, reliability, performance, and energy

K Tang, D Tiwari, S Gupta, P Huang… - 2016 46th Annual …, 2016 - ieeexplore.ieee.org

Checkpoint and restart mechanisms have been widely used in large scientific simulation
applications to make forward progress in case of failures. However, none of the prior works …

Cited by 30

Assuming failure independence: are we right to be wrong?

G Aupy, Y Robert, F Vivien - 2017 IEEE International …, 2017 - ieeexplore.ieee.org

This paper revisits the failure temporal independence hypothesis which is omnipresent in
the analysis of resilience methods for HPC. We explain why a previous approach is …

Cited by 24

CoREC: Scalable and resilient in-memory data staging for in-situ workflows

S Duan, P Subedi, P Davis, K Teranishi… - ACM Transactions on …, 2020 - dl.acm.org

The dramatic increase in the scale of current and planned high-end HPC systems is leading to
new challenges, such as the growing costs of data movement and IO, and the reduced mean …

Cited by 14

Reducing waste in extreme scale systems through introspective analysis

A survey on checkpointing strategies: Should we always checkpoint à la Young/Daly?