A survey on checkpointing strategies: Should we always checkpoint à la Young/Daly?
The Young/Daly formula provides an approximation of the optimal checkpointing
period for a parallel application executing on a supercomputing platform. It was originally …
Failures in large scale systems: long-term measurement, analysis, and implications
Resilience is one of the key challenges in maintaining high efficiency of future extreme scale
supercomputers. Researchers and system practitioners rely on field-data studies to …
Job characteristics on large-scale systems: long-term analysis, quantification, and implications
HPC workload analysis and resource consumption characteristics are the key to driving
better operation practices, system procurement decisions, and designing effective resource …
Doomsday: Predicting which node will fail when on supercomputers
Predicting which node will fail and how soon remains a challenge for HPC resilience, yet
may pave the way to exploiting proactive remedies before jobs fail. Not only for increasing …
GPU lifetimes on Titan supercomputer: Survival analysis and reliability
The Cray XK7 Titan was the top supercomputer system in the world for a long time and
remained critically important throughout its nearly seven year life. It was an interesting …
What does power consumption behavior of HPC jobs reveal? Demystifying, quantifying, and predicting power consumption characteristics
T Patel, A Wagenhäuser, C Eibel… - 2020 IEEE …, 2020 - ieeexplore.ieee.org
As we approach exascale computing, large-scale HPC systems are becoming increasingly
power-constrained, requiring them to run HPC workloads in an energy-efficient manner. The …
Software approaches for resilience of high performance computing systems: a survey
With the scaling up of high-performance computing systems in recent years, their reliability
has been declining continuously. Therefore, system resilience has been regarded as one …
Power-capping aware checkpointing: On the interplay among power-capping, temperature, reliability, performance, and energy
Checkpoint and restart mechanisms have been widely used in large scientific simulation
applications to make forward progress in case of failures. However, none of the prior works …
Assuming failure independence: are we right to be wrong?
This paper revisits the failure temporal independence hypothesis which is omnipresent in
the analysis of resilience methods for HPC. We explain why a previous approach is …
CoREC: Scalable and resilient in-memory data staging for in-situ workflows
The dramatic increase in the scale of current and planned high-end HPC systems is leading to
new challenges, such as the growing costs of data movement and IO, and the reduced mean …