Predictive reliability and fault management in exascale systems: State of the art and perspectives
Performance and power constraints come together with Complementary Metal Oxide
Semiconductor technology scaling in future Exascale systems. Technology scaling makes …
CSI: Rowhammer–Cryptographic security and integrity against rowhammer
In this paper, we present CSI: Rowhammer, a principled hardware-software co-design
Rowhammer mitigation with cryptographic security and integrity guarantees, that does not …
FT-CNN: Algorithm-based fault tolerance for convolutional neural networks
Convolutional neural networks (CNNs) are becoming more and more important for solving
challenging and critical problems in many fields. CNN inference applications have been …
A survey on checkpointing strategies: Should we always checkpoint à la Young/Daly?
The Young/Daly formula provides an approximation of the optimal checkpointing
period for a parallel application executing on a supercomputing platform. It was originally …
What can we learn from four years of data center hardware failures?
Hardware failures have a big impact on the dependability of large-scale data centers. We
present studies on over 290,000 hardware failure reports collected over the past four years …
Desh: deep learning for system health prediction of lead times to failure in HPC
Today's large-scale supercomputers encounter faults on a daily basis. Exascale systems are
likely to experience even higher fault rates due to increased component count and density …
Silent data errors: Sources, detection, and modeling
A Singh, S Chakravarty, G Papadimitriou… - 2023 IEEE 41st VLSI …, 2023 - ieeexplore.ieee.org
Chip manufacturers and hyperscalers are becoming increasingly aware of the problem
posed by Silent Data Errors (SDE) and are taking steps to address it. Major computing …
HARP: Practically and effectively identifying uncorrectable errors in memory chips that use on-die error-correcting codes
Aggressive storage density scaling in modern main memories causes increasing error rates
that are addressed using error-mitigation techniques. State-of-the-art techniques for …
Silent data corruptions: The stealthy saboteurs of digital integrity
Silent Data Corruptions (SDCs) pose a significant threat to the integrity of digital systems.
These stealthy saboteurs silently corrupt data, remaining undetected by traditional error …
Predicting faults in high performance computing systems: An in-depth survey of the state-of-the-practice
As we near exascale, resilience remains a major technical hurdle. Any technique with the
goal of achieving resilience suffers from having to be reactive, as failures can appear at any …