Demystifying the system vulnerability stack: Transient fault effects across the layers

G Papadimitriou, D Gizopoulos - 2021 ACM/IEEE 48th Annual …, 2021 - ieeexplore.ieee.org
In this paper, we revisit the system vulnerability stack for transient faults. We reveal severe
pitfalls in widely used vulnerability measurement approaches, which separate the hardware …

Artificial neural networks for space and safety-critical applications: Reliability issues and potential solutions

P Rech - IEEE Transactions on Nuclear Science, 2024 - ieeexplore.ieee.org
Machine learning is among the greatest advancements in computer science and
engineering and is today used to classify or detect objects, a key feature in autonomous …

Understanding and mitigating hardware failures in deep learning training systems

Y He, M Hutton, S Chan, R De Gruijl… - Proceedings of the 50th …, 2023 - dl.acm.org
Deep neural network (DNN) training workloads are increasingly susceptible to hardware
failures in datacenters. For example, Google experienced "mysterious, difficult to identify …

AVGI: Microarchitecture-driven, fast and accurate vulnerability assessment

G Papadimitriou, D Gizopoulos - 2023 IEEE International …, 2023 - ieeexplore.ieee.org
We propose AVGI, a new Statistical Fault Injection (SFI)-based methodology, which delivers
orders of magnitude faster assessment of the Architectural Vulnerability Factor (AVF) of a …

Survey on Redundancy Based-Fault tolerance methods for Processors and Hardware accelerators-Trends in Quantum Computing, Heterogeneous Systems and …

S Venkatesha, R Parthasarathi - ACM Computing Surveys, 2024 - dl.acm.org
Rapid progress in the CMOS technology for the past 25 years has increased the
vulnerability of processors towards faults. Subsequently, focus of computer architects shifted …

Soft error effects on Arm microprocessors: Early estimations versus chip measurements

PR Bodmann, G Papadimitriou… - IEEE Transactions …, 2021 - ieeexplore.ieee.org
Extensive research efforts are being carried out to evaluate and improve the reliability of
computing devices either through beam experiments or simulation-based fault injection …

Impact of voltage scaling on soft errors susceptibility of multicore server CPUs

D Agiakatsikas, G Papadimitriou, V Karakostas… - Proceedings of the 56th …, 2023 - dl.acm.org
Microprocessor power consumption and dependability are both crucial challenges that
designers have to cope with due to shrinking feature sizes and increasing transistor counts …

Harpocrates: Breaking the silence of CPU faults through hardware-in-the-loop program generation

N Karystinos, O Chatzopoulos… - 2024 ACM/IEEE 51st …, 2024 - ieeexplore.ieee.org
Several hyperscalers have recently disclosed the occurrence of Silent Data Corruptions
(SDCs) in their systems fleets, sparking concerns about the severity of known and the …

Revealing GPUs vulnerabilities by combining register-transfer and software-level fault injection

FF dos Santos, JER Condia, L Carro… - 2021 51st Annual …, 2021 - ieeexplore.ieee.org
The complexity of both hardware and software makes GPUs reliability evaluation extremely
challenging. A low level fault injection on a GPU model, despite being accurate, would take …

Silent data errors: Sources, detection, and modeling

A Singh, S Chakravarty, G Papadimitriou… - 2023 IEEE 41st VLSI …, 2023 - ieeexplore.ieee.org
Chip manufacturers and hyperscalers are becoming increasingly aware of the problem
posed by Silent Data Errors (SDE) and are taking steps to address it. Major computing …