A survey of techniques for modeling and improving reliability of computing systems

S Mittal, JS Vetter - IEEE Transactions on Parallel and …, 2015 - ieeexplore.ieee.org
Recent trends of aggressive technology scaling have greatly exacerbated the occurrences
and impact of faults in computing systems. This has made 'reliability' a first-order design …

Demystifying the system vulnerability stack: Transient fault effects across the layers

G Papadimitriou, D Gizopoulos - 2021 ACM/IEEE 48th Annual …, 2021 - ieeexplore.ieee.org
In this paper, we revisit the system vulnerability stack for transient faults. We reveal severe
pitfalls in widely used vulnerability measurement approaches, which separate the hardware …

BinFI: an efficient fault injector for safety-critical machine learning systems

Z Chen, G Li, K Pattabiraman… - Proceedings of the …, 2019 - dl.acm.org
As machine learning (ML) becomes pervasive in high performance computing, ML has
found its way into safety-critical domains (e.g., autonomous vehicles). Thus the reliability of …

Modeling soft-error propagation in programs

G Li, K Pattabiraman, SKS Hari… - 2018 48th Annual …, 2018 - ieeexplore.ieee.org
As technology scales to lower feature sizes, devices become more susceptible to soft errors.
Soft errors can lead to silent data corruptions (SDCs), seriously compromising the reliability …

AVGI: Microarchitecture-driven, fast and accurate vulnerability assessment

G Papadimitriou, D Gizopoulos - 2023 IEEE International …, 2023 - ieeexplore.ieee.org
We propose AVGI, a new Statistical Fault Injection (SFI)-based methodology, which delivers
orders of magnitude faster assessment of the Architectural Vulnerability Factor (AVF) of a …

ePVF: An enhanced program vulnerability factor methodology for cross-layer resilience analysis

B Fang, Q Lu, K Pattabiraman… - 2016 46th Annual …, 2016 - ieeexplore.ieee.org
The Program Vulnerability Factor (PVF) has been proposed as a metric to understand the
impact of hardware faults on software. The PVF is calculated by identifying the program bits …

Using machine learning techniques to evaluate multicore soft error reliability

FR da Rosa, R Garibotti, L Ost… - IEEE Transactions on …, 2019 - ieeexplore.ieee.org
Virtual platform frameworks have been extended to allow earlier soft error analysis of more
realistic multicore systems (i.e., real software stacks and state-of-the-art ISAs). The high …

Anatomy of on-chip memory hardware fault effects across the layers

G Papadimitriou, D Gizopoulos - IEEE Transactions on …, 2022 - ieeexplore.ieee.org
Reliability evaluation of a microprocessor design may reveal vulnerable silicon areas that
require protection against faults, but also hardware structures that are inherently more …

MeRLiN: Exploiting dynamic instruction behavior for fast and accurate microarchitecture level reliability assessment

M Kaliorakis, D Gizopoulos, R Canal… - Proceedings of the 44th …, 2017 - dl.acm.org
Early reliability assessment of hardware structures using microarchitecture level simulators
can effectively guide major error protection decisions in microprocessor design. Statistical …

Featherweight soft error resilience for GPUs

Y Zhang, C Jung - … 55th IEEE/ACM International Symposium on …, 2022 - ieeexplore.ieee.org
This paper presents Flame, a hardware/software co-designed resilience scheme for
protecting GPUs against soft errors. For low-cost yet high-performance resilience, Flame …