A survey of techniques for modeling and improving reliability of computing systems
Recent trends of aggressive technology scaling have greatly exacerbated the occurrences
and impact of faults in computing systems. This has made 'reliability' a first-order design …
Demystifying the system vulnerability stack: Transient fault effects across the layers
In this paper, we revisit the system vulnerability stack for transient faults. We reveal severe
pitfalls in widely used vulnerability measurement approaches, which separate the hardware …
BinFI: an efficient fault injector for safety-critical machine learning systems
As machine learning (ML) becomes pervasive in high performance computing, ML has
found its way into safety-critical domains (e.g., autonomous vehicles). Thus the reliability of …
Modeling soft-error propagation in programs
As technology scales to lower feature sizes, devices become more susceptible to soft errors.
Soft errors can lead to silent data corruptions (SDCs), seriously compromising the reliability …
AVGI: Microarchitecture-driven, fast and accurate vulnerability assessment
We propose AVGI, a new Statistical Fault Injection (SFI)-based methodology, which delivers
orders of magnitude faster assessment of the Architectural Vulnerability Factor (AVF) of a …
ePVF: An enhanced program vulnerability factor methodology for cross-layer resilience analysis
The Program Vulnerability Factor (PVF) has been proposed as a metric to understand the
impact of hardware faults on software. The PVF is calculated by identifying the program bits …
Using machine learning techniques to evaluate multicore soft error reliability
Virtual platform frameworks have been extended to allow earlier soft error analysis of more
realistic multicore systems (i.e., real software stacks and state-of-the-art ISAs). The high …
Anatomy of on-chip memory hardware fault effects across the layers
Reliability evaluation of a microprocessor design may reveal vulnerable silicon areas that
require protection against faults, but also hardware structures that are inherently more …
MeRLiN: Exploiting dynamic instruction behavior for fast and accurate microarchitecture-level reliability assessment
Early reliability assessment of hardware structures using microarchitecture level simulators
can effectively guide major error protection decisions in microprocessor design. Statistical …
Featherweight soft error resilience for GPUs
This paper presents Flame, a hardware/software co-designed resilience scheme for
protecting GPUs against soft errors. For low-cost yet high-performance resilience, Flame …