Understanding error propagation in deep learning neural network (DNN) accelerators and applications

G Li, SKS Hari, M Sullivan, T Tsai… - Proceedings of the …, 2017 - dl.acm.org
Deep learning neural networks (DNNs) have been successful in solving a wide range of
machine learning problems. Specialized hardware accelerators have been proposed to …

SASSIFI: An architecture-level fault injection tool for GPU application resilience evaluation

SKS Hari, T Tsai, M Stephenson… - … Analysis of Systems …, 2017 - ieeexplore.ieee.org
As GPUs become more pervasive in both scalable high-performance computing systems
and safety-critical embedded systems, evaluating and analyzing their resilience to soft errors …

Demystifying the system vulnerability stack: Transient fault effects across the layers

G Papadimitriou, D Gizopoulos - 2021 ACM/IEEE 48th Annual …, 2021 - ieeexplore.ieee.org
In this paper, we revisit the system vulnerability stack for transient faults. We reveal severe
pitfalls in widely used vulnerability measurement approaches, which separate the hardware …

Artificial neural networks for space and safety-critical applications: Reliability issues and potential solutions

P Rech - IEEE Transactions on Nuclear Science, 2024 - ieeexplore.ieee.org
Machine learning is among the greatest advancements in computer science and
engineering and is today used to classify or detect objects, a key feature in autonomous …

BinFI: An efficient fault injector for safety-critical machine learning systems

Z Chen, G Li, K Pattabiraman… - Proceedings of the …, 2019 - dl.acm.org
As machine learning (ML) becomes pervasive in high performance computing, ML has
found its way into safety-critical domains (e.g., autonomous vehicles). Thus the reliability of …

A low-cost fault corrector for deep neural networks through range restriction

Z Chen, G Li, K Pattabiraman - 2021 51st Annual IEEE/IFIP …, 2021 - ieeexplore.ieee.org
The adoption of deep neural networks (DNNs) in safety-critical domains has engendered
serious reliability concerns. A prominent example is hardware transient faults that are …

Modeling soft-error propagation in programs

G Li, K Pattabiraman, SKS Hari… - 2018 48th Annual …, 2018 - ieeexplore.ieee.org
As technology scales to lower feature sizes, devices become more susceptible to soft errors.
Soft errors can lead to silent data corruptions (SDCs), seriously compromising the reliability …

TensorFI: A flexible fault injection framework for TensorFlow applications

Z Chen, N Narayanan, B Fang, G Li… - 2020 IEEE 31st …, 2020 - ieeexplore.ieee.org
As machine learning (ML) has seen increasing adoption in safety-critical domains (e.g.,
autonomous vehicles), the reliability of ML systems has also grown in importance. While …

AVGI: Microarchitecture-driven, fast and accurate vulnerability assessment

G Papadimitriou, D Gizopoulos - 2023 IEEE International …, 2023 - ieeexplore.ieee.org
We propose AVGI, a new Statistical Fault Injection (SFI)-based methodology, which delivers
orders of magnitude faster assessment of the Architectural Vulnerability Factor (AVF) of a …

LLFI: An intermediate code-level fault injection tool for hardware faults

Q Lu, M Farahani, J Wei, A Thomas… - … on Software Quality …, 2015 - ieeexplore.ieee.org
Hardware errors are becoming more prominent with reducing feature sizes, however
tolerating them exclusively in hardware is expensive. Researchers have explored software …