BinFI: an efficient fault injector for safety-critical machine learning systems

Z Chen, G Li, K Pattabiraman… - Proceedings of the …, 2019 - dl.acm.org
As machine learning (ML) becomes pervasive in high performance computing, ML has
found its way into safety-critical domains (e.g., autonomous vehicles). Thus the reliability of …

A low-cost fault corrector for deep neural networks through range restriction

Z Chen, G Li, K Pattabiraman - 2021 51st Annual IEEE/IFIP …, 2021 - ieeexplore.ieee.org
The adoption of deep neural networks (DNNs) in safety-critical domains has engendered
serious reliability concerns. A prominent example is hardware transient faults that are …

Modeling soft-error propagation in programs

G Li, K Pattabiraman, SKS Hari… - 2018 48th Annual …, 2018 - ieeexplore.ieee.org
As technology scales to lower feature sizes, devices become more susceptible to soft errors.
Soft errors can lead to silent data corruptions (SDCs), seriously compromising the reliability …

Optimizing Selective Protection for CNN Resilience.

A Mahmoud, SKS Hari, CW Fletcher, SV Adve, C Sakr… - ISSRE, 2021 - ma3mool.github.io
As CNNs are being extensively employed in high performance and safety-critical
applications that demand high reliability, it is important to ensure that they are resilient to …

Understanding error propagation in GPGPU applications

G Li, K Pattabiraman, CY Cher… - SC'16: Proceedings of …, 2016 - ieeexplore.ieee.org
GPUs have emerged as general-purpose accelerators in high-performance computing
(HPC) and scientific applications. However, the reliability characteristics of GPU applications …

Unprotected computing: A large-scale study of DRAM raw error rate on a supercomputer

L Bautista-Gomez, F Zyulkyarov… - SC'16: Proceedings …, 2016 - ieeexplore.ieee.org
Supercomputers offer new opportunities for scientific computing as they grow in size.
However, their growth also poses new challenges. Resilience has been recognized as one …

Experimental and analytical study of Xeon Phi reliability

D Oliveira, L Pilla, N DeBardeleben… - Proceedings of the …, 2017 - dl.acm.org
We present an in-depth analysis of transient fault effects on HPC applications in Intel Xeon
Phi processors based on radiation experiments and high-level fault injection. Besides …

Using machine learning techniques to evaluate multicore soft error reliability

FR da Rosa, R Garibotti, L Ost… - IEEE Transactions on …, 2019 - ieeexplore.ieee.org
Virtual platform frameworks have been extended to allow earlier soft error analysis of more
realistic multicore systems (i.e., real software stacks and state-of-the-art ISAs). The high …

Demystifying and mitigating cross-layer deficiencies of soft error protection in instruction duplication

Z He, Y Huang, H Xu, D Tao, G Li - Proceedings of the International …, 2023 - dl.acm.org
Soft errors are prevalent in modern High-Performance Computing (HPC) systems, resulting
in silent data corruptions (SDCs), compromising system reliability. Instruction duplication is a …

REFINE: Realistic fault injection via compiler-based instrumentation for accuracy, portability and speed

G Georgakoudis, I Laguna, DS Nikolopoulos… - Proceedings of the …, 2017 - dl.acm.org
Compiler-based fault injection (FI) has become a popular technique for resilience studies to
understand the impact of soft errors in supercomputing systems. Compiler-based FI …