A low-cost fault corrector for deep neural networks through range restriction
The adoption of deep neural networks (DNNs) in safety-critical domains has engendered
serious reliability concerns. A prominent example is hardware transient faults that are …
BinFI: an efficient fault injector for safety-critical machine learning systems
As machine learning (ML) becomes pervasive in high performance computing, ML has
found its way into safety-critical domains (e.g., autonomous vehicles). Thus the reliability of …
Modeling soft-error propagation in programs
As technology scales to lower feature sizes, devices become more susceptible to soft errors.
Soft errors can lead to silent data corruptions (SDCs), seriously compromising the reliability …
Optimizing selective protection for CNN resilience
As CNNs are being extensively employed in high performance and safety-critical
applications that demand high reliability, it is important to ensure that they are resilient to …
Understanding error propagation in GPGPU applications
GPUs have emerged as general-purpose accelerators in high-performance computing
(HPC) and scientific applications. However, the reliability characteristics of GPU applications …
Unprotected computing: A large-scale study of DRAM raw error rate on a supercomputer
Supercomputers offer new opportunities for scientific computing as they grow in size.
However, their growth also poses new challenges. Resilience has been recognized as one …
Using machine learning techniques to evaluate multicore soft error reliability
Virtual platform frameworks have been extended to allow earlier soft error analysis of more
realistic multicore systems (i.e., real software stacks and state-of-the-art ISAs). The high …
Experimental and analytical study of Xeon Phi reliability
We present an in-depth analysis of transient faults effects on HPC applications in Intel Xeon
Phi processors based on radiation experiments and high-level fault injection. Besides …
REFINE: Realistic fault injection via compiler-based instrumentation for accuracy, portability and speed
Compiler-based fault injection (FI) has become a popular technique for resilience studies to
understand the impact of soft errors in supercomputing systems. Compiler-based FI …
Correcting soft errors online in fast Fourier transform
While many algorithm-based fault tolerance (ABFT) schemes have been proposed to detect
soft errors offline in the fast Fourier transform (FFT) after computation finishes, none of the …