BinFI: An efficient fault injector for safety-critical machine learning systems
As machine learning (ML) becomes pervasive in high performance computing, ML has
found its way into safety-critical domains (e.g., autonomous vehicles). Thus the reliability of …
A low-cost fault corrector for deep neural networks through range restriction
The adoption of deep neural networks (DNNs) in safety-critical domains has engendered
serious reliability concerns. A prominent example is hardware transient faults that are …
Modeling soft-error propagation in programs
As technology scales to lower feature sizes, devices become more susceptible to soft errors.
Soft errors can lead to silent data corruptions (SDCs), seriously compromising the reliability …
Optimizing Selective Protection for CNN Resilience
As CNNs are being extensively employed in high performance and safety-critical
applications that demand high reliability, it is important to ensure that they are resilient to …
Understanding error propagation in GPGPU applications
GPUs have emerged as general-purpose accelerators in high-performance computing
(HPC) and scientific applications. However, the reliability characteristics of GPU applications …
Unprotected computing: A large-scale study of DRAM raw error rate on a supercomputer
Supercomputers offer new opportunities for scientific computing as they grow in size.
However, their growth also poses new challenges. Resilience has been recognized as one …
Experimental and analytical study of Xeon Phi reliability
We present an in-depth analysis of transient fault effects on HPC applications in Intel Xeon
Phi processors based on radiation experiments and high-level fault injection. Besides …
Using machine learning techniques to evaluate multicore soft error reliability
Virtual platform frameworks have been extended to allow earlier soft error analysis of more
realistic multicore systems (i.e., real software stacks and state-of-the-art ISAs). The high …
Demystifying and mitigating cross-layer deficiencies of soft error protection in instruction duplication
Soft errors are prevalent in modern High-Performance Computing (HPC) systems, resulting
in silent data corruptions (SDCs), compromising system reliability. Instruction duplication is a …
REFINE: Realistic fault injection via compiler-based instrumentation for accuracy, portability and speed
Compiler-based fault injection (FI) has become a popular technique for resilience studies to
understand the impact of soft errors in supercomputing systems. Compiler-based FI …