Understanding error propagation in deep learning neural network (DNN) accelerators and applications
Deep learning neural networks (DNNs) have been successful in solving a wide range of
machine learning problems. Specialized hardware accelerators have been proposed to …
Addressing failures in exascale computing
We present here a report produced by a workshop on 'Addressing failures in exascale
computing' held in Park City, Utah, 4–11 August 2012. The charter of this workshop was to …
Quantifying the accuracy of high-level fault injection techniques for hardware faults
Hardware errors are on the rise with reducing feature sizes, however tolerating them in
hardware is expensive. Researchers have explored software-based techniques for building …
Understanding and mitigating hardware failures in deep learning training systems
Y He, M Hutton, S Chan, R De Gruijl… - Proceedings of the 50th …, 2023 - dl.acm.org
Deep neural network (DNN) training workloads are increasingly susceptible to hardware
failures in datacenters. For example, Google experienced "mysterious, difficult to identify …
Llfi: An intermediate code-level fault injection tool for hardware faults
Q Lu, M Farahani, J Wei, A Thomas… - … on Software Quality …, 2015 - ieeexplore.ieee.org
Hardware errors are becoming more prominent with reducing feature sizes, however
tolerating them exclusively in hardware is expensive. Researchers have explored software …
Silent data corruptions: Microarchitectural perspectives
Today more than ever before, academia, manufacturers, and hyperscalers acknowledge the
major challenge of silent data corruptions (SDCs) and aim on solutions to minimize its …
Approxilyzer: Towards a systematic framework for instruction-level approximate computing and its application to hardware resiliency
Approximate computing environments trade off computational accuracy for improvements in
performance, energy, and resiliency cost. For widespread adoption of approximate …
CLEAR: Cross-layer exploration for architecting resilience - combining hardware and software techniques to tolerate soft errors in processor cores
We present a first of its kind framework which overcomes a major challenge in the design of
digital systems that are resilient to reliability failures: achieve desired resilience targets at …
FAIL*: An open and versatile fault-injection framework for the assessment of software-implemented hardware fault tolerance
Due to voltage and structure shrinking, the influence of radiation on a circuit's operation
increases, resulting in future hardware designs exhibiting much higher rates of soft errors …
Ipas: Intelligent protection against silent output corruption in scientific applications
This paper presents IPAS, an instruction duplication technique that protects scientific
applications from silent data corruption (SDC) in their output. The motivation for IPAS is that …