Understanding error propagation in deep learning neural network (DNN) accelerators and applications

G Li, SKS Hari, M Sullivan, T Tsai… - Proceedings of the …, 2017 - dl.acm.org
Deep learning neural networks (DNNs) have been successful in solving a wide range of
machine learning problems. Specialized hardware accelerators have been proposed to …

Addressing failures in exascale computing

M Snir, RW Wisniewski, JA Abraham… - … Journal of High …, 2014 - journals.sagepub.com
We present here a report produced by a workshop on 'Addressing failures in exascale
computing' held in Park City, Utah, 4–11 August 2012. The charter of this workshop was to …

Quantifying the accuracy of high-level fault injection techniques for hardware faults

J Wei, A Thomas, G Li… - 2014 44th Annual IEEE …, 2014 - ieeexplore.ieee.org
Hardware errors are on the rise with reducing feature sizes, however tolerating them in
hardware is expensive. Researchers have explored software-based techniques for building …

Understanding and mitigating hardware failures in deep learning training systems

Y He, M Hutton, S Chan, R De Gruijl… - Proceedings of the 50th …, 2023 - dl.acm.org
Deep neural network (DNN) training workloads are increasingly susceptible to hardware
failures in datacenters. For example, Google experienced "mysterious, difficult to identify …

LLFI: An intermediate code-level fault injection tool for hardware faults

Q Lu, M Farahani, J Wei, A Thomas… - … on Software Quality …, 2015 - ieeexplore.ieee.org
Hardware errors are becoming more prominent with reducing feature sizes, however
tolerating them exclusively in hardware is expensive. Researchers have explored software …

Silent data corruptions: Microarchitectural perspectives

G Papadimitriou, D Gizopoulos - IEEE Transactions on …, 2023 - ieeexplore.ieee.org
Today more than ever before, academia, manufacturers, and hyperscalers acknowledge the
major challenge of silent data corruptions (SDCs) and aim on solutions to minimize its …

Approxilyzer: Towards a systematic framework for instruction-level approximate computing and its application to hardware resiliency

R Venkatagiri, A Mahmoud, SKS Hari… - 2016 49th Annual …, 2016 - ieeexplore.ieee.org
Approximate computing environments trade off computational accuracy for improvements in
performance, energy, and resiliency cost. For widespread adoption of approximate …

CLEAR: Cross-layer exploration for architecting resilience: combining hardware and software techniques to tolerate soft errors in processor cores

E Cheng, S Mirkhani, LG Szafaryn, CY Cher… - Proceedings of the 53rd …, 2016 - dl.acm.org
We present a first of its kind framework which overcomes a major challenge in the design of
digital systems that are resilient to reliability failures: achieve desired resilience targets at …

FAIL*: An open and versatile fault-injection framework for the assessment of software-implemented hardware fault tolerance

H Schirmeier, M Hoffmann, C Dietrich… - 2015 11th European …, 2015 - ieeexplore.ieee.org
Due to voltage and structure shrinking, the influence of radiation on a circuit's operation
increases, resulting in future hardware designs exhibiting much higher rates of soft errors …

IPAS: Intelligent protection against silent output corruption in scientific applications

I Laguna, M Schulz, DF Richards, J Calhoun… - Proceedings of the 2016 …, 2016 - dl.acm.org
This paper presents IPAS, an instruction duplication technique that protects scientific
applications from silent data corruption (SDC) in their output. The motivation for IPAS is that …