Soft errors in DNN accelerators: A comprehensive review

Y Ibrahim, H Wang, J Liu, J Wei, L Chen, P Rech… - Microelectronics …, 2020 - Elsevier
Deep learning tasks cover a broad range of domains and an even more extensive range of
applications, from entertainment to extremely safety-critical fields. Thus, Deep Neural …

Analyzing and increasing the reliability of convolutional neural networks on GPUs

FF dos Santos, PF Pimenta, C Lunardi… - IEEE Transactions …, 2018 - ieeexplore.ieee.org
Graphics processing units (GPUs) are playing a critical role in convolutional neural networks
(CNNs) for image detection. As GPU-enabled CNNs move into safety-critical environments …

Artificial neural networks for space and safety-critical applications: Reliability issues and potential solutions

P Rech - IEEE Transactions on Nuclear Science, 2024 - ieeexplore.ieee.org
Machine learning is among the greatest advancements in computer science and
engineering and is today used to classify or detect objects, a key feature in autonomous …

Understanding GPU errors on large-scale HPC systems and the implications for system design and operation

D Tiwari, S Gupta, J Rogers, D Maxwell… - 2015 IEEE 21st …, 2015 - ieeexplore.ieee.org
Increase in graphics hardware performance and improvements in programmability have
enabled GPUs to evolve from a graphics-specific accelerator to a general-purpose …

High energy and thermal neutron sensitivity of Google tensor processing units

RLR Junior, S Malde, C Cazzaniga… - … on Nuclear Science, 2022 - ieeexplore.ieee.org
In this article, we investigate the reliability of Google's Coral tensor processing units (TPUs)
to both high-energy atmospheric neutrons (at ChipIR) and thermal neutrons from a pulsed …

Soft error resilience of deep residual networks for object recognition

Y Ibrahim, H Wang, M Bai, Z Liu, J Wang, Z Yang… - IEEE …, 2020 - ieeexplore.ieee.org
Convolutional Neural Networks (CNNs) have truly gained attention in object recognition and
object classification in particular. When being implemented on Graphics Processing Units …

Evaluation and mitigation of radiation-induced soft errors in graphics processing units

DAGG de Oliveira, LL Pilla, T Santini… - IEEE Transactions on …, 2015 - ieeexplore.ieee.org
Graphics processing units (GPUs) are increasingly attractive for both safety-critical and High-
Performance Computing applications. GPU reliability is a primary concern for both the …

Evaluation and mitigation of soft-errors in neural network-based object detection in three GPU architectures

FF dos Santos, L Draghetti, L Weigel… - 2017 47th Annual …, 2017 - ieeexplore.ieee.org
In this paper, we evaluate the reliability of the You Only Look Once (YOLO) object detection
framework. We have exposed to controlled neutron beams GPUs designed with three …

Impact of GPUs parallelism management on safety-critical and HPC applications reliability

P Rech, LL Pilla, POA Navaux… - 2014 44th Annual IEEE …, 2014 - ieeexplore.ieee.org
Graphics Processing Units (GPUs) offer high computational power but require high
scheduling strain to manage parallel processes, which increases the GPU cross section …

Experimental and analytical study of Xeon Phi reliability

D Oliveira, L Pilla, N DeBardeleben… - Proceedings of the …, 2017 - dl.acm.org
We present an in-depth analysis of transient faults effects on HPC applications in Intel Xeon
Phi processors based on radiation experiments and high-level fault injection. Besides …