A systematic literature review on hardware reliability assessment methods for deep neural networks

MH Ahmadilivani, M Taheri, J Raik… - ACM Computing …, 2024 - dl.acm.org
Artificial Intelligence (AI) and, in particular, Machine Learning (ML), have emerged to be
utilized in various applications due to their capability to learn how to solve complex …

The case for lifetime reliability-aware microprocessors

J Srinivasan, SV Adve, P Bose, JA Rivers - ACM SIGARCH Computer …, 2004 - dl.acm.org
Ensuring long processor lifetimes by limiting failures due to wear-out related hard errors is a
critical requirement for all microprocessor manufacturers. We observe that continuous device …

Understanding and mitigating hardware failures in deep learning training systems

Y He, M Hutton, S Chan, R De Gruijl… - Proceedings of the 50th …, 2023 - dl.acm.org
Deep neural network (DNN) training workloads are increasingly susceptible to hardware
failures in datacenters. For example, Google experienced "mysterious, difficult to identify …

Resilience of deep learning applications: A systematic literature review of analysis and hardening techniques

C Bolchini, L Cassano, A Miele - Computer Science Review, 2024 - Elsevier
Machine Learning (ML) is currently being exploited in numerous applications, being
one of the most effective Artificial Intelligence (AI) technologies used in diverse fields, such …

Exploring Winograd convolution for cost-effective neural network fault tolerance

X Xue, C Liu, B Liu, H Huang, Y Wang… - … Transactions on Very …, 2023 - ieeexplore.ieee.org
Winograd is generally utilized to optimize convolution performance and computational
efficiency because of the reduced multiplication operations, but the reliability issues brought …

Structural coding: A low-cost scheme to protect CNNs from large-granularity memory faults

A Asgari Khoshouyeh, F Geissler, S Qutub… - Proceedings of the …, 2023 - dl.acm.org
The advent of High-Performance Computing has led to the adoption of Convolutional Neural
Networks (CNNs) in safety-critical applications such as autonomous vehicles. However …

Transient-fault-aware design and training to enhance DNNs reliability with zero-overhead

N Cavagnero, F Dos Santos, M Ciccone… - 2022 IEEE 28th …, 2022 - ieeexplore.ieee.org
Deep Neural Networks (DNNs) enable a wide series of technological advancements,
ranging from clinical imaging, to predictive industrial maintenance and autonomous driving …

Thales: Formulating and estimating architectural vulnerability factors for DNN accelerators

A Tyagi, Y Gan, S Liu, B Yu, P Whatmough… - arXiv preprint arXiv …, 2022 - arxiv.org
As Deep Neural Networks (DNNs) are increasingly deployed in safety critical and privacy
sensitive applications such as autonomous driving and biometric authentication, it is critical …

Soft error reliability analysis of vision transformers

X Xue, C Liu, Y Wang, B Yang, T Luo… - … Transactions on Very …, 2023 - ieeexplore.ieee.org
Vision transformers (ViTs) that leverage self-attention mechanism have shown superior
performance on many classical vision tasks compared to convolutional neural networks …

LLTFI: Framework agnostic fault injection for machine learning applications (tools and artifact track)

UK Agarwal, A Chan… - 2022 IEEE 33rd …, 2022 - ieeexplore.ieee.org
As machine learning (ML) has become more prevalent across many critical domains, so
has the need to understand ML applications' resilience. While prior work like TensorFI [1] …