A survey on modeling and improving reliability of DNN algorithms and accelerators

S Mittal - Journal of Systems Architecture, 2020 - Elsevier
As DNNs become increasingly common in mission-critical applications, ensuring their
reliable operation has become crucial. Conventional resilience techniques fail to account for …

Short-dot: Computing large linear transforms distributedly using coded short dot products

S Dutta, V Cadambe, P Grover - Advances in Neural …, 2016 - proceedings.neurips.cc
Faced with saturation of Moore's law and increasing size and dimension of data, system
designers have increasingly resorted to parallel and distributed computing to reduce …

On the optimal recovery threshold of coded matrix multiplication

S Dutta, M Fahim, F Haddadpour… - IEEE Transactions …, 2019 - ieeexplore.ieee.org
We provide novel coded computation strategies for distributed matrix-matrix products that
outperform the recent “Polynomial code” constructions in recovery threshold, i.e., the required …

Coded computing: Mitigating fundamental bottlenecks in large-scale distributed computing and machine learning

S Li, S Avestimehr - Foundations and Trends® in …, 2020 - nowpublishers.com
We introduce the concept of “coded computing”, a novel computing paradigm that utilizes
coding theory to effectively inject and leverage data/computation redundancy to mitigate …

A unified coded deep neural network training strategy based on Generalized PolyDot codes

S Dutta, Z Bai, H Jeong, TM Low… - 2018 IEEE International …, 2018 - ieeexplore.ieee.org
This paper has two main contributions. First, we propose a novel coding technique,
Generalized PolyDot, for matrix-vector products that advances on existing techniques for …

Simulating low precision floating-point arithmetic

NJ Higham, S Pranesh - SIAM Journal on Scientific Computing, 2019 - SIAM
The half-precision (fp16) floating-point format, defined in the 2008 revision of the IEEE
standard for floating-point arithmetic, and a more recently proposed half-precision format …

Arithmetic-intensity-guided fault tolerance for neural network inference on GPUs

J Kosaian, KV Rashmi - Proceedings of the International Conference for …, 2021 - dl.acm.org
Neural networks (NNs) are increasingly employed in safety-critical domains and in
environments prone to unreliability (e.g., soft errors), such as on spacecraft. Therefore, it is …

Soft error reliability analysis of vision transformers

X Xue, C Liu, Y Wang, B Yang, T Luo… - … Transactions on Very …, 2023 - ieeexplore.ieee.org
Vision transformers (ViTs) that leverage the self-attention mechanism have shown superior
performance on many classical vision tasks compared to convolutional neural networks …

Exploring Winograd convolution for cost-effective neural network fault tolerance

X Xue, C Liu, B Liu, H Huang, Y Wang… - … Transactions on Very …, 2023 - ieeexplore.ieee.org
Winograd is generally utilized to optimize convolution performance and computational
efficiency because of the reduced multiplication operations, but the reliability issues brought …

An application of storage-optimal MatDot codes for coded matrix multiplication: Fast k-nearest neighbors estimation

U Sheth, S Dutta, M Chaudhari, H Jeong… - … Conference on Big …, 2018 - ieeexplore.ieee.org
We propose a novel application of coded computing to the problem of the nearest neighbor
estimation using MatDot Codes (Fahim et al., Allerton'17) that are known to be optimal for …