FT-CNN: Algorithm-based fault tolerance for convolutional neural networks

K Zhao, S Di, S Li, X Liang, Y Zhai… - … on Parallel and …, 2020 - ieeexplore.ieee.org
Convolutional neural networks (CNNs) are becoming more and more important for solving
challenging and critical problems in many fields. CNN inference applications have been …

Making convolutions resilient via algorithm-based error detection techniques

SKS Hari, MB Sullivan, T Tsai… - IEEE Transactions on …, 2021 - ieeexplore.ieee.org
Convolutional Neural Networks (CNNs) are being increasingly used in safety-critical and
high-performance computing systems. As such systems require high levels of resilience to …

Arithmetic-intensity-guided fault tolerance for neural network inference on GPUs

J Kosaian, KV Rashmi - Proceedings of the International Conference for …, 2021 - dl.acm.org
Neural networks (NNs) are increasingly employed in safety-critical domains and in
environments prone to unreliability (e.g., soft errors), such as on spacecraft. Therefore, it is …

Efficient error detection for matrix multiplication with systolic arrays on FPGAs

F Libano, P Rech, J Brunhaver - IEEE Transactions on …, 2023 - ieeexplore.ieee.org
Matrix multiplication has always been a cornerstone in computer science. In fact, linear
algebra tools permeate a wide variety of applications: from weather forecasting, to financial …

Low-cost online convolution checksum checker

D Filippas, N Margomenos… - … Transactions on Very …, 2021 - ieeexplore.ieee.org
Managing random hardware faults requires the faults to be detected online, thus simplifying
recovery. Algorithm-based fault tolerance has been proposed as a low-cost mechanism to …

TSM2: optimizing tall-and-skinny matrix-matrix multiplication on GPUs

J Chen, N Xiong, X Liang, D Tao, S Li… - Proceedings of the …, 2019 - dl.acm.org
Linear algebra operations have been widely used in big data analytics and scientific
computations. Many works have been done on optimizing linear algebra operations on …

Improving performance of iterative methods by lossy checkponting

D Tao, S Di, X Liang, Z Chen, F Cappello - Proceedings of the 27th …, 2018 - dl.acm.org
Iterative methods are commonly used approaches to solve large, sparse linear systems,
which are fundamental operations for many modern scientific simulations. When the large …

Resiliency in numerical algorithm design for extreme scale simulations

E Agullo, M Altenbernd, H Anzt… - … journal of high …, 2022 - journals.sagepub.com
This work is based on the seminar titled 'Resiliency in Numerical Algorithm Design for
Extreme Scale Simulations' held March 1–6, 2020, at Schloss Dagstuhl, that was attended …

Correcting soft errors online in fast Fourier transform

X Liang, J Chen, D Tao, S Li, P Wu, H Li… - Proceedings of the …, 2017 - dl.acm.org
While many algorithm-based fault tolerance (ABFT) schemes have been proposed to detect
soft errors offline in the fast Fourier transform (FFT) after computation finishes, none of the …

TSM2X: High-performance tall-and-skinny matrix–matrix multiplication on GPUs

C Rivera, J Chen, N Xiong, J Zhang, SL Song… - Journal of Parallel and …, 2021 - Elsevier
Linear algebra operations have been widely used in big data analytics and scientific
computations. Many works have been done on optimizing linear algebra operations on …