GPU devices for safety-critical systems: A survey

J Perez-Cerrolaza, J Abella, L Kosmidis… - ACM Computing …, 2022 - dl.acm.org
Graphics Processing Unit (GPU) devices and their associated software programming
languages and frameworks can deliver the computing performance required to facilitate the …

A survey of techniques for modeling and improving reliability of computing systems

S Mittal, JS Vetter - IEEE Transactions on Parallel and …, 2015 - ieeexplore.ieee.org
Recent trends of aggressive technology scaling have greatly exacerbated the occurrences
and impact of faults in computing systems. This has made 'reliability' a first-order design …

Making convolutions resilient via algorithm-based error detection techniques

SKS Hari, MB Sullivan, T Tsai… - IEEE Transactions on …, 2021 - ieeexplore.ieee.org
Convolutional Neural Networks (CNNs) are being increasingly used in safety-critical and
high-performance computing systems. As such systems require high levels of resilience to …

GPU-Qin: A methodology for evaluating the error resilience of GPGPU applications

B Fang, K Pattabiraman, M Ripeanu… - … Analysis of Systems …, 2014 - ieeexplore.ieee.org
While graphics processing units (GPUs) have gained wide adoption as accelerators for
general-purpose applications (GPGPU), the end-to-end reliability implications of their use …

Optimizing software-directed instruction replication for GPU error detection

A Mahmoud, SKS Hari, MB Sullivan… - … Conference for High …, 2018 - ieeexplore.ieee.org
Application execution on safety-critical and high-performance computer systems must be
resilient to transient errors. As GPUs become more pervasive in such systems, they must …

Understanding error propagation in GPGPU applications

G Li, K Pattabiraman, CY Cher… - SC'16: Proceedings of …, 2016 - ieeexplore.ieee.org
GPUs have emerged as general-purpose accelerators in high-performance computing
(HPC) and scientific applications. However, the reliability characteristics of GPU applications …

Real-world design and evaluation of compiler-managed GPU redundant multithreading

J Wadden, A Lyashevsky, S Gurumurthi… - ACM SIGARCH …, 2014 - dl.acm.org
Reliability for general purpose processing on the GPU (GPGPU) is becoming a weak link in
the construction of reliable supercomputer systems. Because hardware protection is …

Arithmetic-intensity-guided fault tolerance for neural network inference on GPUs

J Kosaian, KV Rashmi - Proceedings of the International Conference for …, 2021 - dl.acm.org
Neural networks (NNs) are increasingly employed in safety-critical domains and in
environments prone to unreliability (e.g., soft errors), such as on spacecraft. Therefore, it is …

A flexible tensor block coordinate ascent scheme for hypergraph matching

Q Nguyen, A Gautier, M Hein - Proceedings of the IEEE …, 2015 - openaccess.thecvf.com
The estimation of correspondences between two images resp. point sets is a core problem
in computer vision. One way to formulate the problem is graph matching leading to the …

Hauberk: Lightweight silent data corruption error detector for GPGPU

KS Yim, C Pham, M Saleheen… - … Parallel & Distributed …, 2011 - ieeexplore.ieee.org
High performance and relatively low cost of GPU-based platforms provide an attractive
alternative for general purpose high performance computing (HPC). However, the emerging …