GPU devices for safety-critical systems: A survey
Graphics Processing Unit (GPU) devices and their associated software programming
languages and frameworks can deliver the computing performance required to facilitate the …
A survey of techniques for modeling and improving reliability of computing systems
Recent trends of aggressive technology scaling have greatly exacerbated the occurrences
and impact of faults in computing systems. This has made 'reliability' a first-order design …
Making convolutions resilient via algorithm-based error detection techniques
Convolutional Neural Networks (CNNs) are being increasingly used in safety-critical and
high-performance computing systems. As such systems require high levels of resilience to …
GPU-Qin: A methodology for evaluating the error resilience of GPGPU applications
While graphics processing units (GPUs) have gained wide adoption as accelerators for
general-purpose applications (GPGPU), the end-to-end reliability implications of their use …
Optimizing software-directed instruction replication for GPU error detection
Application execution on safety-critical and high-performance computer systems must be
resilient to transient errors. As GPUs become more pervasive in such systems, they must …
Understanding error propagation in GPGPU applications
GPUs have emerged as general-purpose accelerators in high-performance computing
(HPC) and scientific applications. However, the reliability characteristics of GPU applications …
Real-world design and evaluation of compiler-managed GPU redundant multithreading
Reliability for general purpose processing on the GPU (GPGPU) is becoming a weak link in
the construction of reliable supercomputer systems. Because hardware protection is …
Arithmetic-intensity-guided fault tolerance for neural network inference on GPUs
Neural networks (NNs) are increasingly employed in safety-critical domains and in
environments prone to unreliability (e.g., soft errors), such as on spacecraft. Therefore, it is …
A flexible tensor block coordinate ascent scheme for hypergraph matching
The estimation of correspondences between two images resp. point sets is a core problem
in computer vision. One way to formulate the problem is graph matching leading to the …
Hauberk: Lightweight silent data corruption error detector for GPGPU
High performance and relatively low cost of GPU-based platforms provide an attractive
alternative for general purpose high performance computing (HPC). However, the emerging …