GPU devices for safety-critical systems: A survey
Graphics Processing Unit (GPU) devices and their associated software programming
languages and frameworks can deliver the computing performance required to facilitate the …
Survey on redundancy based-fault tolerance methods for processors and hardware accelerators-trends in quantum computing, heterogeneous systems and reliability
S Venkatesha, R Parthasarathi - ACM Computing Surveys, 2024 - dl.acm.org
Rapid progress in CMOS technology since the late 1990s has increased the vulnerability of
processors toward faults. Subsequently, the focus of computer architects has shifted toward …
Analyzing and increasing the reliability of convolutional neural networks on GPUs
Graphics processing units (GPUs) are playing a critical role in convolutional neural networks
(CNNs) for image detection. As GPU-enabled CNNs move into safety-critical environments …
Memory errors in modern systems: The good, the bad, and the ugly
Several recent publications have shown that hardware faults in the memory subsystem are
commonplace. These faults are predicted to become more frequent in future systems that …
SASSIFI: An architecture-level fault injection tool for GPU application resilience evaluation
As GPUs become more pervasive in both scalable high-performance computing systems
and safety-critical embedded systems, evaluating and analyzing their resilience to soft errors …
Making convolutions resilient via algorithm-based error detection techniques
Convolutional Neural Networks (CNNs) are being increasingly used in safety-critical and
high-performance computing systems. As such systems require high levels of resilience to …
Achieving exascale capabilities through heterogeneous computing
This article provides an overview of AMD's vision for exascale computing, and in particular,
how heterogeneity will play a central role in realizing this vision. Exascale computing …
Design and Analysis of an APU for Exascale Computing
The challenges to push computing to exaflop levels are difficult given desired targets for
memory capacity, memory bandwidth, power efficiency, reliability, and cost. This paper …
Optimizing software-directed instruction replication for GPU error detection
Application execution on safety-critical and high-performance computer systems must be
resilient to transient errors. As GPUs become more pervasive in such systems, they must …
Understanding the Effects of Permanent Faults in GPU's Parallelism Management and Control Units
Modern Graphics Processing Units (GPUs) demand life expectancy extended to many years,
exposing the hardware to aging (i.e., permanent faults arising after the end-of-manufacturing …