GPU devices for safety-critical systems: A survey

J Perez-Cerrolaza, J Abella, L Kosmidis… - ACM Computing …, 2022 - dl.acm.org
Graphics Processing Unit (GPU) devices and their associated software programming
languages and frameworks can deliver the computing performance required to facilitate the …

Making convolutions resilient via algorithm-based error detection techniques

SKS Hari, MB Sullivan, T Tsai… - IEEE Transactions on …, 2021 - ieeexplore.ieee.org
Convolutional Neural Networks (CNNs) are being increasingly used in safety-critical and
high-performance computing systems. As such systems require high levels of resilience to …

GreenMM: energy efficient GPU matrix multiplication through undervolting

H Zamani, Y Liu, D Tripathy, L Bhuyan… - Proceedings of the ACM …, 2019 - dl.acm.org
The current trend of ever-increasing performance in scientific applications comes with
tremendous growth in energy consumption. In this paper, we present GreenMM framework …

Enabling software resilience in GPGPU applications via partial thread protection

L Yang, B Nie, A Jog, E Smirni - 2021 IEEE/ACM 43rd …, 2021 - ieeexplore.ieee.org
Graphics Processing Units (GPUs) are widely used by various applications in a broad
variety of fields to accelerate their computation but remain susceptible to transient hardware …

TSM2: optimizing tall-and-skinny matrix-matrix multiplication on GPUs

J Chen, N Xiong, X Liang, D Tao, S Li… - Proceedings of the …, 2019 - dl.acm.org
Linear algebra operations have been widely used in big data analytics and scientific
computations. Many works have been done on optimizing linear algebra operations on …

TSM2X: High-performance tall-and-skinny matrix–matrix multiplication on GPUs

C Rivera, J Chen, N Xiong, J Zhang, SL Song… - Journal of Parallel and …, 2021 - Elsevier
Linear algebra operations have been widely used in big data analytics and scientific
computations. Many works have been done on optimizing linear algebra operations on …

Fault tolerant one-sided matrix decompositions on heterogeneous systems with GPUs

J Chen, H Li, S Li, X Liang, P Wu, D Tao… - … Conference for High …, 2018 - ieeexplore.ieee.org
Current algorithm-based fault tolerance (ABFT) approach for one-sided matrix
decomposition on heterogeneous systems with GPUs have following limitations: (1) they do …

Accelerating multigrid-based hierarchical scientific data refactoring on GPUs

J Chen, L Wan, X Liang, B Whitney… - 2021 IEEE …, 2021 - ieeexplore.ieee.org
Rapid growth in scientific data and a widening gap between computational speed and I/O
bandwidth make it increasingly infeasible to store and share all data produced by scientific …

Software approaches for resilience of high performance computing systems: a survey

J Jia, Y Liu, G Zhang, Y Gao, D Qian - Frontiers of Computer Science, 2023 - Springer
With the scaling up of high-performance computing systems in recent years, their reliability
has been descending continuously. Therefore, system resilience has been regarded as one …

FT-iSort: efficient fault tolerance for introsort

S Li, H Li, X Liang, J Chen, E Giem, K Ouyang… - Proceedings of the …, 2019 - dl.acm.org
Introspective sorting is a ubiquitous sorting algorithm which underlies many large scale
distributed systems. Hardware-mediated soft errors can result in comparison and memory …