FT-CNN: Algorithm-based fault tolerance for convolutional neural networks

K Zhao, S Di, S Li, X Liang, Y Zhai… - … on Parallel and …, 2020 - ieeexplore.ieee.org
Convolutional neural networks (CNNs) are becoming more and more important for solving
challenging and critical problems in many fields. CNN inference applications have been …

TSM2X: High-performance tall-and-skinny matrix–matrix multiplication on GPUs

C Rivera, J Chen, N Xiong, J Zhang, SL Song… - Journal of Parallel and …, 2021 - Elsevier
Linear algebra operations have been widely used in big data analytics and scientific
computations. Many works have been done on optimizing linear algebra operations on …

Accelerating multigrid-based hierarchical scientific data refactoring on GPUs

J Chen, L Wan, X Liang, B Whitney… - 2021 IEEE …, 2021 - ieeexplore.ieee.org
Rapid growth in scientific data and a widening gap between computational speed and I/O
bandwidth make it increasingly infeasible to store and share all data produced by scientific …

Software approaches for resilience of high performance computing systems: a survey

J Jia, Y Liu, G Zhang, Y Gao, D Qian - Frontiers of Computer Science, 2023 - Springer
With the scaling up of high-performance computing systems in recent years, their reliability
has been descending continuously. Therefore, system resilience has been regarded as one …

Efficient soft-error detection for low-precision deep learning recommendation models

S Li, J Huang, PTP Tang, D Khudia… - … Conference on Big …, 2022 - ieeexplore.ieee.org
Soft error, namely silent corruption of signal or datum in a computer system, cannot be
cavalierly ignored as compute and communication density grow exponentially. Soft error …

Comparative analysis of soft-error sensitivity in LU decomposition algorithms on diverse GPUs

G Leon, JM Badia, JA Belloch, A Lindoso… - The Journal of …, 2024 - Springer
Graphics processing units (GPUs) have become integral to embedded systems and
supercomputing centres due to their large memory, cutting-edge technology and high …

Reliability evaluation of LU decomposition on GPU-accelerated system-on-chip under proton irradiation

JM Badia, G Leon, JA Belloch… - … on Nuclear Science, 2022 - ieeexplore.ieee.org
Graphic processing units (GPUs) have become a basic accelerator both in high-
performance nodes and low-power system-on-chip (SoC). They provide massive data …

FT-iSort: efficient fault tolerance for introsort

S Li, H Li, X Liang, J Chen, E Giem, K Ouyang… - Proceedings of the …, 2019 - dl.acm.org
Introspective sorting is a ubiquitous sorting algorithm which underlies many large scale
distributed systems. Hardware-mediated soft errors can result in comparison and memory …

Anomaly detection in scientific datasets using sparse representation

A Moon, M Kim, J Chen, SW Son - Proceedings of the First Workshop on …, 2023 - dl.acm.org
As the size and complexity of high-performance computing (HPC) systems keep growing,
scientists' ability to trust the data produced is paramount due to potential data corruption for …

Improving energy saving of one-sided matrix decompositions on CPU-GPU heterogeneous systems

J Chen, X Liang, K Zhao, HZ Sabzi, L Bhuyan… - Proceedings of the 28th …, 2023 - dl.acm.org
One-sided dense matrix decompositions (e.g., Cholesky, LU, and QR) are the key
components in scientific computing in many different fields. Although their design has been …