FT-CNN: Algorithm-based fault tolerance for convolutional neural networks
Convolutional neural networks (CNNs) are becoming more and more important for solving
challenging and critical problems in many fields. CNN inference applications have been …
Tsm2x: High-performance tall-and-skinny matrix–matrix multiplication on GPUs
Linear algebra operations have been widely used in big data analytics and scientific
computations. Many works have been done on optimizing linear algebra operations on …
Accelerating multigrid-based hierarchical scientific data refactoring on GPUs
Rapid growth in scientific data and a widening gap between computational speed and I/O
bandwidth make it increasingly infeasible to store and share all data produced by scientific …
Software approaches for resilience of high performance computing systems: a survey
With the scaling up of high-performance computing systems in recent years, their reliability
has been descending continuously. Therefore, system resilience has been regarded as one …
Efficient soft-error detection for low-precision deep learning recommendation models
Soft error, namely silent corruption of signal or datum in a computer system, cannot be
cavalierly ignored as compute and communication density grow exponentially. Soft error …
Comparative analysis of soft-error sensitivity in LU decomposition algorithms on diverse GPUs
Graphics processing units (GPUs) have become integral to embedded systems and
supercomputing centres due to their large memory, cutting-edge technology and high …
Reliability evaluation of LU decomposition on GPU-accelerated system-on-chip under proton irradiation
Graphic processing units (GPUs) have become a basic accelerator both in high-
performance nodes and low-power system-on-chip (SoC). They provide massive data …
FT-iSort: efficient fault tolerance for introsort
Introspective sorting is a ubiquitous sorting algorithm which underlies many large scale
distributed systems. Hardware-mediated soft errors can result in comparison and memory …
Anomaly detection in scientific datasets using sparse representation
A Moon, M Kim, J Chen, SW Son - Proceedings of the First Workshop on …, 2023 - dl.acm.org
As the size and complexity of high-performance computing (HPC) systems keep growing,
scientists' ability to trust the data produced is paramount due to potential data corruption for …
Improving energy saving of one-sided matrix decompositions on CPU-GPU heterogeneous systems
One-sided dense matrix decompositions (e.g., Cholesky, LU, and QR) are the key
components in scientific computing in many different fields. Although their design has been …