GPU devices for safety-critical systems: A survey
Graphics Processing Unit (GPU) devices and their associated software programming
languages and frameworks can deliver the computing performance required to facilitate the …
Making convolutions resilient via algorithm-based error detection techniques
Convolutional Neural Networks (CNNs) are being increasingly used in safety-critical and
high-performance computing systems. As such systems require high levels of resilience to …
GreenMM: energy efficient GPU matrix multiplication through undervolting
The current trend of ever-increasing performance in scientific applications comes with
tremendous growth in energy consumption. In this paper, we present GreenMM framework …
Enabling software resilience in GPGPU applications via partial thread protection
Graphics Processing Units (GPUs) are widely used by various applications in a broad
variety of fields to accelerate their computation but remain susceptible to transient hardware …
TSM2: optimizing tall-and-skinny matrix-matrix multiplication on GPUs
Linear algebra operations have been widely used in big data analytics and scientific
computations. Many works have been done on optimizing linear algebra operations on …
TSM2X: high-performance tall-and-skinny matrix–matrix multiplication on GPUs
Linear algebra operations have been widely used in big data analytics and scientific
computations. Many works have been done on optimizing linear algebra operations on …
Fault tolerant one-sided matrix decompositions on heterogeneous systems with GPUs
Current algorithm-based fault tolerance (ABFT) approaches for one-sided matrix
decomposition on heterogeneous systems with GPUs have the following limitations: (1) they do …
Accelerating multigrid-based hierarchical scientific data refactoring on GPUs
Rapid growth in scientific data and a widening gap between computational speed and I/O
bandwidth make it increasingly infeasible to store and share all data produced by scientific …
Software approaches for resilience of high performance computing systems: a survey
With the scaling up of high-performance computing systems in recent years, their reliability
has been descending continuously. Therefore, system resilience has been regarded as one …
FT-iSort: efficient fault tolerance for introsort
Introspective sorting is a ubiquitous sorting algorithm which underlies many large scale
distributed systems. Hardware-mediated soft errors can result in comparison and memory …