Counterfactual explanations for multivariate time series
Multivariate time series are used in many science and engineering domains, including
health-care, astronomy, and high-performance computing. A recent trend is to use machine …
Smart predictive maintenance for high-performance computing systems: a literature review
Predictive maintenance is an invaluable tool to preserve the health of mission critical assets
while minimizing the operational costs of scheduled intervention. Artificial intelligence …
A Survey on Failure Analysis and Fault Injection in AI Systems
The rapid advancement of Artificial Intelligence (AI) has led to its integration into various
areas, especially with Large Language Models (LLMs) significantly enhancing capabilities …
Job characteristics on large-scale systems: long-term analysis, quantification, and implications
HPC workload analysis and resource consumption characteristics are the key to driving
better operation practices, system procurement decisions, and designing effective resource …
AI-enabling workloads on large-scale GPU-accelerated system: Characterization, opportunities, and implications
Production high-performance computing (HPC) systems are adopting and integrating GPUs
into their design to accommodate artificial intelligence (AI), machine learning, and data …
Revealing power, energy and thermal dynamics of a 200PF pre-exascale supercomputer
As we approach the exascale computing era, the focused understanding of power
consumption and its overall constraint on HPC architectures and applications are becoming …
SSD failures in the field: symptoms, causes, and prediction models
In recent years, solid state drives (SSDs) have become a staple of high-performance data
centers for their speed and energy efficiency. In this work, we study the failure characteristics …
Deep validation: Toward detecting real-world corner cases for deep neural networks
The exceptional performance of deep neural networks (DNNs) encourages their
deployment in safety- and dependability-critical systems. However, DNNs often demonstrate …
Lifespan and failures of SSDs and HDDs: similarities, differences, and prediction models
Data center downtime typically centers around IT equipment failure. Storage devices are the
most frequently failing components in data centers. We present a comparative study of hard …
Cost-aware prediction of uncorrected DRAM errors in the field
This paper presents and evaluates a method to predict DRAM uncorrected errors, a leading
cause of hardware failures in large-scale HPC clusters. The method uses a random forest …