Counterfactual explanations for multivariate time series

E Ates, B Aksar, VJ Leung… - … conference on applied …, 2021 - ieeexplore.ieee.org
Multivariate time series are used in many science and engineering domains, including
health-care, astronomy, and high-performance computing. A recent trend is to use machine …

Smart predictive maintenance for high-performance computing systems: a literature review

ALCD Lima, VM Aranha, CJL Carvalho… - The Journal of …, 2021 - Springer
Predictive maintenance is an invaluable tool to preserve the health of mission critical assets
while minimizing the operational costs of scheduled intervention. Artificial intelligence …

A Survey on Failure Analysis and Fault Injection in AI Systems

G Yu, G Tan, H Huang, Z Zhang, P Chen… - arXiv preprint arXiv …, 2024 - arxiv.org
The rapid advancement of Artificial Intelligence (AI) has led to its integration into various
areas, especially with Large Language Models (LLMs) significantly enhancing capabilities …

Job characteristics on large-scale systems: long-term analysis, quantification, and implications

T Patel, Z Liu, R Kettimuthu, P Rich… - … conference for high …, 2020 - ieeexplore.ieee.org
HPC workload analysis and resource consumption characteristics are the key to driving
better operation practices, system procurement decisions, and designing effective resource …

AI-enabling workloads on large-scale GPU-accelerated system: Characterization, opportunities, and implications

B Li, R Arora, S Samsi, T Patel… - … Symposium on High …, 2022 - ieeexplore.ieee.org
Production high-performance computing (HPC) systems are adopting and integrating GPUs
into their design to accommodate artificial intelligence (AI), machine learning, and data …

Revealing power, energy and thermal dynamics of a 200PF pre-exascale supercomputer

W Shin, V Oles, AM Karimi, JA Ellis… - Proceedings of the …, 2021 - dl.acm.org
As we approach the exascale computing era, the focused understanding of power
consumption and its overall constraint on HPC architectures and applications are becoming …

SSD failures in the field: symptoms, causes, and prediction models

J Alter, J Xue, A Dimnaku, E Smirni - Proceedings of the International …, 2019 - dl.acm.org
In recent years, solid state drives (SSDs) have become a staple of high-performance data
centers for their speed and energy efficiency. In this work, we study the failure characteristics …

Deep validation: Toward detecting real-world corner cases for deep neural networks

W Wu, H Xu, S Zhong, MR Lyu… - 2019 49th Annual IEEE …, 2019 - ieeexplore.ieee.org
The exceptional performance of deep neural networks (DNNs) encourages their
deployment in safety- and dependability-critical systems. However, DNNs often demonstrate …

Lifespan and failures of SSDs and HDDs: similarities, differences, and prediction models

R Pinciroli, L Yang, J Alter… - IEEE Transactions on …, 2021 - ieeexplore.ieee.org
Data center downtime typically centers around IT equipment failure. Storage devices are the
most frequently failing components in data centers. We present a comparative study of hard …

Cost-aware prediction of uncorrected DRAM errors in the field

I Boixaderas, D Zivanovic, S Moré… - … Conference for High …, 2020 - ieeexplore.ieee.org
This paper presents and evaluates a method to predict DRAM uncorrected errors, a leading
cause of hardware failures in large-scale HPC clusters. The method uses a random forest …