Toward exascale resilience: 2014 update

F Cappello, A Geist, W Gropp, S Kale… - Supercomputing …, 2014 - superfri.susu.ru
Resilience is a major roadblock for HPC executions on future exascale systems. These
systems will typically gather millions of CPU cores running up to a billion threads …

IT infrastructure anomaly detection and failure handling: A systematic literature review focusing on datasets, log preprocessing, machine & deep learning approaches …

DA Bhanage, AV Pawar, K Kotecha - IEEE Access, 2021 - ieeexplore.ieee.org
Nowadays, reliability assurance is crucial in components of IT infrastructures. Unavailability
of any element or connection results in downtime and triggers monetary and performance …

Desh: deep learning for system health prediction of lead times to failure in HPC

A Das, F Mueller, C Siegel, A Vishnu - Proceedings of the 27th …, 2018 - dl.acm.org
Today's large-scale supercomputers encounter faults on a daily basis. Exascale systems are
likely to experience even higher fault rates due to increased component count and density …

Doomsday: Predicting which node will fail when on supercomputers

A Das, F Mueller, P Hargrove… - … Conference for High …, 2018 - ieeexplore.ieee.org
Predicting which node will fail and how soon remains a challenge for HPC resilience, yet
may pave the way to exploiting proactive remedies before jobs fail. Not only for increasing …

Aarohi: Making real-time node failure prediction feasible

A Das, F Mueller, B Rountree - 2020 IEEE International Parallel …, 2020 - ieeexplore.ieee.org
Large-scale production systems are well known to encounter node failures, which affect
compute capacity and energy. Both in HPC systems and enterprise data centers, combating …

Exploit both SMART Attributes and NAND Flash Wear Characteristics to Effectively Forecast SSD-based Storage Failures in Clusters

Y Gu, C Wu, X He - … USENIX Annual Technical Conference (USENIX ATC …, 2024 - usenix.org
Solid State Drives (SSDs) based on flash technology are extensively employed as
high-performance storage solutions in supercomputing data centers. However, SSD failures are …

Time machine: Generative real-time model for failure (and lead time) prediction in HPC systems

KA Alharthi, A Jhumka, S Di, L Gui… - 2023 53rd Annual …, 2023 - ieeexplore.ieee.org
High Performance Computing (HPC) systems generate a large amount of
unstructured/alphanumeric log messages that capture the health state of their components. Due to their …

[Retracted] Classification and Prediction of Software Incidents Using Machine Learning Techniques

S Ali, M Adeel, S Johar, M Zeeshan… - Security and …, 2021 - Wiley Online Library
An incident, in the perception of information technology, is an event that is not part of a
normal process and disrupts operational procedure. This research work particularly focuses …

Clairvoyant: a log-based transformer-decoder for failure prediction in large-scale systems

KA Alharthi, A Jhumka, S Di, F Cappello - Proceedings of the 36th ACM …, 2022 - dl.acm.org
System failures are expected to be frequent in the exascale era such as current Petascale
systems. The health of such systems is usually determined from challenging analysis of …

Workload analysis of Blue Waters

MD Jones, JP White, M Innus, RL DeLeon… - arXiv preprint arXiv …, 2017 - arxiv.org
Blue Waters is a Petascale-level supercomputer whose mission is to enable the national
scientific and research community to solve "grand challenge" problems that are orders of …