RUAD: Unsupervised anomaly detection in HPC systems

M Molan, A Borghesi, D Cesarini, L Benini… - Future Generation …, 2023 - Elsevier
The increasing complexity of modern high-performance computing (HPC) systems
necessitates the introduction of automated and data-driven methodologies to support system …

Prodigy: Towards unsupervised anomaly detection in production HPC systems

B Aksar, E Sencan, B Schwaller, O Aaziz… - Proceedings of the …, 2023 - dl.acm.org
Performance variations caused by anomalies in modern High Performance Computing
(HPC) systems lead to decreased efficiency, impaired application performance, and …

Harnessing federated learning for anomaly detection in supercomputer nodes

E Farooq, M Milano, A Borghesi - Future Generation Computer Systems, 2024 - Elsevier
High-performance computing (HPC) systems are a crucial component of modern society,
with a significant impact in areas ranging from economics to scientific research, thanks to …

A federated learning approach for anomaly detection in high performance computing

E Farooq, A Borghesi - 2023 IEEE 35th International …, 2023 - ieeexplore.ieee.org
High Performance Computing (HPC) systems are complex machines that need to be
operated at their maximum potential to recoup their investment cost and to mitigate their …

Albadross: Active learning based anomaly diagnosis for production HPC systems

B Aksar, E Sencan, B Schwaller, O Aaziz… - 2022 IEEE …, 2022 - ieeexplore.ieee.org
Diagnosing causes of performance variations in High-Performance Computing (HPC)
systems is a daunting challenge due to the systems' scale and complexity. Variations in …

Runtime Performance Anomaly Diagnosis in Production HPC Systems Using Active Learning

B Aksar, E Sencan, B Schwaller, O Aaziz… - … on Parallel and …, 2024 - ieeexplore.ieee.org
With the increasing scale and complexity of High-Performance Computing (HPC) systems,
performance variations in applications caused by anomalies have become significant …

Towards anomaly detection for monitoring power consumption in HPC facilities

N Sukhija, E Bautista, D Butz, C Whitney - Proceedings of the 14th …, 2022 - dl.acm.org
Given the increasing complexity and the heterogeneity of today's computing system
infrastructure, power efficiency and fault tolerance remain the top challenges of a High …

Towards Practical Machine Learning Frameworks for Performance Diagnostics in Supercomputers

B Aksar, E Sencan, B Schwaller, VJ Leung… - Proceedings of the First …, 2023 - dl.acm.org
Supercomputers are highly sophisticated computing systems designed to handle complex
and computationally intensive tasks. Despite their tremendous efficiency, performance …

LSTM-Based Unsupervised Anomaly Detection in High-Performance Computing: A Federated Learning Approach

E Farooq, A Borghesi - 2024 IEEE International Conference on …, 2024 - ieeexplore.ieee.org
High-Performance Computing (HPC) systems are intricate machines that must be run at
maximum efficiency to justify their high cost and to minimize environmental impact. Any …

Exploring the Utility of Graph Methods in HPC Thermal Modeling

B Guindani, M Molan, A Bartolini, L Benini - Companion of the 15th ACM …, 2024 - dl.acm.org
This work critically examines several approaches to temperature prediction for
High-Performance Computing (HPC) systems, focusing on component-level and holistic models …