RUAD: Unsupervised anomaly detection in HPC systems
The increasing complexity of modern high-performance computing (HPC) systems
necessitates the introduction of automated and data-driven methodologies to support system …
Prodigy: Towards unsupervised anomaly detection in production hpc systems
Performance variations caused by anomalies in modern High Performance Computing
(HPC) systems lead to decreased efficiency, impaired application performance, and …
Harnessing federated learning for anomaly detection in supercomputer nodes
High-performance computing (HPC) systems are a crucial component of modern society,
with a significant impact in areas ranging from economics to scientific research, thanks to …
A federated learning approach for anomaly detection in high performance computing
High Performance Computing (HPC) systems are complex machines that need to be
operated at their maximum potential to recoup their investment cost and to mitigate their …
Albadross: Active learning based anomaly diagnosis for production hpc systems
Diagnosing causes of performance variations in High-Performance Computing (HPC)
systems is a daunting challenge due to the systems' scale and complexity. Variations in …
Runtime Performance Anomaly Diagnosis in Production HPC Systems Using Active Learning
With the increasing scale and complexity of High-Performance Computing (HPC) systems,
performance variations in applications caused by anomalies have become significant …
Towards anomaly detection for monitoring power consumption in HPC facilities
N Sukhija, E Bautista, D Butz, C Whitney - Proceedings of the 14th …, 2022 - dl.acm.org
Given the increasing complexity and the heterogeneity of today's computing system
infrastructure, power efficiency and fault tolerance remain the top challenges of an High …
Towards Practical Machine Learning Frameworks for Performance Diagnostics in Supercomputers
Supercomputers are highly sophisticated computing systems designed to handle complex
and computationally intensive tasks. Despite their tremendous efficiency, performance …
LSTM-Based Unsupervised Anomaly Detection in High-Performance Computing: A Federated Learning Approach
High-Performance Computing (HPC) systems are intricate machines that must be run at
maximum efficiency to justify their high cost and to minimize environmental impact. Any …
Exploring the Utility of Graph Methods in HPC Thermal Modeling
This work critically examines several approaches to temperature prediction for High-
Performance Computing (HPC) systems, focusing on component-level and holistic models …