Doomsday: Predicting which node will fail when on supercomputers

A Das, F Mueller, P Hargrove… - … Conference for High …, 2018 - ieeexplore.ieee.org
Predicting which node will fail and how soon remains a challenge for HPC resilience, yet
may pave the way to exploiting proactive remedies before jobs fail. Not only for increasing …

3-dimensional root cause diagnosis via co-analysis

Z Zheng, L Yu, Z Lan, T Jones - … of the 9th international conference on …, 2012 - dl.acm.org
With the growth of system size and complexity, reliability has become a major concern for
large-scale systems. Upon the occurrence of failure, system administrators typically trace the …

LPV: A Log Parsing Framework Based on Vectorization

T Xiao, Z Quan, ZJ Wang, K Zhao, X Liao… - … on Network and …, 2023 - ieeexplore.ieee.org
Logs are pervasive in modern computing systems, and are valuable to service and system
management. Nevertheless, with the rapidly growing size and complexity of computing …

Exploring void search for fault detection on extreme scale systems

E Berrocal, L Yu, S Wallace… - 2014 IEEE International …, 2014 - ieeexplore.ieee.org
Mean Time Between Failures (MTBF), now calculated in days or hours, is expected to drop
to minutes on exascale machines. The advancement of resilience technologies greatly …

Reliability-aware speedup models for parallel applications with coordinated checkpointing/restart

Z Zheng, L Yu, Z Lan - IEEE Transactions on Computers, 2014 - ieeexplore.ieee.org
Speedup models are powerful analytical tools for evaluating and predicting the performance
of parallel applications. Unfortunately, the well-known speedup models like Amdahl's law …

Converting unstructured system logs into structured event list for anomaly detection

Z Li, M Davidson, S Fu, S Blanchard… - Proceedings of the 13th …, 2018 - dl.acm.org
System logs provide invaluable resources for understanding system behavior and detecting
anomalies on high performance computing (HPC) systems. As HPC systems continue to …

Self Adjusting Log Observability for Cloud Native Applications

D Pathak, M Verma, A Chakraborty… - 2024 IEEE 17th …, 2024 - ieeexplore.ieee.org
With the increasing complexity of modern applications, particularly those relying on
microservices architectures, the volume of observability data, encompassing logs, metrics …

On preempting advanced persistent threats using probabilistic graphical models

P Cao - arXiv preprint arXiv:1903.08826, 2019 - arxiv.org
This paper presents PULSAR, a framework for pre-empting Advanced Persistent Threats
(APTs). PULSAR employs a probabilistic graphical model (specifically a Factor Graph) to …

Event block identification and analysis for effective anomaly detection to build reliable HPC systems

Z Li, M Davidson, S Fu, S Blanchard… - 2018 IEEE 20th …, 2018 - ieeexplore.ieee.org
System logs provide invaluable resources for understanding system behavior and detecting
anomalies on high performance computing (HPC) systems. As HPC systems continue to …

System failure prediction through rare-events elastic-net logistic regression

JM Navarro, GHA Parada… - 2014 2nd International …, 2014 - ieeexplore.ieee.org
Predicting failures in a distributed system based on previous events through logistic
regression is a standard approach in literature. This technique is not reliable, though, in two …