Doomsday: Predicting which node will fail when on supercomputers
Predicting which node will fail and how soon remains a challenge for HPC resilience, yet
may pave the way to exploiting proactive remedies before jobs fail. Not only for increasing …
3-dimensional root cause diagnosis via co-analysis
With the growth of system size and complexity, reliability has become a major concern for
large-scale systems. Upon the occurrence of failure, system administrators typically trace the …
LPV: A Log Parsing Framework Based on Vectorization
Logs are pervasive in modern computing systems, and are valuable to service and system
management. Nevertheless, with the rapidly growing size and complexity of computing …
Exploring void search for fault detection on extreme scale systems
E Berrocal, L Yu, S Wallace… - 2014 IEEE International …, 2014 - ieeexplore.ieee.org
Mean Time Between Failures (MTBF), now calculated in days or hours, is expected to drop
to minutes on exascale machines. The advancement of resilience technologies greatly …
Reliability-aware speedup models for parallel applications with coordinated checkpointing/restart
Speedup models are powerful analytical tools for evaluating and predicting the performance
of parallel applications. Unfortunately, the well-known speedup models like Amdahl's law …
Converting unstructured system logs into structured event list for anomaly detection
System logs provide invaluable resources for understanding system behavior and detecting
anomalies on high performance computing (HPC) systems. As HPC systems continue to …
Self Adjusting Log Observability for Cloud Native Applications
With the increasing complexity of modern applications, particularly those relying on
microservices architectures, the volume of observability data, encompassing logs, metrics …
On preempting advanced persistent threats using probabilistic graphical models
P Cao - arXiv preprint arXiv:1903.08826, 2019 - arxiv.org
This paper presents PULSAR, a framework for pre-empting Advanced Persistent Threats
(APTs). PULSAR employs a probabilistic graphical model (specifically a Factor Graph) to …
Event block identification and analysis for effective anomaly detection to build reliable HPC systems
System logs provide invaluable resources for understanding system behavior and detecting
anomalies on high performance computing (HPC) systems. As HPC systems continue to …
System failure prediction through rare-events elastic-net logistic regression
Predicting failures in a distributed system based on previous events through logistic
regression is a standard approach in literature. This technique is not reliable, though, in two …