Failure prediction for HPC systems and applications: Current situation and open issues
As large-scale systems evolve towards post-petascale computing, it is crucial to focus on
providing fault-tolerance strategies that aim to minimize fault's effects on applications. By far …
Environmental performance analysis of solid freeform fabrication processes
This paper presents a method for analyzing the environmental performance of solid freeform
fabrication (SFF) processes. In this method, each process is divided into life phases …
Failure prediction by utilizing log analysis: A systematic mapping study
D Das, M Schiewe, E Brighton, M Fuller… - Proceedings of the …, 2020 - dl.acm.org
In modern computing, log files provide a wealth of information regarding the past of a
system, including the system failures and security breaches that cost companies and …
Fault prediction under the microscope: A closer look into HPC systems
A large percentage of computing capacity in today's large high-performance computing
systems is wasted because of failures. Consequently current research is focusing on …
Toward automated anomaly identification in large-scale systems
When a system fails to function properly, health-related data are collected for
troubleshooting. However, it is challenging to effectively identify anomalies from the …
Ensemble of Bayesian predictors and decision trees for proactive failure management in cloud computing systems
In modern cloud computing systems, hundreds and even thousands of cloud servers are
interconnected by multi-layer networks. In such large-scale and complex systems, failures …
Logmaster: Mining event correlations in logs of large-scale cluster systems
This paper presents a set of innovative algorithms and a system, named Log Master, for
mining correlations of events that have multiple attributions, i.e., node ID, application ID, event …
Fault-aware, utility-based job scheduling on Blue Gene/P systems
Job scheduling on large-scale systems is an increasingly complicated affair, with numerous
factors influencing scheduling policy. Addressing these concerns results in sophisticated …
Mining frequent itemsets in a stream
Mining frequent itemsets in a data stream proves to be a difficult problem, as itemsets arrive
in rapid succession and storing parts of the stream is typically impossible. Nonetheless, it …
Taming of the shrew: Modeling the normal and faulty behaviour of large-scale HPC systems
HPC systems are complex machines that generate a huge volume of system state data
called "events". Events are generated without following a general consistent rule and …