Failure prediction for HPC systems and applications: Current situation and open issues

A Gainaru, F Cappello, M Snir… - … International journal of …, 2013 - journals.sagepub.com
As large-scale systems evolve towards post-petascale computing, it is crucial to focus on
providing fault-tolerance strategies that aim to minimize fault's effects on applications. By far …

Environmental performance analysis of solid freeform fabrication processes

Y Luo, Z Ji, MC Leu, R Caudill - Proceedings of the 1999 IEEE …, 1999 - ieeexplore.ieee.org
This paper presents a method for analyzing the environmental performance of solid freeform
fabrication (SFF) processes. In this method, each process is divided into life phases …

Failure prediction by utilizing log analysis: A systematic mapping study

D Das, M Schiewe, E Brighton, M Fuller… - Proceedings of the …, 2020 - dl.acm.org
In modern computing, log files provide a wealth of information regarding the past of a
system, including the system failures and security breaches that cost companies and …

Fault prediction under the microscope: A closer look into HPC systems

A Gainaru, F Cappello, M Snir… - SC'12: Proceedings of …, 2012 - ieeexplore.ieee.org
A large percentage of computing capacity in today's large high-performance computing
systems is wasted because of failures. Consequently current research is focusing on …

Toward automated anomaly identification in large-scale systems

Z Lan, Z Zheng, Y Li - IEEE Transactions on Parallel and …, 2009 - ieeexplore.ieee.org
When a system fails to function properly, health-related data are collected for
troubleshooting. However, it is challenging to effectively identify anomalies from the …

Ensemble of Bayesian predictors and decision trees for proactive failure management in cloud computing systems

Q Guan, Z Zhang, S Fu - J. Commun., 2012 - researchgate.net
In modern cloud computing systems, hundreds and even thousands of cloud servers are
interconnected by multi-layer networks. In such large-scale and complex systems, failures …

Logmaster: Mining event correlations in logs of large-scale cluster systems

X Fu, R Ren, J Zhan, W Zhou, Z Jia… - 2012 IEEE 31st …, 2012 - ieeexplore.ieee.org
This paper presents a set of innovative algorithms and a system, named Log Master, for
mining correlations of events that have multiple attributions, i.e., node ID, application ID, event …

Fault-aware, utility-based job scheduling on Blue Gene/P systems

W Tang, Z Lan, N Desai… - 2009 IEEE International …, 2009 - ieeexplore.ieee.org
Job scheduling on large-scale systems is an increasingly complicated affair, with numerous
factors influencing scheduling policy. Addressing these concerns results in sophisticated …

Mining frequent itemsets in a stream

T Calders, N Dexters, JJM Gillis, B Goethals - Information Systems, 2014 - Elsevier
Mining frequent itemsets in a datastream proves to be a difficult problem, as itemsets arrive
in rapid succession and storing parts of the stream is typically impossible. Nonetheless, it …

Taming of the shrew: Modeling the normal and faulty behaviour of large-scale HPC systems

A Gainaru, F Cappello, W Kramer - 2012 IEEE 26th …, 2012 - ieeexplore.ieee.org
HPC systems are complex machines that generate a huge volume of system state data
called "events". Events are generated without following a general consistent rule and …