A large-scale study of failures in high-performance computing systems

B Schroeder, GA Gibson - IEEE transactions on Dependable …, 2009 - ieeexplore.ieee.org
Designing highly dependable systems requires a good understanding of failure
characteristics. Unfortunately, little raw data on failures in large IT installations are publicly …

Understanding disk failure rates: What does an MTTF of 1,000,000 hours mean to you?

B Schroeder, GA Gibson - ACM Transactions on Storage (TOS), 2007 - dl.acm.org
Component failure in large-scale IT installations is becoming an ever-larger problem as the
number of components in a single cluster approaches a million. This article is an extension …

[BOOK][B] Fault tolerance techniques for high-performance computing

J Dongarra, T Herault, Y Robert - 2015 - Springer
This chapter provides an introduction to resilience methods. The emphasis is on
checkpointing, the de-facto standard technique for resilience in High Performance …

BlueGene/L failure analysis and prediction models

Y Liang, Y Zhang, A Sivasubramaniam… - … and Networks (DSN' …, 2006 - ieeexplore.ieee.org
The growing computational and storage needs of several scientific applications mandate the
deployment of extreme-scale parallel machines, such as IBM's BlueGene/L which can …

Failure data analysis of a large-scale heterogeneous server environment

RK Sahoo, MS Squillante… - … and Networks, 2004, 2004 - ieeexplore.ieee.org
The growing complexity of hardware and software mandates the recognition of fault
occurrence in system deployment and management. While there are several techniques to …

Exploring event correlation for failure prediction in coalitions of clusters

S Fu, CZ Xu - Proceedings of the 2007 ACM/IEEE conference on …, 2007 - dl.acm.org
In large-scale networked computing systems, component failures become norms instead of
exceptions. Failure prediction is a crucial technique for self-managing resource burdens …

A realistic evaluation of memory hardware errors and software system susceptibility

X Li, MC Huang, K Shen, L Chu - 2010 USENIX Annual Technical …, 2010 - usenix.org
Memory hardware reliability is an indispensable part of whole-system dependability. This
paper presents the collection of realistic memory hardware error traces (including transient …

An empirical failure-analysis of a large-scale cloud computing environment

P Garraghan, P Townend, J Xu - 2014 IEEE 15th International …, 2014 - ieeexplore.ieee.org
Cloud computing research is in great need of statistical parameters derived from the
analysis of real-world systems. One aspect of this is the failure characteristics of Cloud …

Ensemble of Bayesian predictors and decision trees for proactive failure management in cloud computing systems

Q Guan, Z Zhang, S Fu - J. Commun., 2012 - researchgate.net
In modern cloud computing systems, hundreds and even thousands of cloud servers are
interconnected by multi-layer networks. In such large-scale and complex systems, failures …

Performance implications of failures in large-scale cluster scheduling

Y Zhang, MS Squillante, A Sivasubramaniam… - … Strategies for Parallel …, 2005 - Springer
As we continue to evolve into large-scale parallel systems, many of them employing
hundreds of computing engines to take on mission-critical roles, it is crucial to design those …