A large-scale study of failures in high-performance computing systems
Designing highly dependable systems requires a good understanding of failure
characteristics. Unfortunately, little raw data on failures in large IT installations are publicly …
Understanding disk failure rates: What does an MTTF of 1,000,000 hours mean to you?
Component failure in large-scale IT installations is becoming an ever-larger problem as the
number of components in a single cluster approaches a million. This article is an extension …
[BOOK][B] Fault tolerance techniques for high-performance computing
This chapter provides an introduction to resilience methods. The emphasis is on
checkpointing, the de-facto standard technique for resilience in High Performance …
BlueGene/L failure analysis and prediction models
The growing computational and storage needs of several scientific applications mandate the
deployment of extreme-scale parallel machines, such as IBM's BlueGene/L which can …
Failure data analysis of a large-scale heterogeneous server environment
The growing complexity of hardware and software mandates the recognition of fault
occurrence in system deployment and management. While there are several techniques to …
Exploring event correlation for failure prediction in coalitions of clusters
In large-scale networked computing systems, component failures become norms instead of
exceptions. Failure prediction is a crucial technique for self-managing resource burdens …
[PDF][PDF] A realistic evaluation of memory hardware errors and software system susceptibility
Memory hardware reliability is an indispensable part of whole-system dependability. This
paper presents the collection of realistic memory hardware error traces (including transient …
An empirical failure-analysis of a large-scale cloud computing environment
Cloud computing research is in great need of statistical parameters derived from the
analysis of real-world systems. One aspect of this is the failure characteristics of Cloud …
[PDF][PDF] Ensemble of Bayesian predictors and decision trees for proactive failure management in cloud computing systems.
In modern cloud computing systems, hundreds and even thousands of cloud servers are
interconnected by multi-layer networks. In such large-scale and complex systems, failures …
Performance implications of failures in large-scale cluster scheduling
As we continue to evolve into large-scale parallel systems, many of them employing
hundreds of computing engines to take on mission-critical roles, it is crucial to design those …