What supercomputers say: A study of five system logs
If we hope to automatically detect and diagnose failures in large-scale computer systems,
we must study real deployed systems and the data they generate. Progress has been …
we must study real deployed systems and the data they generate. Progress has been …
Lessons learned from the analysis of system failures at petascale: The case of blue waters
This paper provides an analysis of failures and their impact for Blue Waters, the Cray hybrid
(CPU/GPU) supercomputer at the University of Illinois at Urbana-Champaign. The analysis …
(CPU/GPU) supercomputer at the University of Illinois at Urbana-Champaign. The analysis …
Failures in large scale systems: long-term measurement, analysis, and implications
Resilience is one of the key challenges in maintaining high efficiency of future extreme scale
supercomputers. Researchers and system practitioners rely on field-data studies to …
supercomputers. Researchers and system practitioners rely on field-data studies to …
Bluegene/l failure analysis and prediction models
The growing computational and storage needs of several scientific applications mandate the
deployment of extreme-scale parallel machines, such as IBM's BlueGene/L which can …
deployment of extreme-scale parallel machines, such as IBM's BlueGene/L which can …
What can we learn from four years of data center hardware failures?
Hardware failures have a big impact on the dependability of large-scale data centers. We
present studies on over 290,000 hardware failure reports collected over the past four years …
present studies on over 290,000 hardware failure reports collected over the past four years …
Understanding GPU errors on large-scale HPC systems and the implications for system design and operation
D Tiwari, S Gupta, J Rogers, D Maxwell… - 2015 IEEE 21st …, 2015 - ieeexplore.ieee.org
Increase in graphics hardware performance and improvements in programmability has
enabled GPUs to evolve from a graphics-specific accelerator to a general-purpose …
enabled GPUs to evolve from a graphics-specific accelerator to a general-purpose …
Exploring event correlation for failure prediction in coalitions of clusters
In large-scale networked computing systems, component failures become norms instead of
exceptions. Failure prediction is a crucial technique for self-managing resource burdens …
exceptions. Failure prediction is a crucial technique for self-managing resource burdens …
Log‐based anomaly detection for distributed systems: State of the art, industry experience, and open issues
Distributed systems have been widely used in many safety‐critical areas. Any abnormalities
(eg, service interruption or service quality degradation) could lead to application crashes or …
(eg, service interruption or service quality degradation) could lead to application crashes or …
A large-scale study of soft-errors on GPUs in the field
Parallelism provided by the GPU architecture has enabled domain scientists to simulate
physical phenomena at a much faster rate and finer granularity than what was previously …
physical phenomena at a much faster rate and finer granularity than what was previously …
Reliability lessons learned from gpu experience with the titan supercomputer at oak ridge leadership computing facility
D Tiwari, S Gupta, G Gallarno, J Rogers… - Proceedings of the …, 2015 - dl.acm.org
The high computational capability of graphics processing units (GPUs) is enabling and
driving the scientific discovery process at large-scale. The world's second fastest …
driving the scientific discovery process at large-scale. The world's second fastest …