A survey of online failure prediction methods
F Salfner, M Lenk, M Malek - ACM Computing Surveys (CSUR), 2010 - dl.acm.org
With the ever-growing complexity and dynamicity of computer systems, proactive fault
management is an effective approach to enhancing availability. Online failure prediction is …
Fault tolerance in petascale/exascale systems: Current knowledge, challenges and research opportunities
F Cappello - The International Journal of High Performance …, 2009 - journals.sagepub.com
The emergence of petascale systems and the promise of future exascale systems have
reinvigorated the community interest in how to manage failures in such systems and ensure …
What supercomputers say: A study of five system logs
If we hope to automatically detect and diagnose failures in large-scale computer systems,
we must study real deployed systems and the data they generate. Progress has been …
Informed haar-like features improve pedestrian detection
We propose a simple yet effective detector for pedestrian detection. The basic idea is to
incorporate common sense and everyday knowledge into the design of simple and …
Why does the cloud stop computing? lessons from hundreds of service outages
We conducted a cloud outage study (COS) of 32 popular Internet services. We analyzed
1247 headline news and public post-mortem reports that detail 597 unplanned outages that …
A survey of fault tolerance mechanisms and checkpoint/restart implementations for high performance computing systems
In recent years, High Performance Computing (HPC) systems have been shifting
from expensive massively parallel architectures to clusters of commodity PCs to take …
Toward exascale resilience
Over the past few years resilience has become a major issue for high-performance
computing (HPC) systems, in particular in the perspective of large petascale systems and …
Segment-based stereo matching using graph cuts
L Hong, G Chen - Proceedings of the 2004 IEEE Computer …, 2004 - ieeexplore.ieee.org
In this paper we present a new segment-based stereo matching algorithm using graph cuts.
In our approach, the reference image is divided into non-overlapping homogeneous …
Failures in large scale systems: long-term measurement, analysis, and implications
Resilience is one of the key challenges in maintaining high efficiency of future extreme scale
supercomputers. Researchers and system practitioners rely on field-data studies to …
Lessons learned from the analysis of system failures at petascale: The case of blue waters
This paper provides an analysis of failures and their impact for Blue Waters, the Cray hybrid
(CPU/GPU) supercomputer at the University of Illinois at Urbana-Champaign. The analysis …