A survey of online failure prediction methods
F Salfner, M Lenk, M Malek - ACM Computing Surveys (CSUR), 2010 - dl.acm.org
With the ever-growing complexity and dynamicity of computer systems, proactive fault
management is an effective approach to enhancing availability. Online failure prediction is …
Fault tolerance in petascale/exascale systems: Current knowledge, challenges and research opportunities
F Cappello - The International Journal of High Performance …, 2009 - journals.sagepub.com
The emergence of petascale systems and the promise of future exascale systems have
reinvigorated the community interest in how to manage failures in such systems and ensure …
What supercomputers say: A study of five system logs
If we hope to automatically detect and diagnose failures in large-scale computer systems,
we must study real deployed systems and the data they generate. Progress has been …
Informed haar-like features improve pedestrian detection
We propose a simple yet effective detector for pedestrian detection. The basic idea is to
incorporate common sense and everyday knowledge into the design of simple and …
Why does the cloud stop computing? lessons from hundreds of service outages
We conducted a cloud outage study (COS) of 32 popular Internet services. We analyzed
1247 headline news and public post-mortem reports that detail 597 unplanned outages that …
A survey of fault tolerance mechanisms and checkpoint/restart implementations for high performance computing systems
In recent years, High Performance Computing (HPC) systems have been shifting
from expensive massively parallel architectures to clusters of commodity PCs to take …
Toward exascale resilience
Over the past few years resilience has become a major issue for high-performance
computing (HPC) systems, in particular in the perspective of large petascale systems and …
Segment-based stereo matching using graph cuts
L Hong, G Chen - Proceedings of the 2004 IEEE Computer …, 2004 - ieeexplore.ieee.org
In this paper we present a new segment-based stereo matching algorithm using graph cuts.
In our approach, the reference image is divided into non-overlapping homogeneous …
Failures in large scale systems: long-term measurement, analysis, and implications
Resilience is one of the key challenges in maintaining high efficiency of future extreme scale
supercomputers. Researchers and system practitioners rely on field-data studies to …
Lessons learned from the analysis of system failures at petascale: The case of blue waters
This paper provides an analysis of failures and their impact for Blue Waters, the Cray hybrid
(CPU/GPU) supercomputer at the University of Illinois at Urbana-Champaign. The analysis …