- Academic Search

B Schroeder, GA Gibson - IEEE Transactions on Dependable …, 2009 - ieeexplore.ieee.org

Designing highly dependable systems requires a good understanding of failure
characteristics. Unfortunately, little raw data on failures in large IT installations are publicly …

Cited by 1598

Understanding disk failure rates: What does an MTTF of 1,000,000 hours mean to you?

B Schroeder, GA Gibson - ACM Transactions on Storage (TOS), 2007 - dl.acm.org

Component failure in large-scale IT installations is becoming an ever-larger problem as the
number of components in a single cluster approaches a million. This article is an extension …

Cited by 1357

[BOOK] Fault tolerance techniques for high-performance computing

J Dongarra, T Herault, Y Robert - 2015 - Springer

This chapter provides an introduction to resilience methods. The emphasis is on
checkpointing, the de-facto standard technique for resilience in High Performance …

Cited by 273

BlueGene/L failure analysis and prediction models

Y Liang, Y Zhang, A Sivasubramaniam… - … and Networks (DSN' …, 2006 - ieeexplore.ieee.org

The growing computational and storage needs of several scientific applications mandate the
deployment of extreme-scale parallel machines, such as IBM's BlueGene/L which can …

Cited by 407

Failure data analysis of a large-scale heterogeneous server environment

RK Sahoo, MS Squillante… - … and Networks, 2004, 2004 - ieeexplore.ieee.org

The growing complexity of hardware and software mandates the recognition of fault
occurrence in system deployment and management. While there are several techniques to …

Cited by 340

Exploring event correlation for failure prediction in coalitions of clusters

S Fu, CZ Xu - Proceedings of the 2007 ACM/IEEE conference on …, 2007 - dl.acm.org

In large-scale networked computing systems, component failures become norms instead of
exceptions. Failure prediction is a crucial technique for self-managing resource burdens …

Cited by 244

[PDF] A realistic evaluation of memory hardware errors and software system susceptibility

X Li, MC Huang, K Shen, L Chu - 2010 USENIX Annual Technical …, 2010 - usenix.org

Memory hardware reliability is an indispensable part of whole-system dependability. This
paper presents the collection of realistic memory hardware error traces (including transient …

Cited by 180

An empirical failure-analysis of a large-scale cloud computing environment

P Garraghan, P Townend, J Xu - 2014 IEEE 15th International …, 2014 - ieeexplore.ieee.org

Cloud computing research is in great need of statistical parameters derived from the
analysis of real-world systems. One aspect of this is the failure characteristics of Cloud …

Cited by 112

[PDF] Ensemble of Bayesian predictors and decision trees for proactive failure management in cloud computing systems

Q Guan, Z Zhang, S Fu - J. Commun., 2012 - researchgate.net

In modern cloud computing systems, hundreds and even thousands of cloud servers are
interconnected by multi-layer networks. In such large-scale and complex systems, failures …

Cited by 119

Performance implications of failures in large-scale cluster scheduling

Y Zhang, MS Squillante, A Sivasubramaniam… - … Strategies for Parallel …, 2005 - Springer

As we continue to evolve into large-scale parallel systems, many of them employing
hundreds of computing engines to take on mission-critical roles, it is crucial to design those …

Cited by 180

Improving cluster availability using workstation validation

A large-scale study of failures in high-performance computing systems