Why does the cloud stop computing? lessons from hundreds of service outages
We conducted a cloud outage study (COS) of 32 popular Internet services. We analyzed
1247 headline news and public post-mortem reports that detail 597 unplanned outages that …
1247 headline news and public post-mortem reports that detail 597 unplanned outages that …
Making disk failure predictions {SMARTer}!
Disk drives are one of the most commonly replaced hardware components and continue to
pose challenges for accurate failure prediction. In this work, we present analysis and …
pose challenges for accurate failure prediction. In this work, we present analysis and …
What can we learn from four years of data center hardware failures?
Hardware failures have a big impact on the dependability of large-scale data centers. We
present studies on over 290,000 hardware failure reports collected over the past four years …
present studies on over 290,000 hardware failure reports collected over the past four years …
Research on the design of analytical communication and information model for teaching resources with cloud‐sharing platform
W Zheng, BA Muthu, SN Kadry - Computer Applications in …, 2021 - Wiley Online Library
The recent growth in the adoption by several organizations of cloud computing services has
created the challenge to determine their performance of information model. The ability of …
created the challenge to determine their performance of information model. The ability of …
An analysis of {Network-Partitioning} failures in cloud systems
We present a comprehensive study of 136 system failures attributed to network-partitioning
faults from 25 widely used distributed systems. We found that the majority of the failures led …
faults from 25 widely used distributed systems. We found that the majority of the failures led …
Cost effective, reliable and secure workflow deployment over federated clouds
The significant growth in cloud computing has led to increasing number of cloud providers,
each offering their service under different conditions-one might be more secure whilst …
each offering their service under different conditions-one might be more secure whilst …
Failure analysis of jobs in compute clouds: A google cluster case study
X Chen, CD Lu, K Pattabiraman - 2014 IEEE 25th International …, 2014 - ieeexplore.ieee.org
In this paper, we analyze a workload trace from the Google cloud cluster and characterize
the observed failures. The goal of our work is to improve the understanding of failures in …
the observed failures. The goal of our work is to improve the understanding of failures in …
Lessons and actions: What we learned from 10k {SSD-Related} storage system failures
Modern datacenters increasingly use flash-based solid state drives (SSDs) for high
performance and low energy cost. However, SSD introduces more complex failure modes …
performance and low energy cost. However, SSD introduces more complex failure modes …
Adaptive fault tolerant resource allocation scheme for cloud computing environments
Cloud computing is an optimistic technology that leverages the computing resources to offer
globally better and more efficient services than the collection of individual use of internet …
globally better and more efficient services than the collection of individual use of internet …
Failure prediction using machine learning in a virtualised HPC system and application
Failure is an increasingly important issue in high performance computing and cloud
systems. As large-scale systems continue to grow in scale and complexity, mitigating the …
systems. As large-scale systems continue to grow in scale and complexity, mitigating the …