Why does the cloud stop computing? Lessons from hundreds of service outages

HS Gunawi, M Hao, RO Suminto, A Laksono… - Proceedings of the …, 2016 - dl.acm.org
We conducted a cloud outage study (COS) of 32 popular Internet services. We analyzed
1247 headline news and public post-mortem reports that detail 597 unplanned outages that …

Making disk failure predictions SMARTer!

S Lu, B Luo, T Patel, Y Yao, D Tiwari… - 18th USENIX Conference …, 2020 - usenix.org
Disk drives are one of the most commonly replaced hardware components and continue to
pose challenges for accurate failure prediction. In this work, we present analysis and …

What can we learn from four years of data center hardware failures?

G Wang, L Zhang, W Xu - 2017 47th Annual IEEE/IFIP …, 2017 - ieeexplore.ieee.org
Hardware failures have a big impact on the dependability of large-scale data centers. We
present studies on over 290,000 hardware failure reports collected over the past four years …

Research on the design of analytical communication and information model for teaching resources with cloud‐sharing platform

W Zheng, BA Muthu, SN Kadry - Computer Applications in …, 2021 - Wiley Online Library
The recent growth in the adoption by several organizations of cloud computing services has
created the challenge to determine their performance of information model. The ability of …

An analysis of Network-Partitioning failures in cloud systems

A Alquraan, H Takruri, M Alfatafta… - 13th USENIX Symposium …, 2018 - usenix.org
We present a comprehensive study of 136 system failures attributed to network-partitioning
faults from 25 widely used distributed systems. We found that the majority of the failures led …

Cost effective, reliable and secure workflow deployment over federated clouds

Z Wen, J Cała, P Watson… - IEEE Transactions on …, 2016 - ieeexplore.ieee.org
The significant growth in cloud computing has led to increasing number of cloud providers,
each offering their service under different conditions-one might be more secure whilst …

Failure analysis of jobs in compute clouds: A Google cluster case study

X Chen, CD Lu, K Pattabiraman - 2014 IEEE 25th International …, 2014 - ieeexplore.ieee.org
In this paper, we analyze a workload trace from the Google cloud cluster and characterize
the observed failures. The goal of our work is to improve the understanding of failures in …

Lessons and actions: What we learned from 10k SSD-Related storage system failures

E Xu, M Zheng, F Qin, Y Xu, J Wu - 2019 USENIX Annual Technical …, 2019 - usenix.org
Modern datacenters increasingly use flash-based solid state drives (SSDs) for high
performance and low energy cost. However, SSD introduces more complex failure modes …

Adaptive fault tolerant resource allocation scheme for cloud computing environments

V Sathiyamoorthi, P Keerthika, P Suresh… - … of Organizational and …, 2021 - igi-global.com
Cloud computing is an optimistic technology that leverages the computing resources to offer
globally better and more efficient services than the collection of individual use of internet …

Failure prediction using machine learning in a virtualised HPC system and application

B Mohammed, I Awan, H Ugail, M Younas - Cluster Computing, 2019 - Springer
Failure is an increasingly important issue in high performance computing and cloud
systems. As large-scale systems continue to grow in scale and complexity, mitigating the …