Systems approaches to tackling configuration errors: A survey

T Xu, Y Zhou - ACM Computing Surveys (CSUR), 2015 - dl.acm.org
In recent years, configuration errors (i.e., misconfigurations) have become one of the
dominant causes of system failures, resulting in many severe service outages and …

An empirical study on TensorFlow program bugs

Y Zhang, Y Chen, SC Cheung, Y Xiong… - Proceedings of the 27th …, 2018 - dl.acm.org
Deep learning applications become increasingly popular in important domains such as self-
driving systems and facial identity systems. Defective deep learning applications may lead to …

Why does the cloud stop computing? Lessons from hundreds of service outages

HS Gunawi, M Hao, RO Suminto, A Laksono… - Proceedings of the …, 2016 - dl.acm.org
We conducted a cloud outage study (COS) of 32 popular Internet services. We analyzed
1247 headline news and public post-mortem reports that detail 597 unplanned outages that …

Understanding memory and thread safety practices and issues in real-world Rust programs

B Qin, Y Chen, Z Yu, L Song, Y Zhang - Proceedings of the 41st ACM …, 2020 - dl.acm.org
Rust is a young programming language designed for systems software development. It aims
to provide safety guarantees like high-level languages and performance efficiency like low …

Tiny-tail flash: Near-perfect elimination of garbage collection tail latencies in NAND SSDs

S Yan, H Li, M Hao, MH Tong… - ACM Transactions on …, 2017 - dl.acm.org
Flash storage has become the mainstream destination for storage users. However, SSDs do
not always deliver the performance that users expect. The core culprit of flash performance …

Fail-slow at scale: Evidence of hardware performance faults in large production systems

HS Gunawi, RO Suminto, R Sears, C Golliher… - ACM Transactions on …, 2018 - dl.acm.org
Fail-slow hardware is an under-studied failure mode. We present a study of 114 reports of
fail-slow hardware incidents, collected from large-scale cluster deployments in 14 …

How to fight production incidents? An empirical study on a large-scale cloud service

S Ghosh, M Shetty, C Bansal, S Nath - … of the 13th Symposium on Cloud …, 2022 - dl.acm.org
Production incidents in today's large-scale cloud services can be extremely expensive in
terms of customer impacts and engineering resources required to mitigate them. Despite …

TaxDC: A taxonomy of non-deterministic concurrency bugs in datacenter distributed systems

T Leesatapornwongsa, JF Lukman, S Lu… - Proceedings of the …, 2016 - dl.acm.org
We present TaxDC, the largest and most comprehensive taxonomy of non-deterministic
concurrency bugs in distributed systems. We study 104 distributed concurrency (DC) bugs …

How users interpret bugs in trigger-action programming

W Brackenbury, A Deora, J Ritchey, J Vallee… - Proceedings of the …, 2019 - dl.acm.org
Trigger-action programming (TAP) is a programming model enabling users to connect
services and devices by writing if-then rules. As such systems are deployed in increasingly …

The tail at store: A revelation from millions of hours of disk and SSD deployments

M Hao, G Soundararajan… - … USENIX Conference on …, 2016 - usenix.org
We study storage performance in over 450,000 disks and 4,000 SSDs over 87 days for an
overall total of 857 million (disk) and 7 million (SSD) drive hours. We find that storage …