Systems approaches to tackling configuration errors: A survey
In recent years, configuration errors (i.e., misconfigurations) have become one of the
dominant causes of system failures, resulting in many severe service outages and …
An empirical study on TensorFlow program bugs
Deep learning applications become increasingly popular in important domains such as self-
driving systems and facial identity systems. Defective deep learning applications may lead to …
Why does the cloud stop computing? Lessons from hundreds of service outages
We conducted a cloud outage study (COS) of 32 popular Internet services. We analyzed
1247 headline news and public post-mortem reports that detail 597 unplanned outages that …
Understanding memory and thread safety practices and issues in real-world Rust programs
Rust is a young programming language designed for systems software development. It aims
to provide safety guarantees like high-level languages and performance efficiency like low …
Tiny-tail flash: Near-perfect elimination of garbage collection tail latencies in NAND SSDs
Flash storage has become the mainstream destination for storage users. However, SSDs do
not always deliver the performance that users expect. The core culprit of flash performance …
Fail-slow at scale: Evidence of hardware performance faults in large production systems
Fail-slow hardware is an under-studied failure mode. We present a study of 114 reports of
fail-slow hardware incidents, collected from large-scale cluster deployments in 14 …
How to fight production incidents? An empirical study on a large-scale cloud service
Production incidents in today's large-scale cloud services can be extremely expensive in
terms of customer impacts and engineering resources required to mitigate them. Despite …
TaxDC: A taxonomy of non-deterministic concurrency bugs in datacenter distributed systems
We present TaxDC, the largest and most comprehensive taxonomy of non-deterministic
concurrency bugs in distributed systems. We study 104 distributed concurrency (DC) bugs …
How users interpret bugs in trigger-action programming
Trigger-action programming (TAP) is a programming model enabling users to connect
services and devices by writing if-then rules. As such systems are deployed in increasingly …
The tail at store: A revelation from millions of hours of disk and SSD deployments
We study storage performance in over 450,000 disks and 4,000 SSDs over 87 days for an
overall total of 857 million (disk) and 7 million (SSD) drive hours. We find that storage …