Systems approaches to tackling configuration errors: A survey

T Xu, Y Zhou - ACM Computing Surveys (CSUR), 2015 - dl.acm.org
In recent years, configuration errors (i.e., misconfigurations) have become one of the
dominant causes of system failures, resulting in many severe service outages and …

An empirical study on TensorFlow program bugs

Y Zhang, Y Chen, SC Cheung, Y Xiong… - Proceedings of the 27th …, 2018 - dl.acm.org
Deep learning applications become increasingly popular in important domains such as self-
driving systems and facial identity systems. Defective deep learning applications may lead to …

Why does the cloud stop computing? Lessons from hundreds of service outages

HS Gunawi, M Hao, RO Suminto, A Laksono… - Proceedings of the …, 2016 - dl.acm.org
We conducted a cloud outage study (COS) of 32 popular Internet services. We analyzed
1247 headline news and public post-mortem reports that detail 597 unplanned outages that …

Understanding memory and thread safety practices and issues in real-world Rust programs

B Qin, Y Chen, Z Yu, L Song, Y Zhang - Proceedings of the 41st ACM …, 2020 - dl.acm.org
Rust is a young programming language designed for systems software development. It aims
to provide safety guarantees like high-level languages and performance efficiency like low …

Tiny-tail flash: Near-perfect elimination of garbage collection tail latencies in NAND SSDs

S Yan, H Li, M Hao, MH Tong… - ACM Transactions on …, 2017 - dl.acm.org
Flash storage has become the mainstream destination for storage users. However, SSDs do
not always deliver the performance that users expect. The core culprit of flash performance …

Improving high-impact bug report prediction with combination of interactive machine learning and active learning

X Wu, W Zheng, X Chen, Y Zhao, T Yu, D Mu - Information and Software …, 2021 - Elsevier
Context: Bug reports record issues found during software development and maintenance. A
high-impact bug report (HBR) describes an issue that can cause severe damage once …

Hey, you have given me too many knobs!: Understanding and dealing with over-designed configuration in system software

T Xu, L Jin, X Fan, Y Zhou, S Pasupathy… - Proceedings of the 2015 …, 2015 - dl.acm.org
Configuration problems are not only prevalent, but also severely impair the reliability of
today's system software. One fundamental reason is the ever-increasing complexity of …

A systematic literature review on benchmarks for evaluating debugging approaches

T Hirsch, B Hofer - Journal of Systems and Software, 2022 - Elsevier
Bug benchmarks are used in development and evaluation of debugging approaches, e.g.,
fault localization and automated repair. Quantitative performance comparison of different …

SGX-LKL: Securing the host OS interface for trusted execution

C Priebe, D Muthukumaran, J Lind, H Zhu… - arXiv preprint arXiv …, 2019 - arxiv.org
Hardware support for trusted execution in modern CPUs enables tenants to shield their data
processing workloads in otherwise untrusted cloud environments. Runtime systems for the …

Fail-slow at scale: Evidence of hardware performance faults in large production systems

HS Gunawi, RO Suminto, R Sears, C Golliher… - ACM Transactions on …, 2018 - dl.acm.org
Fail-slow hardware is an under-studied failure mode. We present a study of 114 reports of
fail-slow hardware incidents, collected from large-scale cluster deployments in 14 …