Systems approaches to tackling configuration errors: A survey

T Xu, Y Zhou - ACM Computing Surveys (CSUR), 2015 - dl.acm.org
In recent years, configuration errors (i.e., misconfigurations) have become one of the
dominant causes of system failures, resulting in many severe service outages and …

Simple testing can prevent most critical failures: An analysis of production failures in distributed Data-Intensive systems

D Yuan, Y Luo, X Zhuang, GR Rodrigues… - … USENIX Symposium on …, 2014 - usenix.org
Large, production quality distributed systems still fail periodically, and do so sometimes
catastrophically, where most or all users experience an outage or data loss. We present the …

An empirical study on configuration errors in commercial and open source systems

Z Yin, X Ma, J Zheng, Y Zhou… - Proceedings of the …, 2011 - dl.acm.org
Configuration errors (i.e., misconfigurations) are among the dominant causes of system
failures. Their importance has inspired many research efforts on detecting, diagnosing, and …

X-ray: Automating Root-Cause diagnosis of performance anomalies in production software

M Attariyan, M Chow, J Flinn - 10th USENIX Symposium on Operating …, 2012 - usenix.org
Troubleshooting the performance of production software is challenging. Most existing tools,
such as profiling, tracing, and logging systems, reveal what events occurred during …

Do not blame users for misconfigurations

T Xu, J Zhang, P Huang, J Zheng, T Sheng… - Proceedings of the …, 2013 - dl.acm.org
Similar to software bugs, configuration errors are also one of the major causes of today's
system failures. Many configuration issues manifest themselves in ways similar to software …

Toward fine-grained, unsupervised, scalable performance diagnosis for production cloud computing systems

H Mi, H Wang, Y Zhou, MRT Lyu… - IEEE Transactions on …, 2013 - ieeexplore.ieee.org
Performance diagnosis is labor intensive in production cloud computing systems. Such
systems typically face many real-world challenges, which the existing diagnosis techniques …

Automating configuration troubleshooting with dynamic information flow analysis

M Attariyan, J Flinn - 9th USENIX Symposium on Operating Systems …, 2010 - usenix.org
Software misconfigurations are time-consuming and enormously frustrating to troubleshoot.
In this paper, we show that dynamic information flow analysis helps solve these problems by …

Metastable failures in the wild

L Huang, M Magnusson, AB Muralikrishna… - … USENIX Symposium on …, 2022 - usenix.org
Recently, Bronson et al. introduced a framework for understanding a class of failures in
distributed systems called metastable failures. The examples of metastable failures …

Challenges and opportunities: an in-depth empirical study on configuration error injection testing

W Li, Z Jia, S Li, Y Zhang, T Wang, E Xu… - Proceedings of the 30th …, 2021 - dl.acm.org
Configuration error injection testing (CEIT) could systematically evaluate software reliability
and diagnosability to runtime configuration errors. This paper explores the challenges and …

Encore: Exploiting system environment and correlation information for misconfiguration detection

J Zhang, L Renganarayana, X Zhang, N Ge… - Proceedings of the 19th …, 2014 - dl.acm.org
As software systems become more complex and configurable, failures due to
misconfigurations are becoming a critical problem. Such failures often have serious …