Systems approaches to tackling configuration errors: A survey

T Xu, Y Zhou - ACM Computing Surveys (CSUR), 2015 - dl.acm.org
In recent years, configuration errors (i.e., misconfigurations) have become one of the
dominant causes of system failures, resulting in many severe service outages and …

Software configuration engineering in practice: Interviews, survey, and systematic literature review

M Sayagh, N Kerzazi, B Adams… - IEEE Transactions on …, 2018 - ieeexplore.ieee.org
Modern software applications are adapted to different situations (e.g., memory limits,
enabling/disabling features, database credentials) by changing the values of configuration …

Why does the cloud stop computing? Lessons from hundreds of service outages

HS Gunawi, M Hao, RO Suminto, A Laksono… - Proceedings of the …, 2016 - dl.acm.org
We conducted a cloud outage study (COS) of 32 popular Internet services. We analyzed
1247 headline news and public post-mortem reports that detail 597 unplanned outages that …

Understanding and detecting real-world performance bugs

G Jin, L Song, X Shi, J Scherpelz, S Lu - ACM SIGPLAN Notices, 2012 - dl.acm.org
Developers frequently use inefficient code sequences that could be fixed by simple patches.
These inefficient code sequences can cause significant performance degradation and …

Face it yourselves: An LLM-based two-stage strategy to localize configuration errors via logs

S Shan, Y Huo, Y Su, Y Li, D Li, Z Zheng - Proceedings of the 33rd ACM …, 2024 - dl.acm.org
Configurable software systems are prone to configuration errors, resulting in significant
losses to companies. However, diagnosing these errors is challenging due to the vast and …

Hey, you have given me too many knobs!: Understanding and dealing with over-designed configuration in system software

T Xu, L Jin, X Fan, Y Zhou, S Pasupathy… - Proceedings of the 2015 …, 2015 - dl.acm.org
Configuration problems are not only prevalent, but also severely impair the reliability of
today's system software. One fundamental reason is the ever-increasing complexity of …

X-ray: Automating root-cause diagnosis of performance anomalies in production software

M Attariyan, M Chow, J Flinn - 10th USENIX Symposium on Operating …, 2012 - usenix.org
Troubleshooting the performance of production software is challenging. Most existing tools,
such as profiling, tracing, and logging systems, reveal what events occurred during …

An empirical study on configuration errors in commercial and open source systems

Z Yin, X Ma, J Zheng, Y Zhou… - Proceedings of the …, 2011 - dl.acm.org
Configuration errors (i.e., misconfigurations) are among the dominant causes of system
failures. Their importance has inspired many research efforts on detecting, diagnosing, and …

libdft: Practical dynamic data flow tracking for commodity systems

VP Kemerlis, G Portokalidis, K Jee… - Proceedings of the 8th …, 2012 - dl.acm.org
Dynamic data flow tracking (DFT) deals with tagging and tracking data of interest as they
propagate during program execution. DFT has been repeatedly implemented by a variety of …

Fail-slow at scale: Evidence of hardware performance faults in large production systems

HS Gunawi, RO Suminto, R Sears, C Golliher… - ACM Transactions on …, 2018 - dl.acm.org
Fail-slow hardware is an under-studied failure mode. We present a study of 114 reports of
fail-slow hardware incidents, collected from large-scale cluster deployments in 14 …