Self-healing in emerging cellular networks: Review, challenges, and research directions

A Asghar, H Farooq, A Imran - IEEE Communications Surveys & …, 2018 - ieeexplore.ieee.org
Mobile cellular network operators spend nearly a quarter of their revenue on network
management and maintenance. Incidentally, a significant proportion of that budget is spent …

Systems approaches to tackling configuration errors: A survey

T Xu, Y Zhou - ACM Computing Surveys (CSUR), 2015 - dl.acm.org
In recent years, configuration errors (i.e., misconfigurations) have become one of the
dominant causes of system failures, resulting in many severe service outages and …

Pivot tracing: Dynamic causal monitoring for distributed systems

J Mace, R Roelke, R Fonseca - ACM Transactions on Computer Systems …, 2018 - dl.acm.org
Monitoring and troubleshooting distributed systems is notoriously difficult; potential problems
are complex, varied, and unpredictable. The monitoring and diagnosis tools commonly used …

Why does the cloud stop computing? Lessons from hundreds of service outages

HS Gunawi, M Hao, RO Suminto, A Laksono… - Proceedings of the …, 2016 - dl.acm.org
We conducted a cloud outage study (COS) of 32 popular Internet services. We analyzed
1247 headline news and public post-mortem reports that detail 597 unplanned outages that …

Face it yourselves: An LLM-based two-stage strategy to localize configuration errors via logs

S Shan, Y Huo, Y Su, Y Li, D Li, Z Zheng - Proceedings of the 33rd ACM …, 2024 - dl.acm.org
Configurable software systems are prone to configuration errors, resulting in significant
losses to companies. However, diagnosing these errors is challenging due to the vast and …

Simple testing can prevent most critical failures: An analysis of production failures in distributed Data-Intensive systems

D Yuan, Y Luo, X Zhuang, GR Rodrigues… - … USENIX Symposium on …, 2014 - usenix.org
Large, production quality distributed systems still fail periodically, and do so sometimes
catastrophically, where most or all users experience an outage or data loss. We present the …

Hey, you have given me too many knobs!: Understanding and dealing with over-designed configuration in system software

T Xu, L Jin, X Fan, Y Zhou, S Pasupathy… - Proceedings of the 2015 …, 2015 - dl.acm.org
Configuration problems are not only prevalent, but also severely impair the reliability of
today's system software. One fundamental reason is the ever-increasing complexity of …

Fail-slow at scale: Evidence of hardware performance faults in large production systems

HS Gunawi, RO Suminto, R Sears, C Golliher… - ACM Transactions on …, 2018 - dl.acm.org
Fail-slow hardware is an under-studied failure mode. We present a study of 114 reports of
fail-slow hardware incidents, collected from large-scale cluster deployments in 14 …

X-ray: Automating Root-Cause diagnosis of performance anomalies in production software

M Attariyan, M Chow, J Flinn - 10th USENIX Symposium on Operating …, 2012 - usenix.org
Troubleshooting the performance of production software is challenging. Most existing tools,
such as profiling, tracing, and logging systems, reveal what events occurred during …

Mining StackOverflow for program repair

X Liu, H Zhong - 2018 IEEE 25th international conference on …, 2018 - ieeexplore.ieee.org
In recent years, automatic program repair has been a hot research topic in the software
engineering community, and many approaches have been proposed. Although these …