A survey on intelligent management of alerts and incidents in IT services

Q Yu, N Zhao, M Li, Z Li, H Wang, W Zhang… - Journal of Network and …, 2024 - Elsevier
Modern service systems are constantly improving with the development of various IT
technologies, leading to a boost in system scales and complex dependencies among …

The emerging role of data scientists on software development teams

M Kim, T Zimmermann, R DeLine, A Begel - Proceedings of the 38th …, 2016 - dl.acm.org
Creating and running software produces large amounts of raw data about the development
process and the customer usage, which can be turned into actionable insight with the help of …

Actionable and interpretable fault localization for recurring failures in online service systems

Z Li, N Zhao, M Li, X Lu, L Wang, D Chang… - Proceedings of the 30th …, 2022 - dl.acm.org
Fault localization is challenging in an online service system due to its monitoring data's large
volume and variety and complex dependencies across/within its components (e.g., services …

Analyze this! 145 questions for data scientists in software engineering

A Begel, T Zimmermann - … of the 36th International Conference on …, 2014 - dl.acm.org
In this paper, we present the results from two surveys related to data science applied to
software engineering. The first survey solicited questions that software engineers would like …

AIOps solutions for incident management: Technical guidelines and a comprehensive literature review

Y Remil, A Bendimerad, R Mathonat… - arXiv preprint arXiv …, 2024 - arxiv.org
The management of modern IT systems poses unique challenges, necessitating scalability,
reliability, and efficiency in handling extensive data streams. Traditional methods, reliant on …

Heterogeneous anomaly detection for software systems via semi-supervised cross-modal attention

C Lee, T Yang, Z Chen, Y Su, Y Yang… - 2023 IEEE/ACM 45th …, 2023 - ieeexplore.ieee.org
Prompt and accurate detection of system anomalies is essential to ensure the reliability of
software systems. Unlike manual efforts that exploit all available run-time information …

Towards intelligent incident management: why we need it and how we make it

Z Chen, Y Kang, L Li, X Zhang, H Zhang, H Xu… - Proceedings of the 28th …, 2020 - dl.acm.org
The management of cloud service incidents (unplanned interruptions or outages of a
service/product) greatly affects customer satisfaction and business revenue. After years of …

An empirical investigation of incident triage for online service systems

J Chen, X He, Q Lin, Y Xu, H Zhang… - 2019 IEEE/ACM 41st …, 2019 - ieeexplore.ieee.org
Online service systems have become increasingly popular. During operation of an online
service system, incidents (unplanned interruptions or outages of the service) are inevitable …

Predicting node failures in an ultra-large-scale cloud computing platform: an AIOps solution

Y Li, ZM Jiang, H Li, AE Hassan, C He… - ACM Transactions on …, 2020 - dl.acm.org
Many software services today are hosted on cloud computing platforms, such as Amazon
EC2, due to many benefits like reduced operational costs. However, node failures in these …

Continuous incident triage for large-scale online service systems

J Chen, X He, Q Lin, H Zhang, D Hao… - 2019 34th IEEE/ACM …, 2019 - ieeexplore.ieee.org
In recent years, online service systems have become increasingly popular. Incidents of
these systems could cause significant economic loss and customer dissatisfaction. Incident …