Anomaly detection and failure root cause analysis in (micro) service-based cloud applications: A survey

J Soldani, A Brogi - ACM Computing Surveys (CSUR), 2022 - dl.acm.org
The proliferation of services and service interactions within microservices and cloud-native
applications makes it harder to detect failures and to identify their possible root causes …

AI for IT operations (AIOps) on cloud platforms: Reviews, opportunities and challenges

Q Cheng, D Sahoo, A Saha, W Yang, C Liu… - arXiv preprint arXiv …, 2023 - arxiv.org
Artificial Intelligence for IT operations (AIOps) aims to combine the power of AI with the big
data generated by IT Operations processes, particularly in cloud infrastructures, to provide …

DeepTraLog: Trace-log combined microservice anomaly detection through graph-based deep learning

C Zhang, X Peng, C Sha, K Zhang, Z Fu, X Wu… - Proceedings of the 44th …, 2022 - dl.acm.org
A microservice system in industry is usually a large-scale distributed system consisting of
dozens to thousands of services running in different machines. An anomaly of the system …

Root cause analysis of failures in microservices through causal discovery

A Ikram, S Chakraborty, S Mitra… - Advances in …, 2022 - proceedings.neurips.cc
Most cloud applications use a large number of smaller sub-components (called
microservices) that interact with each other in the form of a complex graph to provide the …

Eadro: An end-to-end troubleshooting framework for microservices on multi-source data

C Lee, T Yang, Z Chen, Y Su… - 2023 IEEE/ACM 45th …, 2023 - ieeexplore.ieee.org
The complexity and dynamism of microservices pose significant challenges to system
reliability, and thereby, automated troubleshooting is crucial. Effective root cause localization …

CRISP: Critical path analysis of Large-Scale microservice architectures

Z Zhang, MK Ramanathan, P Raj, A Parwal… - 2022 USENIX Annual …, 2022 - usenix.org
Microservice architectures have become the lifeblood of modern service-oriented software
systems. Remote Procedure Calls (RPCs) among microservices are deeply nested …

Practical root cause localization for microservice systems via trace analysis

Z Li, J Chen, R Jiao, N Zhao, Z Wang… - 2021 IEEE/ACM 29th …, 2021 - ieeexplore.ieee.org
Microservice architecture is applied by an increasing number of systems because of its
benefits on delivery, scalability, and autonomy. It is essential but challenging to localize root …

Identifying bad software changes via multimodal anomaly detection for online service systems

N Zhao, J Chen, Z Yu, H Wang, J Li, B Qiu… - Proceedings of the 29th …, 2021 - dl.acm.org
In large-scale online service systems, software changes are inevitable and frequent. Due to
importing new code or configurations, changes are likely to incur incidents and destroy user …

Actionable and interpretable fault localization for recurring failures in online service systems

Z Li, N Zhao, M Li, X Lu, L Wang, D Chang… - Proceedings of the 30th …, 2022 - dl.acm.org
Fault localization is challenging in an online service system due to its monitoring data's large
volume and variety and complex dependencies across/within its components (e.g., services …

TimeAutoAD: Autonomous anomaly detection with self-supervised contrastive loss for multivariate time series

Y Jiao, K Yang, D Song, D Tao - IEEE Transactions on Network …, 2022 - ieeexplore.ieee.org
Multivariate time series (MTS) data are becoming increasingly ubiquitous in networked
systems, e.g., IoT systems and 5G networks. Anomaly detection in MTS refers to identifying …