A survey of online failure prediction methods

F Salfner, M Lenk, M Malek - ACM Computing Surveys (CSUR), 2010 - dl.acm.org
With the ever-growing complexity and dynamicity of computer systems, proactive fault
management is an effective approach to enhancing availability. Online failure prediction is …

Service-generated big data and big data-as-a-service: an overview

Z Zheng, J Zhu, MR Lyu - 2013 IEEE international congress on …, 2013 - ieeexplore.ieee.org
With the prevalence of service computing and cloud computing, more and more services are
emerging on the Internet, generating huge volume of data, such as trace logs, QoS …

Detecting large-scale system problems by mining console logs

W Xu, L Huang, A Fox, D Patterson… - Proceedings of the ACM …, 2009 - dl.acm.org
Surprisingly, console logs rarely help operators detect problems in large-scale datacenter
services, for they often consist of the voluminous intermixing of messages from many …

Using Magpie for request extraction and workload modelling.

P Barham, A Donnelly, R Isaacs, R Mortier - OSDI, 2004 - usenix.org
Tools to understand complex system behaviour are essential for many performance analysis
and debugging tasks, yet there are many open research problems in their development …

Pivot tracing: Dynamic causal monitoring for distributed systems

J Mace, R Roelke, R Fonseca - ACM Transactions on Computer Systems …, 2018 - dl.acm.org
Monitoring and troubleshooting distributed systems is notoriously difficult; potential problems
are complex, varied, and unpredictable. The monitoring and diagnosis tools commonly used …

A survey of AIOps methods for failure management

P Notaro, J Cardoso, M Gerndt - ACM Transactions on Intelligent …, 2021 - dl.acm.org
Modern society is increasingly moving toward complex and distributed computing systems.
The increase in scale and complexity of these systems challenges O&M teams that perform …

Enjoy your observability: an industrial survey of microservice tracing and analysis

B Li, X Peng, Q Xiang, H Wang, T Xie, J Sun… - Empirical Software …, 2022 - Springer
Microservice systems are often deployed in complex cloud-based environments and may
involve a large number of service instances being dynamically created and destroyed. It is …

Failure diagnosis using decision trees

M Chen, AX Zheng, J Lloyd, MI Jordan… - International …, 2004 - ieeexplore.ieee.org
We present a decision tree learning approach to diagnosing failures in large Internet sites.
We record runtime properties of each request and apply automated machine learning and …

Canopy: An end-to-end performance tracing and analysis system

J Kaldor, J Mace, M Bejda, E Gao… - Proceedings of the 26th …, 2017 - dl.acm.org
This paper presents Canopy, Facebook's end-to-end performance tracing infrastructure.
Canopy records causally related performance data across the end-to-end execution path of …

Correlating Instrumentation Data to System States: A Building Block for Automated Diagnosis and Control.

I Cohen, JS Chase, M Goldszmidt, T Kelly, J Symons - OSDI, 2004 - users.ece.cmu.edu
This paper studies the use of statistical induction techniques as a basis for automated
performance diagnosis and performance management. The goal of the work is to develop …