- Academic Search

A survey of online failure prediction methods

F Salfner, M Lenk, M Malek - ACM Computing Surveys (CSUR), 2010 - dl.acm.org

With the ever-growing complexity and dynamicity of computer systems, proactive fault
management is an effective approach to enhancing availability. Online failure prediction is …

Cited by 827

[PDF] github.io

Service-generated big data and big data-as-a-service: an overview

Z Zheng, J Zhu, MR Lyu - 2013 IEEE international congress on …, 2013 - ieeexplore.ieee.org

With the prevalence of service computing and cloud computing, more and more services are
emerging on the Internet, generating huge volume of data, such as trace logs, QoS …

Cited by 311

[PDF] uwaterloo.ca

Detecting large-scale system problems by mining console logs

W Xu, L Huang, A Fox, D Patterson… - Proceedings of the ACM …, 2009 - dl.acm.org

Surprisingly, console logs rarely help operators detect problems in large-scale datacenter
services, for they often consist of the voluminous intermixing of messages from many …

Cited by 1526

[PDF] usenix.org

Using Magpie for request extraction and workload modelling.

P Barham, A Donnelly, R Isaacs, R Mortier - OSDI, 2004 - usenix.org

Tools to understand complex system behaviour are essential for many performance analysis
and debugging tasks, yet there are many open research problems in their development …

Cited by 912

[PDF] acm.org

Pivot tracing: Dynamic causal monitoring for distributed systems

J Mace, R Roelke, R Fonseca - ACM Transactions on Computer Systems …, 2018 - dl.acm.org

Monitoring and troubleshooting distributed systems is notoriously difficult; potential problems
are complex, varied, and unpredictable. The monitoring and diagnosis tools commonly used …

Cited by 365

[PDF] aiops.org

A survey of aiops methods for failure management

P Notaro, J Cardoso, M Gerndt - ACM Transactions on Intelligent …, 2021 - dl.acm.org

Modern society is increasingly moving toward complex and distributed computing systems.
The increase in scale and complexity of these systems challenges O&M teams that perform …

Cited by 82

[PDF] springer.com

Enjoy your observability: an industrial survey of microservice tracing and analysis

B Li, X Peng, Q Xiang, H Wang, T Xie, J Sun… - Empirical Software …, 2022 - Springer

Microservice systems are often deployed in complex cloud-based environments and may
involve a large number of service instances being dynamically created and destroyed. It is …

Cited by 97

[PDF] berkeley.edu

Failure diagnosis using decision trees

M Chen, AX Zheng, J Lloyd, MI Jordan… - International …, 2004 - ieeexplore.ieee.org

We present a decision tree learning approach to diagnosing failures in large Internet sites.
We record runtime properties of each request and apply automated machine learning and …

Cited by 620

[PDF] acm.org

Canopy: An end-to-end performance tracing and analysis system

J Kaldor, J Mace, M Bejda, E Gao… - Proceedings of the 26th …, 2017 - dl.acm.org

This paper presents Canopy, Facebook's end-to-end performance tracing infrastructure.
Canopy records causally related performance data across the end-to-end execution path of …

Cited by 221

[PDF] cmu.edu

[PDF] Correlating Instrumentation Data to System States: A Building Block for Automated Diagnosis and Control.

I Cohen, JS Chase, M Goldszmidt, T Kelly, J Symons - OSDI, 2004 - users.ece.cmu.edu

This paper studies the use of statistical induction techniques as a basis for automated
performance diagnosis and performance management. The goal of the work is to develop …

Cited by 689

Path-based failure and evolution management

A survey of online failure prediction methods
