A survey of online failure prediction methods
F Salfner, M Lenk, M Malek - ACM Computing Surveys (CSUR), 2010 - dl.acm.org
With the ever-growing complexity and dynamicity of computer systems, proactive fault
management is an effective approach to enhancing availability. Online failure prediction is …
Service-generated big data and big data-as-a-service: an overview
With the prevalence of service computing and cloud computing, more and more services are
emerging on the Internet, generating huge volume of data, such as trace logs, QoS …
Detecting large-scale system problems by mining console logs
Surprisingly, console logs rarely help operators detect problems in large-scale datacenter
services, for they often consist of the voluminous intermixing of messages from many …
Using Magpie for request extraction and workload modelling.
Tools to understand complex system behaviour are essential for many performance analysis
and debugging tasks, yet there are many open research problems in their development …
Pivot tracing: Dynamic causal monitoring for distributed systems
Monitoring and troubleshooting distributed systems is notoriously difficult; potential problems
are complex, varied, and unpredictable. The monitoring and diagnosis tools commonly used …
A survey of AIOps methods for failure management
Modern society is increasingly moving toward complex and distributed computing systems.
The increase in scale and complexity of these systems challenges O&M teams that perform …
Enjoy your observability: an industrial survey of microservice tracing and analysis
Microservice systems are often deployed in complex cloud-based environments and may
involve a large number of service instances being dynamically created and destroyed. It is …
Failure diagnosis using decision trees
We present a decision tree learning approach to diagnosing failures in large Internet sites.
We record runtime properties of each request and apply automated machine learning and …
Canopy: An end-to-end performance tracing and analysis system
J Kaldor, J Mace, M Bejda, E Gao… - Proceedings of the 26th …, 2017 - dl.acm.org
This paper presents Canopy, Facebook's end-to-end performance tracing infrastructure.
Canopy records causally related performance data across the end-to-end execution path of …
Correlating Instrumentation Data to System States: A Building Block for Automated Diagnosis and Control.
This paper studies the use of statistical induction techniques as a basis for automated
performance diagnosis and performance management. The goal of the work is to develop …