X-ray: Automating Root-Cause diagnosis of performance anomalies in production software
M Attariyan, M Chow, J Flinn - 10th USENIX Symposium on Operating …, 2012 - usenix.org
Troubleshooting the performance of production software is challenging. Most existing tools,
such as profiling, tracing, and logging systems, reveal what events occurred during …
A checkpoint of research on parallel I/O for high-performance computing
We present a comprehensive survey on parallel I/O in the high-performance computing
(HPC) context. This is an important field for HPC because of the historic gap between …
Fail-slow at scale: Evidence of hardware performance faults in large production systems
Fail-slow hardware is an under-studied failure mode. We present a study of 114 reports of
fail-slow hardware incidents, collected from large-scale cluster deployments in 14 …
UBL: Unsupervised behavior learning for predicting performance anomalies in virtualized cloud systems
Infrastructure-as-a-Service (IaaS) clouds are prone to performance anomalies due to their
complex nature. Although previous work has shown the effectiveness of using statistical …
Diagnosing performance changes by comparing request flows
The causes of performance changes in a distributed system often elude even its developers.
This paper develops a new technique for gaining insight into such changes: comparing …
Managing variability in the IO performance of petascale storage systems
Significant challenges exist for achieving peak or even consistent levels of performance
when using IO systems at scale. They stem from sharing IO system resources across the …
Sifter: Scalable sampling for distributed traces, without feature engineering
Distributed tracing is a core component of cloud and datacenter systems, and provides
visibility into their end-to-end runtime behavior. To reduce computational and storage …
Statistical debugging for real-world performance problems
Design and implementation defects that lead to inefficient computation widely exist in
software. These defects are difficult to avoid and discover. They lead to severe performance …
Limplock: Understanding the impact of limpware on scale-out cloud systems
We highlight one often-overlooked cause of performance failure: limpware, or "limping"
hardware whose performance degrades significantly compared to its specification. We …
IASO: A Fail-Slow Detection and Mitigation Framework for Distributed Storage Services
We address the problem of “fail-slow” fault, a fault where a hardware or software component
can still function (does not fail-stop) but in much lower performance than expected. To …