X-ray: Automating root-cause diagnosis of performance anomalies in production software

M Attariyan, M Chow, J Flinn - 10th USENIX Symposium on Operating …, 2012 - usenix.org
Troubleshooting the performance of production software is challenging. Most existing tools,
such as profiling, tracing, and logging systems, reveal what events occurred during …

A checkpoint of research on parallel I/O for high-performance computing

FZ Boito, EC Inacio, JL Bez, POA Navaux… - ACM Computing …, 2018 - dl.acm.org
We present a comprehensive survey on parallel I/O in the high-performance computing
(HPC) context. This is an important field for HPC because of the historic gap between …

Fail-slow at scale: Evidence of hardware performance faults in large production systems

HS Gunawi, RO Suminto, R Sears, C Golliher… - ACM Transactions on …, 2018 - dl.acm.org
Fail-slow hardware is an under-studied failure mode. We present a study of 114 reports of
fail-slow hardware incidents, collected from large-scale cluster deployments in 14 …

UBL: Unsupervised behavior learning for predicting performance anomalies in virtualized cloud systems

DJ Dean, H Nguyen, X Gu - … of the 9th international conference on …, 2012 - dl.acm.org
Infrastructure-as-a-Service (IaaS) clouds are prone to performance anomalies due to their
complex nature. Although previous work has shown the effectiveness of using statistical …

Diagnosing performance changes by comparing request flows

RR Sambasivan, AX Zheng, M De Rosa… - … USENIX Symposium on …, 2011 - usenix.org
The causes of performance changes in a distributed system often elude even its developers.
This paper develops a new technique for gaining insight into such changes: comparing …

Managing variability in the IO performance of petascale storage systems

J Lofstead, F Zheng, Q Liu, S Klasky… - SC'10: Proceedings …, 2010 - ieeexplore.ieee.org
Significant challenges exist for achieving peak or even consistent levels of performance
when using IO systems at scale. They stem from sharing IO system resources across the …

Sifter: Scalable sampling for distributed traces, without feature engineering

P Las-Casas, G Papakerashvili, V Anand… - Proceedings of the ACM …, 2019 - dl.acm.org
Distributed tracing is a core component of cloud and datacenter systems, and provides
visibility into their end-to-end runtime behavior. To reduce computational and storage …

Statistical debugging for real-world performance problems

L Song, S Lu - ACM SIGPLAN Notices, 2014 - dl.acm.org
Design and implementation defects that lead to inefficient computation widely exist in
software. These defects are difficult to avoid and discover. They lead to severe performance …

Limplock: Understanding the impact of limpware on scale-out cloud systems

T Do, M Hao, T Leesatapornwongsa… - Proceedings of the 4th …, 2013 - dl.acm.org
We highlight one often-overlooked cause of performance failure: limpware--"limping"
hardware whose performance degrades significantly compared to its specification. We …

IASO: A fail-slow detection and mitigation framework for distributed storage services

B Panda, D Srinivasan, H Ke, K Gupta, V Khot… - 2019 USENIX Annual …, 2019 - usenix.org
We address the problem of “fail-slow” fault, a fault where a hardware or software component
can still function (does not fail-stop) but in much lower performance than expected. To …