- Academic Search

M Attariyan, M Chow, J Flinn - 10th USENIX Symposium on Operating …, 2012 - usenix.org

Troubleshooting the performance of production software is challenging. Most existing tools,
such as profiling, tracing, and logging systems, reveal what events occurred during …

Cited by 350

A checkpoint of research on parallel i/o for high-performance computing

FZ Boito, EC Inacio, JL Bez, POA Navaux… - ACM Computing …, 2018 - dl.acm.org

We present a comprehensive survey on parallel I/O in the high-performance computing
(HPC) context. This is an important field for HPC because of the historic gap between …

Cited by 56

Fail-slow at scale: Evidence of hardware performance faults in large production systems

HS Gunawi, RO Suminto, R Sears, C Golliher… - ACM Transactions on …, 2018 - dl.acm.org

Fail-slow hardware is an under-studied failure mode. We present a study of 114 reports of
fail-slow hardware incidents, collected from large-scale cluster deployments in 14 …

Cited by 174

Ubl: Unsupervised behavior learning for predicting performance anomalies in virtualized cloud systems

DJ Dean, H Nguyen, X Gu - … of the 9th international conference on …, 2012 - dl.acm.org

Infrastructure-as-a-Service (IaaS) clouds are prone to performance anomalies due to their
complex nature. Although previous work has shown the effectiveness of using statistical …

Cited by 241

Diagnosing performance changes by comparing request flows

RR Sambasivan, AX Zheng, M De Rosa… - … USENIX Symposium on …, 2011 - usenix.org

The causes of performance changes in a distributed system often elude even its developers.
This paper develops a new technique for gaining insight into such changes: comparing …

Cited by 266

Managing variability in the IO performance of petascale storage systems

J Lofstead, F Zheng, Q Liu, S Klasky… - SC'10: Proceedings …, 2010 - ieeexplore.ieee.org

Significant challenges exist for achieving peak or even consistent levels of performance
when using IO systems at scale. They stem from sharing IO system resources across the …

Cited by 249

Sifter: Scalable sampling for distributed traces, without feature engineering

P Las-Casas, G Papakerashvili, V Anand… - Proceedings of the ACM …, 2019 - dl.acm.org

Distributed tracing is a core component of cloud and datacenter systems, and provides
visibility into their end-to-end runtime behavior. To reduce computational and storage …

Cited by 49

Statistical debugging for real-world performance problems

L Song, S Lu - ACM SIGPLAN Notices, 2014 - dl.acm.org

Design and implementation defects that lead to inefficient computation widely exist in
software. These defects are difficult to avoid and discover. They lead to severe performance …

Cited by 115

Limplock: Understanding the impact of limpware on scale-out cloud systems

T Do, M Hao, T Leesatapornwongsa… - Proceedings of the 4th …, 2013 - dl.acm.org

We highlight one often-overlooked cause of performance failure: limpware, "limping"
hardware whose performance degrades significantly compared to its specification. We …

Cited by 95

IASO: A Fail-Slow Detection and Mitigation Framework for Distributed Storage Services

B Panda, D Srinivasan, H Ke, K Gupta, V Khot… - 2019 USENIX Annual …, 2019 - usenix.org

We address the problem of “fail-slow” fault, a fault where a hardware or software component
can still function (does not fail-stop) but in much lower performance than expected. To …

Cited by 45

Black-Box Problem Diagnosis in Parallel File Systems.

X-ray: Automating Root-Cause diagnosis of performance anomalies in production software
