Serving DNNs like clockwork: Performance predictability from the bottom up

A Gujarati, R Karimi, S Alzayat, W Hao… - … USENIX Symposium on …, 2020 - usenix.org
Machine learning inference is becoming a core building block for interactive web
applications. As a result, the underlying model serving systems on which these applications …

Amazon Redshift re-invented

N Armenatzoglou, S Basu, N Bhanoori, M Cai… - Proceedings of the …, 2022 - dl.acm.org
In 2013, Amazon Web Services revolutionized the data warehousing industry by launching
Amazon Redshift, the first fully-managed, petabyte-scale, enterprise-grade cloud data …

Swisslog: Robust and unified deep learning based log anomaly detection for diverse faults

X Li, P Chen, L Jing, Z He, G Yu - 2020 IEEE 31st International …, 2020 - ieeexplore.ieee.org
Log-based anomaly detection has been widely studied and achieves a satisfying
performance on stable log data. But, the existing approaches still fall short meeting these …

Automap: Diagnose your microservice-based web applications automatically

M Ma, J Xu, Y Wang, P Chen, Z Zhang… - Proceedings of The Web …, 2020 - dl.acm.org
The high complexity and dynamics of the microservice architecture make its application
diagnosis extremely challenging. Static troubleshooting approaches may fail to obtain …

Towards Domain-Specific network transport for distributed DNN training

H Wang, H Tian, J Chen, X Wan, J Xia, G Zeng… - … USENIX Symposium on …, 2024 - usenix.org
The nature of machine learning (ML) applications exposes rich characteristics to underlying
network transport, yet little work has been done so far to systematically exploit these …

Fail-slow at scale: Evidence of hardware performance faults in large production systems

HS Gunawi, RO Suminto, R Sears, C Golliher… - ACM Transactions on …, 2018 - dl.acm.org
Fail-slow hardware is an under-studied failure mode. We present a study of 114 reports of
fail-slow hardware incidents, collected from large-scale cluster deployments in 14 …

Heterogeneous anomaly detection for software systems via semi-supervised cross-modal attention

C Lee, T Yang, Z Chen, Y Su, Y Yang… - 2023 IEEE/ACM 45th …, 2023 - ieeexplore.ieee.org
Prompt and accurate detection of system anomalies is essential to ensure the reliability of
software systems. Unlike manual efforts that exploit all available run-time information …

Towards intelligent incident management: why we need it and how we make it

Z Chen, Y Kang, L Li, X Zhang, H Zhang, H Xu… - Proceedings of the 28th …, 2020 - dl.acm.org
The management of cloud service incidents (unplanned interruptions or outages of a
service/product) greatly affects customer satisfaction and business revenue. After years of …

Taurus: a data plane architecture for per-packet ML

T Swamy, A Rucker, M Shahbaz, I Gaur… - Proceedings of the 27th …, 2022 - dl.acm.org
Emerging applications (cloud computing, the internet of things, and augmented/virtual
reality) demand responsive, secure, and scalable datacenter networks. These networks …

NetBouncer: Active device and link failure localization in data center networks

C Tan, Z Jin, C Guo, T Zhang, H Wu, K Deng… - … USENIX Symposium on …, 2019 - usenix.org
The availability of data center services is jeopardized by various network incidents. One of
the biggest challenges for network incident handling is to accurately localize the failures …