Academic Search

Serving DNNs like clockwork: Performance predictability from the bottom up

A Gujarati, R Karimi, S Alzayat, W Hao… - … USENIX Symposium on …, 2020 - usenix.org

Machine learning inference is becoming a core building block for interactive web
applications. As a result, the underlying model serving systems on which these applications …

Cited by 299, 14 versions

Amazon Redshift re-invented

N Armenatzoglou, S Basu, N Bhanoori, M Cai… - Proceedings of the …, 2022 - dl.acm.org

In 2013, Amazon Web Services revolutionized the data warehousing industry by launching
Amazon Redshift, the first fully-managed, petabyte-scale, enterprise-grade cloud data …

Cited by 91, 7 versions

Swisslog: Robust and unified deep learning based log anomaly detection for diverse faults

X Li, P Chen, L Jing, Z He, G Yu - 2020 IEEE 31st International …, 2020 - ieeexplore.ieee.org

Log-based anomaly detection has been widely studied and achieves a satisfying
performance on stable log data. But, the existing approaches still fall short meeting these …

Cited by 130, 2 versions

Automap: Diagnose your microservice-based web applications automatically

M Ma, J Xu, Y Wang, P Chen, Z Zhang… - Proceedings of The Web …, 2020 - dl.acm.org

The high complexity and dynamics of the microservice architecture make its application
diagnosis extremely challenging. Static troubleshooting approaches may fail to obtain …

Cited by 133, 2 versions

Towards Domain-Specific network transport for distributed DNN training

H Wang, H Tian, J Chen, X Wan, J Xia, G Zeng… - … USENIX Symposium on …, 2024 - usenix.org

The nature of machine learning (ML) applications exposes rich characteristics to underlying
network transport, yet little work has been done so far to systematically exploit these …

Cited by 16, 8 versions

Fail-slow at scale: Evidence of hardware performance faults in large production systems

HS Gunawi, RO Suminto, R Sears, C Golliher… - ACM Transactions on …, 2018 - dl.acm.org

Fail-slow hardware is an under-studied failure mode. We present a study of 114 reports of
fail-slow hardware incidents, collected from large-scale cluster deployments in 14 …

Cited by 177, 15 versions

Heterogeneous anomaly detection for software systems via semi-supervised cross-modal attention

C Lee, T Yang, Z Chen, Y Su, Y Yang… - 2023 IEEE/ACM 45th …, 2023 - ieeexplore.ieee.org

Prompt and accurate detection of system anomalies is essential to ensure the reliability of
software systems. Unlike manual efforts that exploit all available run-time information …

Cited by 32, 6 versions

Towards intelligent incident management: why we need it and how we make it

Z Chen, Y Kang, L Li, X Zhang, H Zhang, H Xu… - Proceedings of the 28th …, 2020 - dl.acm.org

The management of cloud service incidents (unplanned interruptions or outages of a
service/product) greatly affects customer satisfaction and business revenue. After years of …

Cited by 94, 4 versions

Taurus: a data plane architecture for per-packet ML

T Swamy, A Rucker, M Shahbaz, I Gaur… - Proceedings of the 27th …, 2022 - dl.acm.org

Emerging applications---cloud computing, the internet of things, and augmented/virtual
reality---demand responsive, secure, and scalable datacenter networks. These networks …

Cited by 94, 5 versions

NetBouncer: Active device and link failure localization in data center networks

C Tan, Z Jin, C Guo, T Zhang, H Wu, K Deng… - … USENIX Symposium on …, 2019 - usenix.org

The availability of data center services is jeopardized by various network incidents. One of
the biggest challenges for network incident handling is to accurately localize the failures …

Cited by 127, 14 versions

Saved in My library

Gray failure: The achilles' heel of cloud-scale systems

Serving DNNs like clockwork: Performance predictability from the bottom up
