Cluster frameworks for efficient scheduling and resource allocation in data center networks: A survey

K Wang, Q Zhou, S Guo, J Luo - IEEE Communications Surveys …, 2018 - ieeexplore.ieee.org
Data centers are widely used for big data analytics, which often involve data-parallel jobs,
including query and web service. Meanwhile, cluster frameworks are rapidly developed for …

The CASE of FEMU: Cheap, accurate, scalable and extensible flash emulator

H Li, M Hao, MH Tong, S Sundararaman… - … USENIX Conference on …, 2018 - usenix.org
We present FEMU, a QEMU-based flash emulator for fostering future full-stack
software/hardware SSD research, with the following four "CASE" benefits. FEMU is cheap …

Fail-slow at scale: Evidence of hardware performance faults in large production systems

HS Gunawi, RO Suminto, R Sears, C Golliher… - ACM Transactions on …, 2018 - dl.acm.org
Fail-slow hardware is an under-studied failure mode. We present a study of 114 reports of
fail-slow hardware incidents, collected from large-scale cluster deployments in 14 …

A2TP: Aggregator-aware in-network aggregation for multi-tenant learning

Z Li, J Huang, Y Li, A Xu, S Zhou, J Liu… - Proceedings of the …, 2023 - dl.acm.org
Distributed Machine Learning (DML) techniques are widely used to accelerate the training of
large-scale machine learning models. However, during training iterations, gradients need to …

Perseus: A Fail-Slow detection framework for cloud storage systems

R Lu, E Xu, Y Zhang, F Zhu, Z Zhu, M Wang… - … USENIX Conference on …, 2023 - usenix.org
The newly-emerging "fail-slow" failures plague both software and hardware where the victim
components are still functioning yet with degraded performance. To address this problem …

MittOS: Supporting millisecond tail tolerance with fast rejecting SLO-aware OS interface

M Hao, H Li, MH Tong, C Pakha, RO Suminto… - Proceedings of the 26th …, 2017 - dl.acm.org
MittOS provides operating system support to cut millisecond-level tail latencies for data-
parallel applications. In MittOS, we advocate a new principle that operating system should …

Managing tail latency in datacenter-scale file systems under production constraints

PA Misra, MF Borge, Í Goiri, AR Lebeck… - Proceedings of the …, 2019 - dl.acm.org
Distributed file systems often exhibit high tail latencies, especially in large-scale datacenters
and in the presence of competing (and possibly higher priority) workloads. This paper …

IASO: A Fail-Slow Detection and Mitigation Framework for Distributed Storage Services

B Panda, D Srinivasan, H Ke, K Gupta, V Khot… - 2019 USENIX Annual …, 2019 - usenix.org
We address the problem of “fail-slow” fault, a fault where a hardware or software component
can still function (does not fail-stop) but in much lower performance than expected. To …

Reducing tail latency using duplication: A multi-layered approach

HM Bashir, AB Faisal, MA Jamshed… - Proceedings of the 15th …, 2019 - dl.acm.org
Duplication can be a powerful strategy for overcoming stragglers in cloud services, but is
often used conservatively because of the risk of overloading the system. We call for making …

The University of Chicago

W KIM - 2019 - newtraell.cs.uchicago.edu
ABSTRACT In the Node-Disjoint Paths problem (NDP), the input is an undirected n-vertex
graph G, and a collection {(s1, t1),...,(sk, tk)} of demand pairs. The goal is to route the largest …