Large-scale cluster management at Google with Borg

A Verma, L Pedrosa, M Korupolu… - Proceedings of the …, 2015 - dl.acm.org
Google's Borg system is a cluster manager that runs hundreds of thousands of jobs, from
many thousands of different applications, across a number of clusters each with up to tens of …

Heracles: Improving resource efficiency at scale

D Lo, L Cheng, R Govindaraju… - Proceedings of the …, 2015 - dl.acm.org
User-facing, latency-sensitive services, such as websearch, underutilize their computing
resources during daily periods of low traffic. Reusing those resources for other tasks is rarely …

Characterization and prediction of deep learning workloads in large-scale GPU datacenters

Q Hu, P Sun, S Yan, Y Wen, T Zhang - Proceedings of the International …, 2021 - dl.acm.org
Modern GPU datacenters are critical for delivering Deep Learning (DL) models and services
in both the research community and industry. When operating a datacenter, optimization of …

Morpheus: Towards automated SLOs for enterprise clusters

SA Jyothi, C Curino, I Menache… - … USENIX symposium on …, 2016 - usenix.org
Modern resource management frameworks for large-scale analytics leave unresolved the
problematic tension between high cluster utilization and job's performance predictability …

Multi-tenant cloud data services: state-of-the-art, challenges and opportunities

V Narasayya, S Chaudhuri - … of the 2022 International Conference on …, 2022 - dl.acm.org
Enterprises are moving their business-critical workloads to public clouds at an accelerating
pace. Multi-tenancy is a crucial tenet for cloud data service providers allowing them to …

GRAPHENE: Packing and Dependency-Aware scheduling for Data-Parallel clusters

R Grandl, S Kandula, S Rao, A Akella… - 12th USENIX Symposium …, 2016 - usenix.org
We present a new cluster scheduler, GRAPHENE, aimed at jobs that have a complex
dependency structure and heterogeneous resource demands. Relaxing either of these …

TetriSched: global rescheduling with adaptive plan-ahead in dynamic heterogeneous clusters

A Tumanov, T Zhu, JW Park, MA Kozuch… - Proceedings of the …, 2016 - dl.acm.org
TetriSched is a scheduler that works in tandem with a calendaring reservation system to
continuously re-evaluate the immediate-term scheduling plan for all pending jobs (including …

SLAQ: quality-driven scheduling for distributed machine learning

H Zhang, L Stafman, A Or, MJ Freedman - Proceedings of the 2017 …, 2017 - dl.acm.org
Training machine learning (ML) models with large datasets can incur significant resource
contention on shared clusters. This training typically involves many iterations that continually …

Mercury: Hybrid centralized and distributed scheduling in large shared clusters

K Karanasos, S Rao, C Curino, C Douglas… - 2015 USENIX Annual …, 2015 - usenix.org
Datacenter-scale computing for analytics workloads is increasingly common. High
operational costs force heterogeneous applications to share cluster resources for achieving …

The elasticity and plasticity in semi-containerized co-locating cloud workload: A view from Alibaba trace

Q Liu, Z Yu - Proceedings of the ACM Symposium on Cloud …, 2018 - dl.acm.org
Cloud computing with large-scale datacenters provides great convenience and cost-
efficiency for end users. However, the resource utilization of cloud datacenters is very low …