Large-scale cluster management at Google with Borg
Google's Borg system is a cluster manager that runs hundreds of thousands of jobs, from
many thousands of different applications, across a number of clusters each with up to tens of …
Heracles: Improving resource efficiency at scale
User-facing, latency-sensitive services, such as websearch, underutilize their computing
resources during daily periods of low traffic. Reusing those resources for other tasks is rarely …
Characterization and prediction of deep learning workloads in large-scale GPU datacenters
Modern GPU datacenters are critical for delivering Deep Learning (DL) models and services
in both the research community and industry. When operating a datacenter, optimization of …
Morpheus: Towards automated SLOs for enterprise clusters
Modern resource management frameworks for large-scale analytics leave unresolved the
problematic tension between high cluster utilization and job's performance predictability …
Multi-tenant cloud data services: state-of-the-art, challenges and opportunities
Enterprises are moving their business-critical workloads to public clouds at an accelerating
pace. Multi-tenancy is a crucial tenet for cloud data service providers allowing them to …
GRAPHENE: Packing and Dependency-Aware scheduling for Data-Parallel clusters
We present a new cluster scheduler, GRAPHENE, aimed at jobs that have a complex
dependency structure and heterogeneous resource demands. Relaxing either of these …
TetriSched: global rescheduling with adaptive plan-ahead in dynamic heterogeneous clusters
TetriSched is a scheduler that works in tandem with a calendaring reservation system to
continuously re-evaluate the immediate-term scheduling plan for all pending jobs (including …
Slaq: quality-driven scheduling for distributed machine learning
Training machine learning (ML) models with large datasets can incur significant resource
contention on shared clusters. This training typically involves many iterations that continually …
Mercury: Hybrid centralized and distributed scheduling in large shared clusters
Datacenter-scale computing for analytics workloads is increasingly common. High
operational costs force heterogeneous applications to share cluster resources for achieving …
The elasticity and plasticity in semi-containerized co-locating cloud workload: A view from alibaba trace
Q Liu, Z Yu - Proceedings of the ACM Symposium on Cloud …, 2018 - dl.acm.org
Cloud computing with large-scale datacenters provides great convenience and cost-
efficiency for end users. However, the resource utilization of cloud datacenters is very low …