Hermod: principled and practical scheduling for serverless functions

K Kaffes, NJ Yadwadkar, C Kozyrakis - … of the 13th Symposium on Cloud …, 2022 - dl.acm.org
Serverless computing has seen rapid growth due to the ease-of-use and cost-efficiency it
provides. However, function scheduling, a critical component of serverless systems, has …

Characterizing and synthesizing task dependencies of data-parallel jobs in Alibaba Cloud

H Tian, Y Zheng, W Wang - Proceedings of the ACM Symposium on …, 2019 - dl.acm.org
Cluster schedulers routinely face data-parallel jobs with complex task dependencies
expressed as DAGs (directed acyclic graphs). Understanding DAG structures and runtime …

Size-aware sharding for improving tail latencies in in-memory key-value stores

D Didona, W Zwaenepoel - 16th USENIX Symposium on Networked …, 2019 - usenix.org
This paper introduces the concept of size-aware sharding to improve tail latencies for
in-memory key-value stores, and describes its implementation in the Minos key-value store …

ε-diagnosis: Unsupervised and real-time diagnosis of small-window long-tail latency in large-scale microservice platforms

H Shan, Y Chen, H Liu, Y Zhang, X Xiao, X He… - The World Wide Web …, 2019 - dl.acm.org
Microservice architectures and container technologies are broadly adopted by giant internet
companies to support their web services, which typically have a strict service-level objective …

Progress-based container scheduling for short-lived applications in a Kubernetes cluster

Y Fu, S Zhang, J Terrero, Y Mao, G Liu… - … Conference on Big …, 2019 - ieeexplore.ieee.org
In the past decade, we have envisioned enormous growth in the data generated by different
sources, ranging from weather sensors and customer purchasing records to Internet of …

Understanding and optimizing workloads for unified resource management in large cloud platforms

C Lu, H Xu, K Ye, G Xu, L Zhang, G Yang… - Proceedings of the …, 2023 - dl.acm.org
To fully utilize computing resources, cloud providers such as Google and Alibaba choose to
co-locate online services with batch processing applications in their data centers. By …

SLearn: A Case for Task Sampling Based Learning for Cluster Job Scheduling

A Jajoo, YC Hu, X Lin, N Deng - IEEE Transactions on Cloud …, 2022 - ieeexplore.ieee.org
The ability to accurately estimate job runtime properties allows a scheduler to effectively
schedule jobs. State-of-the-art online cluster job schedulers use history-based learning …

Switches for HIRE: Resource scheduling for data center in-network computing

M Blöcher, L Wang, P Eugster, M Schmidt - Proceedings of the 26th ACM …, 2021 - dl.acm.org
The recent trend towards more programmable switching hardware in data centers opens up
new possibilities for distributed applications to leverage in-network computing (INC) …

Pigeon: An effective distributed, hierarchical datacenter job scheduler

Z Wang, H Li, Z Li, X Sun, J Rao, H Che… - Proceedings of the ACM …, 2019 - dl.acm.org
In today's datacenters, job heterogeneity makes it difficult for schedulers to simultaneously
meet latency requirements and maintain high resource utilization. The state-of-the-art …

Job scheduling for large-scale machine learning clusters

H Wang, Z Liu, H Shen - … of the 16th International Conference on …, 2020 - dl.acm.org
With the rapid proliferation of Machine Learning (ML) and Deep learning (DL) applications
running on modern platforms, it is crucial to satisfy application performance requirements …