Performance interference of virtual machines: A survey

W Lin, C Xiong, W Wu, F Shi, K Li, M Xu - ACM Computing Surveys, 2023 - dl.acm.org
The rapid development of cloud computing with virtualization technology has benefited both
academia and industry. For any cloud data center at scale, one of the primary challenges is …

OliVe: Accelerating large language models via hardware-friendly outlier-victim pair quantization

C Guo, J Tang, W Hu, J Leng, C Zhang… - Proceedings of the 50th …, 2023 - dl.acm.org
Transformer-based large language models (LLMs) have achieved great success with the
growing model size. LLMs' size grows by 240× every two years, which outpaces the …

Llumnix: Dynamic scheduling for large language model serving

B Sun, Z Huang, H Zhao, W Xiao, X Zhang… - … USENIX Symposium on …, 2024 - usenix.org
Inference serving for large language models (LLMs) is the key to unleashing their potential
in people's daily lives. However, efficient LLM serving remains challenging today because …

Serving heterogeneous machine learning models on Multi-GPU servers with Spatio-Temporal sharing

S Choi, S Lee, Y Kim, J Park, Y Kwon… - 2022 USENIX Annual …, 2022 - usenix.org
As machine learning (ML) techniques are applied to a widening range of applications, high
throughput ML inference serving has become critical for online services. Such ML inference …

Gandiva: Introspective cluster scheduling for deep learning

W Xiao, R Bhardwaj, R Ramjee, M Sivathanu… - … USENIX Symposium on …, 2018 - usenix.org
We introduce Gandiva, a new cluster scheduling framework that utilizes domain-specific
knowledge to improve latency and efficiency of training deep learning models in a GPU …

Analysis of Large-Scale Multi-Tenant GPU clusters for DNN training workloads

M Jeon, S Venkataraman, A Phanishayee… - 2019 USENIX Annual …, 2019 - usenix.org
With widespread advances in machine learning, a number of large enterprises are
beginning to incorporate machine learning models across a number of products. These …

Adaptive resource efficient microservice deployment in cloud-edge continuum

K Fu, W Zhang, Q Chen, D Zeng… - IEEE Transactions on …, 2021 - ieeexplore.ieee.org
User-facing services are now evolving towards the microservice architecture where a
service is built by connecting multiple microservice stages. Since the entire service is heavy …

FaaSFlow: Enable efficient workflow execution for function-as-a-service

Z Li, Y Liu, L Guo, Q Chen, J Cheng, W Zheng… - Proceedings of the 27th …, 2022 - dl.acm.org
Serverless computing (Function-as-a-Service) provides fine-grain resource sharing by
running functions (or Lambdas) in containers. Data-dependent functions are required to be …

AntMan: Dynamic scaling on GPU clusters for deep learning

W Xiao, S Ren, Y Li, Y Zhang, P Hou, Z Li… - … USENIX Symposium on …, 2020 - usenix.org
Efficiently scheduling deep learning jobs on large-scale GPU clusters is crucial for job
performance, system throughput, and hardware utilization. It is getting ever more …

Planaria: Dynamic architecture fission for spatial multi-tenant acceleration of deep neural networks

S Ghodrati, BH Ahn, JK Kim, S Kinzer… - 2020 53rd Annual …, 2020 - ieeexplore.ieee.org
Deep Neural Networks (DNNs) have reinvigorated real-world applications that rely on
learning patterns of data and are permeating into different industries and markets. Cloud …