OliVe: Accelerating large language models via hardware-friendly outlier-victim pair quantization

C Guo, J Tang, W Hu, J Leng, C Zhang… - Proceedings of the 50th …, 2023 - dl.acm.org
Transformer-based large language models (LLMs) have achieved great success with the
growing model size. LLMs' size grows by 240× every two years, which outpaces the …

Serving heterogeneous machine learning models on multi-GPU servers with spatio-temporal sharing

S Choi, S Lee, Y Kim, J Park, Y Kwon… - 2022 USENIX Annual …, 2022 - usenix.org
As machine learning (ML) techniques are applied to a widening range of applications, high
throughput ML inference serving has become critical for online services. Such ML inference …

Neurosurgeon: Collaborative intelligence between the cloud and mobile edge

Y Kang, J Hauswald, C Gao, A Rovinski… - ACM SIGARCH …, 2017 - dl.acm.org
The computation for today's intelligent personal assistants such as Apple Siri, Google Now,
and Microsoft Cortana, is performed in the cloud. This cloud-only approach requires …

Analysis of large-scale multi-tenant GPU clusters for DNN training workloads

M Jeon, S Venkataraman, A Phanishayee… - 2019 USENIX Annual …, 2019 - usenix.org
With widespread advances in machine learning, a number of large enterprises are
beginning to incorporate machine learning models across a number of products. These …

Orion: Interference-aware, fine-grained GPU sharing for ML applications

F Strati, X Ma, A Klimovic - … of the Nineteenth European Conference on …, 2024 - dl.acm.org
GPUs are critical for maximizing the throughput-per-Watt of deep neural network (DNN)
applications. However, DNN applications often underutilize GPUs, even when using large …

Adaptive resource efficient microservice deployment in cloud-edge continuum

K Fu, W Zhang, Q Chen, D Zeng… - IEEE Transactions on …, 2021 - ieeexplore.ieee.org
User-facing services are now evolving towards the microservice architecture where a
service is built by connecting multiple microservice stages. Since the entire service is heavy …

AntMan: Dynamic scaling on GPU clusters for deep learning

W Xiao, S Ren, Y Li, Y Zhang, P Hou, Z Li… - … USENIX Symposium on …, 2020 - usenix.org
Efficiently scheduling deep learning jobs on large-scale GPU clusters is crucial for job
performance, system throughput, and hardware utilization. It is getting ever more …

Planaria: Dynamic architecture fission for spatial multi-tenant acceleration of deep neural networks

S Ghodrati, BH Ahn, JK Kim, S Kinzer… - 2020 53rd Annual …, 2020 - ieeexplore.ieee.org
Deep Neural Networks (DNNs) have reinvigorated real-world applications that rely on
learning patterns of data and are permeating into different industries and markets. Cloud …

GrandSLAm: Guaranteeing SLAs for jobs in microservices execution frameworks

RS Kannan, L Subramanian, A Raju, J Ahn… - Proceedings of the …, 2019 - dl.acm.org
The microservice architecture has dramatically reduced user effort in adopting and
maintaining servers by providing a catalog of functions as services that can be used as …

PREMA: A predictive multi-task scheduling algorithm for preemptible neural processing units

Y Choi, M Rhu - 2020 IEEE International Symposium on High …, 2020 - ieeexplore.ieee.org
To amortize cost, cloud vendors providing DNN acceleration as a service to end-users
employ consolidation and virtualization to share the underlying resources among multiple …