InfiniGen: Efficient generative inference of large language models with dynamic KV cache management

W Lee, J Lee, J Seo, J Sim - 18th USENIX Symposium on Operating …, 2024 - usenix.org
Transformer-based large language models (LLMs) demonstrate impressive performance
across various natural language processing tasks. Serving LLM inference for generating …

Enabling resource-efficient aiot system with cross-level optimization: A survey

S Liu, B Guo, C Fang, Z Wang, S Luo… - … Surveys & Tutorials, 2023 - ieeexplore.ieee.org
The emerging field of artificial intelligence of things (AIoT, AI+IoT) is driven by the
widespread use of intelligent infrastructures and the impressive success of deep learning …

ZeRO-Infinity: Breaking the GPU memory wall for extreme scale deep learning

S Rajbhandari, O Ruwase, J Rasley, S Smith… - Proceedings of the …, 2021 - dl.acm.org
In the last three years, the largest dense deep learning models have grown over 1000x to
reach hundreds of billions of parameters, while the GPU memory has only grown by 5x (16 …

ZeRO-Offload: Democratizing billion-scale model training

J Ren, S Rajbhandari, RY Aminabadi… - 2021 USENIX Annual …, 2021 - usenix.org
Large-scale model training has been a playing ground for a limited few requiring complex
model refactoring and access to prohibitively expensive GPU clusters. ZeRO-Offload …

nnScaler: Constraint-Guided Parallelization Plan Generation for Deep Learning Training

Z Lin, Y Miao, Q Zhang, F Yang, Y Zhu, C Li… - … USENIX Symposium on …, 2024 - usenix.org
With the growing model size of deep neural networks (DNN), deep learning training is
increasingly relying on handcrafted search spaces to find efficient parallelization execution …

Deep learning-based natural language processing in human-agent interaction: Applications, advancements and challenges

N Ahmed, AK Saha, MA Al Noman, JR Jim… - Natural Language …, 2024 - Elsevier
Human-Agent Interaction is at the forefront of rapid development, with integrating
deep learning techniques into natural language processing representing significant …

AntMan: Dynamic scaling on GPU clusters for deep learning

W Xiao, S Ren, Y Li, Y Zhang, P Hou, Z Li… - … USENIX Symposium on …, 2020 - usenix.org
Efficiently scheduling deep learning jobs on large-scale GPU clusters is crucial for job
performance, system throughput, and hardware utilization. It is getting ever more …

POET: Training neural networks on tiny devices with integrated rematerialization and paging

SG Patil, P Jain, P Dutta, I Stoica… - … on Machine Learning, 2022 - proceedings.mlr.press
Fine-tuning models on edge devices like mobile phones would enable privacy-preserving
personalization over sensitive data. However, edge training has historically been limited to …

Melon: Breaking the memory wall for resource-efficient on-device machine learning

Q Wang, M Xu, C Jin, X Dong, J Yuan, X Jin… - Proceedings of the 20th …, 2022 - dl.acm.org
On-device learning is a promising technique for emerging privacy-preserving machine
learning paradigms. However, through quantitative experiments, we find that commodity …

BPipe: Memory-balanced pipeline parallelism for training large language models

T Kim, H Kim, GI Yu, BG Chun - International Conference on …, 2023 - proceedings.mlr.press
Pipeline parallelism is a key technique for training large language models within GPU
clusters. However, it often leads to a memory imbalance problem, where certain GPUs face …