InfiniGen: Efficient generative inference of large language models with dynamic KV cache management
Transformer-based large language models (LLMs) demonstrate impressive performance
across various natural language processing tasks. Serving LLM inference for generating …
Enabling resource-efficient AIoT system with cross-level optimization: A survey
The emerging field of artificial intelligence of things (AIoT, AI + IoT) is driven by the
widespread use of intelligent infrastructures and the impressive success of deep learning …
ZeRO-Infinity: Breaking the GPU memory wall for extreme scale deep learning
In the last three years, the largest dense deep learning models have grown over 1000x to
reach hundreds of billions of parameters, while the GPU memory has only grown by 5x (16 …
ZeRO-Offload: Democratizing billion-scale model training
Large-scale model training has been a playing ground for a limited few requiring complex
model refactoring and access to prohibitively expensive GPU clusters. ZeRO-Offload …
nnScaler: Constraint-Guided Parallelization Plan Generation for Deep Learning Training
With the growing model size of deep neural networks (DNN), deep learning training is
increasingly relying on handcrafted search spaces to find efficient parallelization execution …
Deep learning-based natural language processing in human-agent interaction: Applications, advancements and challenges
Human-Agent Interaction is at the forefront of rapid development, with integrating
deep learning techniques into natural language processing representing significant …
AntMan: Dynamic scaling on GPU clusters for deep learning
Efficiently scheduling deep learning jobs on large-scale GPU clusters is crucial for job
performance, system throughput, and hardware utilization. It is getting ever more …
POET: Training neural networks on tiny devices with integrated rematerialization and paging
Fine-tuning models on edge devices like mobile phones would enable privacy-preserving
personalization over sensitive data. However, edge training has historically been limited to …
Melon: Breaking the memory wall for resource-efficient on-device machine learning
On-device learning is a promising technique for emerging privacy-preserving machine
learning paradigms. However, through quantitative experiments, we find that commodity …
BPipe: Memory-balanced pipeline parallelism for training large language models
Pipeline parallelism is a key technique for training large language models within GPU
clusters. However, it often leads to a memory imbalance problem, where certain GPUs face …