Efficient memory management for large language model serving with PagedAttention
High throughput serving of large language models (LLMs) requires batching sufficiently
many requests at a time. However, existing systems struggle because the key-value cache …
FlexGen: High-throughput generative inference of large language models with a single GPU
The high computational and memory requirements of large language model (LLM) inference
make it feasible only with multiple high-end accelerators. Motivated by the emerging …
Enabling resource-efficient AIoT system with cross-level optimization: A survey
The emerging field of artificial intelligence of things (AIoT, AI+IoT) is driven by the
widespread use of intelligent infrastructures and the impressive success of deep learning …
InfiniGen: Efficient generative inference of large language models with dynamic KV cache management
Transformer-based large language models (LLMs) demonstrate impressive performance
across various natural language processing tasks. Serving LLM inference for generating …
A survey on scheduling techniques in computing and network convergence
The computing demand for massive applications has led to the ubiquitous deployment of
computing power. This trend results in the urgent need for higher-level computing resource …
Pre-trained models: Past, present and future
Large-scale pre-trained models (PTMs) such as BERT and GPT have recently achieved
great success and become a milestone in the field of artificial intelligence (AI). Owing to …
ZeRO-Infinity: Breaking the GPU memory wall for extreme scale deep learning
In the last three years, the largest dense deep learning models have grown over 1000x to
reach hundreds of billions of parameters, while the GPU memory has only grown by 5x (16 …
ZeRO-Offload: Democratizing billion-scale model training
Large-scale model training has been a playing ground for a limited few requiring complex
model refactoring and access to prohibitively expensive GPU clusters. ZeRO-Offload …
Pollux: Co-adaptive cluster scheduling for goodput-optimized deep learning
Pollux improves scheduling performance in deep learning (DL) clusters by adaptively co-
optimizing inter-dependent factors both at the per-job level and at the cluster-wide level …
nnScaler: Constraint-Guided Parallelization Plan Generation for Deep Learning Training
With the growing model size of deep neural networks (DNN), deep learning training is
increasingly relying on handcrafted search spaces to find efficient parallelization execution …