LLM inference serving: Survey of recent advances and opportunities
This survey offers a comprehensive overview of recent advancements in Large Language
Model (LLM) serving systems, focusing on research since the year 2023. We specifically …
LLM inference unveiled: Survey and roofline model insights
The field of efficient Large Language Model (LLM) inference is rapidly evolving, presenting a
unique blend of opportunities and challenges. Although the field has expanded and is …
LoongServe: Efficiently serving long-context large language models with elastic sequence parallelism
The context window of large language models (LLMs) is rapidly increasing, leading to a
huge variance in resource usage between different requests as well as between different …
A survey of low-bit large language models: Basics, systems, and algorithms
Large language models (LLMs) have achieved remarkable advancements in natural
language processing, showcasing exceptional performance across various tasks. However …
USHER: Holistic Interference Avoidance for Resource Optimized ML Inference
Minimizing monetary cost and maximizing the goodput of inference serving systems are
increasingly important with the ever-increasing popularity of deep learning models. While it …
NanoFlow: Towards optimal large language model serving throughput
The increasing usage of Large Language Models (LLMs) has resulted in a surging demand
for planet-scale serving systems, where tens of thousands of GPUs continuously serve …
InstInfer: In-storage attention offloading for cost-effective long-context LLM inference
The widespread adoption of Large Language Models (LLMs) marks a significant milestone in
generative AI. Nevertheless, the increasing context length and batch size in offline LLM …
A survey on efficient inference for large language models
Large Language Models (LLMs) have attracted extensive attention due to their remarkable
performance across various tasks. However, the substantial computational and memory …
Stateful large language model serving with Pensieve
L Yu, J Lin, J Li - arXiv preprint arXiv:2312.05516, 2023 - arxiv.org
Large Language Models (LLMs) are wildly popular today and it is important to serve them
efficiently. Existing LLM serving systems are stateless across requests. Consequently, when …
Fast state restoration in LLM serving with HCache
The growing complexity of LLM usage today, e.g., multi-round conversation and retrieval-
augmented generation (RAG), makes contextual states (i.e., KV cache) reusable across user …
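
The last two entries (Pensieve, HCache) both concern reusing the KV cache, the contextual state that stateless servers recompute on every request. As a minimal illustration of the state being reused, a sketch using the standard Hugging Face transformers API; the model choice (gpt2) and prompts are placeholders, not taken from either paper:

# Minimal sketch of KV-cache reuse across conversation turns.
# This shows the mechanism only, not Pensieve's or HCache's system design.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Turn 1: prefill the prompt and keep the KV cache instead of discarding it.
turn1 = tok("User: Summarize attention.\nAssistant:", return_tensors="pt")
with torch.no_grad():
    out1 = model(**turn1, use_cache=True)
kv_state = out1.past_key_values  # the reusable contextual state

# Turn 2: feed only the new tokens; the cached state stands in for
# re-running the entire conversation prefix through the model.
turn2 = tok(" User: Shorter, please.\nAssistant:", return_tensors="pt")
with torch.no_grad():
    out2 = model(turn2.input_ids, past_key_values=kv_state, use_cache=True)

A stateless serving system rebuilds kv_state from scratch on each turn; per their snippets above, Pensieve keeps this state across requests, while HCache focuses on restoring it quickly when it must be reloaded.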