LLM inference serving: Survey of recent advances and opportunities

B Li, Y Jiang, V Gadepally, D Tiwari - arXiv preprint arXiv:2407.12391, 2024 - arxiv.org
This survey offers a comprehensive overview of recent advancements in Large Language
Model (LLM) serving systems, focusing on research since the year 2023. We specifically …

LLM inference unveiled: Survey and roofline model insights

Z Yuan, Y Shang, Y Zhou, Z Dong, Z Zhou… - arXiv preprint arXiv …, 2024 - arxiv.org
The field of efficient Large Language Model (LLM) inference is rapidly evolving, presenting a
unique blend of opportunities and challenges. Although the field has expanded and is …

LoongServe: Efficiently serving long-context large language models with elastic sequence parallelism

B Wu, S Liu, Y Zhong, P Sun, X Liu, X Jin - Proceedings of the ACM …, 2024 - dl.acm.org
The context window of large language models (LLMs) is rapidly increasing, leading to a
huge variance in resource usage between different requests as well as between different …

A survey of low-bit large language models: Basics, systems, and algorithms

R Gong, Y Ding, Z Wang, C Lv, X Zheng, J Du… - arXiv preprint arXiv …, 2024 - arxiv.org
Large language models (LLMs) have achieved remarkable advancements in natural
language processing, showcasing exceptional performance across various tasks. However …

USHER: Holistic Interference Avoidance for Resource Optimized ML Inference

SS Shubha, H Shen, A Iyer - 18th USENIX Symposium on Operating …, 2024 - usenix.org
Minimizing monetary cost and maximizing the goodput of inference serving systems are
increasingly important as deep learning models grow ever more popular. While it …

NanoFlow: Towards optimal large language model serving throughput

K Zhu, Y Zhao, L Zhao, G Zuo, Y Gu, D Xie… - arXiv preprint arXiv …, 2024 - arxiv.org
The increasing usage of Large Language Models (LLMs) has resulted in a surging demand
for planet-scale serving systems, where tens of thousands of GPUs continuously serve …

InstInfer: In-storage attention offloading for cost-effective long-context LLM inference

X Pan, E Li, Q Li, S Liang, Y Shan, K Zhou… - arXiv preprint arXiv …, 2024 - arxiv.org
The widespread adoption of Large Language Models (LLMs) marks a significant milestone
in generative AI. Nevertheless, the increasing context length and batch size in offline LLM …

A survey on efficient inference for large language models

Z Zhou, X Ning, K Hong, T Fu, J Xu, S Li, Y Lou… - arXiv preprint arXiv …, 2024 - arxiv.org
Large Language Models (LLMs) have attracted extensive attention due to their remarkable
performance across various tasks. However, the substantial computational and memory …

Stateful large language model serving with Pensieve

L Yu, J Lin, J Li - arXiv preprint arXiv:2312.05516, 2023 - arxiv.org
Large Language Models (LLMs) are wildly popular today, and it is important to serve them
efficiently. Existing LLM serving systems are stateless across requests. Consequently, when …

Fast state restoration in LLM serving with HCache

S Gao, Y Chen, J Shu - arXiv preprint arXiv:2410.05004, 2024 - arxiv.org
The growing complexity of LLM usage today, e.g., multi-round conversation and retrieval-
augmented generation (RAG), makes contextual states (i.e., the KV cache) reusable across user …