RetrievalAttention: Accelerating long-context LLM inference via vector retrieval

D Liu, M Chen, B Lu, H Jiang, Z Han, Q Zhang… - arXiv preprint arXiv …, 2024 - arxiv.org
Transformer-based Large Language Models (LLMs) have become increasingly important.
However, due to the quadratic time complexity of attention computation, scaling LLMs to …

NEO: Saving GPU memory crisis with CPU offloading for online LLM inference

X Jiang, Y Zhou, S Cao, I Stoica, M Yu - arXiv preprint arXiv:2411.01142, 2024 - arxiv.org
Online LLM inference powers many exciting applications such as intelligent chatbots and
autonomous agents. Modern LLM inference engines widely rely on request batching to …

ExpertFlow: Optimized expert activation and token allocation for efficient mixture-of-experts inference

X He, S Zhang, Y Wang, H Yin, Z Zeng, S Shi… - arXiv preprint arXiv …, 2024 - arxiv.org
Sparse Mixture of Experts (MoE) models, while outperforming dense Large Language
Models (LLMs) in terms of performance, face significant deployment challenges during …

DeepFlow: Serverless Large Language Model Serving at Scale

J Hu, J Xu, Z Liu, Y He, Y Chen, H Xu, J Liu… - arXiv preprint arXiv …, 2025 - arxiv.org
This paper introduces DeepFlow, a scalable and serverless AI platform designed to
efficiently serve large language models (LLMs) at scale in cloud environments. DeepFlow …

iServe: An Intent-based Serving System for LLMs

D Liakopoulos, T Hu, P Sinha… - arXiv preprint arXiv …, 2025 - arxiv.org
Large Language Models (LLMs) are becoming ubiquitous across industries, where
applications demand they fulfill diverse user intents. However, developers currently face the …

KunServe: Elastic and Efficient Large Language Model Serving with Parameter-centric Memory Management

R Cheng, Y Peng, Y Lai, X Wei, R Chen… - arXiv preprint arXiv …, 2024 - arxiv.org
The stateful nature of large language model (LLM) serving can easily throttle precious GPU
memory under load bursts or long-generation requests like chain-of-thought reasoning …

Empowering Large Language Models to Edge Intelligence: A Survey of Edge Efficient LLMs and Techniques

R Wang, Z Gao, L Zhang, S Yue, Z Gao