Splitwise: Efficient generative LLM inference using phase splitting

P Patel, E Choukse, C Zhang, A Shah… - 2024 ACM/IEEE 51st …, 2024 - ieeexplore.ieee.org
Generative large language model (LLM) applications are growing rapidly, leading to large-
scale deployments of expensive and power-hungry GPUs. Our characterization of LLM …

Sglang: Efficient execution of structured language model programs

L Zheng, L Yin, Z Xie, CL Sun… - Advances in …, 2025 - proceedings.neurips.cc
Large language models (LLMs) are increasingly used for complex tasks that require multiple
generation calls, advanced prompting techniques, control flow, and structured …

LLM inference serving: Survey of recent advances and opportunities

B Li, Y Jiang, V Gadepally, D Tiwari - arXiv preprint arXiv:2407.12391, 2024 - arxiv.org
This survey offers a comprehensive overview of recent advancements in Large Language
Model (LLM) serving systems, focusing on research since the year 2023. We specifically …

Large language models meet next-generation networking technologies: A review

CN Hang, PD Yu, R Morabito, CW Tan - Future Internet, 2024 - mdpi.com
The evolution of network technologies has significantly transformed global communication,
information sharing, and connectivity. Traditional networks, relying on static configurations …

Fast distributed inference serving for large language models

B Wu, Y Zhong, Z Zhang, S Liu, F Liu, Y Sun… - arXiv preprint arXiv …, 2023 - arxiv.org
Large language models (LLMs) power a new generation of interactive AI applications
exemplified by ChatGPT. The interactive nature of these applications demands low latency …

Efficiently Programming Large Language Models using SGLang.

L Zheng, L Yin, Z Xie, J Huang, C Sun, CH Yu, S Cao… - 2023 - par.nsf.gov
Large language models (LLMs) are increasingly used for complex tasks that require multiple
generation calls, advanced prompting techniques, control flow, and structured …

Andes: Defining and enhancing quality-of-experience in LLM-based text streaming services

J Liu, JW Chung, Z Wu, F Lai, M Lee… - arXiv preprint arXiv …, 2024 - arxiv.org
Large language models (LLMs) are now at the core of conversational AI services such as
real-time translation and chatbots, which provide live user interaction by incrementally …

A survey on efficient inference for large language models

Z Zhou, X Ning, K Hong, T Fu, J Xu, S Li, Y Lou… - arXiv preprint arXiv …, 2024 - arxiv.org
Large Language Models (LLMs) have attracted extensive attention due to their remarkable
performance across various tasks. However, the substantial computational and memory …

Inference without interference: Disaggregate LLM inference for mixed downstream workloads

C Hu, H Huang, L Xu, X Chen, J Xu, S Chen… - arXiv preprint arXiv …, 2024 - arxiv.org
Transformer-based large language model (LLM) inference serving is now the backbone of
many cloud services. LLM inference consists of a prefill phase and a decode phase …

Taming throughput-latency tradeoff in LLM inference with Sarathi-Serve

A Agrawal, N Kedia, A Panwar, J Mohan… - arXiv preprint arXiv …, 2024 - arxiv.org
Each LLM serving request goes through two phases. The first is prefill which processes the
entire input prompt to produce one output token and the second is decode which generates …