A Review on Edge Large Language Models: Design, Execution, and Applications

Y Zheng, Y Chen, B Qian, X Shi, Y Shu… - ACM Computing …, 2024 - dl.acm.org
Large language models (LLMs) have revolutionized natural language processing with their
exceptional understanding, synthesizing, and reasoning capabilities. However, deploying …

LLM inference serving: Survey of recent advances and opportunities

B Li, Y Jiang, V Gadepally, D Tiwari - arXiv preprint arXiv:2407.12391, 2024 - arxiv.org
This survey offers a comprehensive overview of recent advancements in Large Language
Model (LLM) serving systems, focusing on research since the year 2023. We specifically …

A survey on efficient inference for large language models

Z Zhou, X Ning, K Hong, T Fu, J Xu, S Li, Y Lou… - arXiv preprint arXiv …, 2024 - arxiv.org
Large Language Models (LLMs) have attracted extensive attention due to their remarkable
performance across various tasks. However, the substantial computational and memory …

LoongServe: Efficiently serving long-context large language models with elastic sequence parallelism

B Wu, S Liu, Y Zhong, P Sun, X Liu, X Jin - Proceedings of the ACM …, 2024 - dl.acm.org
The context window of large language models (LLMs) is rapidly increasing, leading to a
huge variance in resource usage between different requests as well as between different …

Fast distributed inference serving for large language models

B Wu, Y Zhong, Z Zhang, S Liu, F Liu, Y Sun… - arXiv preprint arXiv …, 2023 - arxiv.org
Large language models (LLMs) power a new generation of interactive AI applications
exemplified by ChatGPT. The interactive nature of these applications demands low latency …

Parrot: Efficient Serving of LLM-based Applications with Semantic Variable

C Lin, Z Han, C Zhang, Y Yang, F Yang… - … USENIX Symposium on …, 2024 - usenix.org
The rise of large language models (LLMs) has enabled LLM-based applications (aka AI
agents or co-pilots), a new software paradigm that combines the strength of LLM and …

Vidur: A large-scale simulation framework for LLM inference

A Agrawal, N Kedia, J Mohan… - Proceedings of …, 2024 - proceedings.mlsys.org
Large language models (LLMs) are widely used in various domains for their ability to
perform tasks that require human-like skills. However, LLM inference is expensive today …

RAGCache: Efficient knowledge caching for retrieval-augmented generation

C Jin, Z Zhang, X Jiang, F Liu, X Liu, X Liu… - arXiv preprint arXiv …, 2024 - arxiv.org
Retrieval-Augmented Generation (RAG) has shown significant improvements in various
natural language processing tasks by integrating the strengths of large language models …

Mooncake: A KVCache-centric disaggregated architecture for LLM serving

R Qin, Z Li, W He, M Zhang, Y Wu, W Zheng… - arXiv preprint arXiv …, 2024 - arxiv.org
Mooncake is the serving platform for Kimi, a leading LLM service provided by Moonshot AI. It
features a KVCache-centric disaggregated architecture that separates the prefill and …

Empowering 1000 tokens/second on-device LLM prefilling with mllm-NPU

D Xu, H Zhang, L Yang, R Liu, G Huang, M Xu… - arXiv preprint arXiv …, 2024 - arxiv.org
On-device large language models (LLMs) are catalyzing novel mobile applications such as
UI task automation and personalized email auto-reply, without giving away users' private …