Mobile edge intelligence for large language models: A contemporary survey

G Qu, Q Chen, W Wei, Z Lin, X Chen… - … Surveys & Tutorials, 2025 - ieeexplore.ieee.org
On-device large language models (LLMs), referring to running LLMs on edge devices, have
raised considerable interest since they are more cost-effective, latency-efficient, and privacy …

Tool learning with large language models: A survey

C Qu, S Dai, X Wei, H Cai, S Wang, D Yin, J Xu… - Frontiers of Computer …, 2025 - Springer
Recently, tool learning with large language models (LLMs) has emerged as a promising
paradigm for augmenting the capabilities of LLMs to tackle highly complex problems …

SpecInfer: Accelerating Generative Large Language Model Serving with Tree-based Speculative Inference and Verification

X Miao, G Oliaro, Z Zhang, X Cheng, Z Wang… - arXiv preprint arXiv …, 2023 - arxiv.org
This paper introduces SpecInfer, a system that accelerates generative large language model
(LLM) serving with tree-based speculative inference and verification. The key idea behind …

SpotServe: Serving generative large language models on preemptible instances

X Miao, C Shi, J Duan, X Xi, D Lin, B Cui… - Proceedings of the 29th …, 2024 - dl.acm.org
The high computational and memory requirements of generative large language models
(LLMs) make it challenging to serve them cheaply. This paper aims to reduce the monetary …

Efficient and green large language models for software engineering: Vision and the road ahead

J Shi, Z Yang, D Lo - ACM Transactions on Software Engineering and …, 2024 - dl.acm.org
Large Language Models (LLMs) have recently shown remarkable capabilities in various
software engineering tasks, spurring the rapid growth of the Large Language Models for …

Break the sequential dependency of LLM inference using lookahead decoding

Y Fu, P Bailis, I Stoica, H Zhang - arXiv preprint arXiv:2402.02057, 2024 - arxiv.org
Autoregressive decoding of large language models (LLMs) is memory bandwidth bounded,
resulting in high latency and significant wastes of the parallel processing power of modern …

From decoding to meta-generation: Inference-time algorithms for large language models

S Welleck, A Bertsch, M Finlayson… - arXiv preprint arXiv …, 2024 - arxiv.org
One of the most striking findings in modern research on large language models (LLMs) is
that scaling up compute during training leads to better results. However, less attention has …

Large language models and games: A survey and roadmap

R Gallotta, G Todd, M Zammit, S Earle, A Liapis… - arXiv preprint arXiv …, 2024 - arxiv.org
Recent years have seen an explosive increase in research on large language models
(LLMs), and accompanying public engagement on the topic. While starting as a niche area …

LLM inference serving: Survey of recent advances and opportunities

B Li, Y Jiang, V Gadepally, D Tiwari - arXiv preprint arXiv:2407.12391, 2024 - arxiv.org
This survey offers a comprehensive overview of recent advancements in Large Language
Model (LLM) serving systems, focusing on research since the year 2023. We specifically …

SpecInfer: Accelerating large language model serving with tree-based speculative inference and verification

X Miao, G Oliaro, Z Zhang, X Cheng, Z Wang… - Proceedings of the 29th …, 2024 - dl.acm.org
This paper introduces SpecInfer, a system that accelerates generative large language model
(LLM) serving with tree-based speculative inference and verification. The key idea behind …