A survey of large language models

WX Zhao, K Zhou, J Li, T Tang, X Wang, Y Hou… - arXiv preprint arXiv …, 2023 - arxiv.org
Language is essentially a complex, intricate system of human expressions governed by
grammatical rules. It poses a significant challenge to develop capable AI algorithms for …

Understanding LLMs: A comprehensive overview from training to inference

Y Liu, H He, T Han, X Zhang, M Liu, J Tian… - arXiv preprint arXiv …, 2024 - arxiv.org
The introduction of ChatGPT has led to a significant increase in the utilization of Large
Language Models (LLMs) for addressing downstream tasks. There is an increasing focus on …

PowerInfer: Fast large language model serving with a consumer-grade GPU

Y Song, Z Mi, H Xie, H Chen - Proceedings of the ACM SIGOPS 30th …, 2024 - dl.acm.org
This paper introduces PowerInfer, a high-speed Large Language Model (LLM) inference
engine on a personal computer (PC) equipped with a single consumer-grade GPU. The key …

Efficient large language models: A survey

Z Wan, X Wang, C Liu, S Alam, Y Zheng, J Liu… - arXiv preprint arXiv …, 2023 - arxiv.org
Large Language Models (LLMs) have demonstrated remarkable capabilities in important
tasks such as natural language understanding and language generation, and thus have the …

Unlocking efficiency in large language model inference: A comprehensive survey of speculative decoding

H Xia, Z Yang, Q Dong, P Wang, Y Li, T Ge… - arXiv preprint arXiv …, 2024 - arxiv.org
To mitigate the high inference latency stemming from autoregressive decoding in Large
Language Models (LLMs), Speculative Decoding has emerged as a novel decoding …
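
The snippet breaks off before the mechanism, so a rough illustration may help: speculative decoding drafts several tokens with a cheap model, then verifies them with the large target model in one pass. The sketch below is a toy in that spirit; draft_next and target_probs are hypothetical stand-ins, and the acceptance rule is simplified relative to the real distribution-matching test.

    import random

    random.seed(0)
    VOCAB = list(range(8))

    def draft_next(context):
        # Hypothetical small draft model: cheap but inexact next-token sampler.
        return random.choice(VOCAB)

    def target_probs(context):
        # Hypothetical large target model: exact next-token distribution.
        weights = [t + 1 for t in VOCAB]
        total = sum(weights)
        return [w / total for w in weights]

    def speculative_step(context, gamma=4):
        # 1) Draft gamma tokens autoregressively with the cheap model.
        drafted, ctx = [], list(context)
        for _ in range(gamma):
            tok = draft_next(ctx)
            drafted.append(tok)
            ctx.append(tok)
        # 2) Keep the prefix of drafted tokens the target model accepts
        #    (a toy rule; real verification compares the two distributions).
        accepted, ctx = [], list(context)
        for tok in drafted:
            p = target_probs(ctx)
            if random.random() < min(1.0, p[tok] * len(VOCAB)):
                accepted.append(tok)
                ctx.append(tok)
            else:
                # 3) On the first rejection, emit one token sampled from the
                #    target model instead, so the step still makes progress.
                accepted.append(random.choices(VOCAB, weights=p)[0])
                break
        return accepted

    print(speculative_step([1, 2, 3]))

The point of the loop is that one target-model pass can validate several drafted tokens, amortizing the expensive model over multiple output positions.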

Fairness in serving large language models

Y Sheng, S Cao, D Li, B Zhu, Z Li, D Zhuo… - … USENIX Symposium on …, 2024 - usenix.org
High-demand LLM inference services (e.g., ChatGPT and BARD) support a wide range of
requests from short chat conversations to long document reading. To ensure that all client …

Skeleton-of-thought: Large language models can do parallel decoding

X Ning, Z Lin, Z Zhou, Z Wang, H Yang… - Proceedings ENLSP …, 2023 - lirias.kuleuven.be
This work aims at decreasing the end-to-end generation latency of large language models
(LLMs). One of the major causes of the high generation latency is the sequential decoding …
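
The sequential decoding the snippet names as the bottleneck is what skeleton-of-thought sidesteps: first elicit a short skeleton of points, then expand the points concurrently instead of decoding one long answer token by token. A minimal sketch of that two-stage flow, where generate is a hypothetical placeholder for an LLM call and the skeleton parsing is elided:

    from concurrent.futures import ThreadPoolExecutor

    def generate(prompt):
        # Hypothetical stand-in for a concurrent or batched LLM completion call.
        return f"[completion for: {prompt!r}]"

    def skeleton_of_thought(question):
        # Stage 1: one short, cheap request for the answer's skeleton.
        skeleton = generate(f"List 3 brief bullet points answering: {question}")
        points = ["point 1", "point 2", "point 3"]  # parsed from skeleton in practice
        # Stage 2: expand every point in parallel rather than sequentially.
        with ThreadPoolExecutor() as pool:
            expansions = pool.map(
                lambda p: generate(f"Expand {p!r} for the question: {question}"),
                points,
            )
        return "\n".join(expansions)

    print(skeleton_of_thought("Why is autoregressive decoding slow?"))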

dLoRA: Dynamically orchestrating requests and adapters for LoRA LLM serving

B Wu, R Zhu, Z Zhang, P Sun, X Liu, X Jin - 18th USENIX Symposium on …, 2024 - usenix.org
Low-rank adaptation (LoRA) is a popular approach to finetune pre-trained large language
models (LLMs) to specific domains. This paper introduces dLoRA, an inference serving …
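
For context on what is being multiplexed here: LoRA freezes the pretrained weight W and trains only a rank-r update, giving an effective weight W + (alpha / r) * B @ A. A minimal numpy sketch with illustrative shapes (not dLoRA's implementation):

    import numpy as np

    d_out, d_in, r, alpha = 6, 4, 2, 8
    rng = np.random.default_rng(0)

    W = rng.normal(size=(d_out, d_in))     # frozen pretrained weight
    A = 0.01 * rng.normal(size=(r, d_in))  # trainable down-projection
    B = np.zeros((d_out, r))               # trainable up-projection, zero-initialized

    def lora_forward(x, A, B):
        # Base path plus the scaled low-rank path; only A and B are fine-tuned.
        return W @ x + (alpha / r) * (B @ (A @ x))

    x = rng.normal(size=(d_in,))
    print(lora_forward(x, A, B))  # equals W @ x while B is still zero

Because W is shared, a serving system can keep one copy of the base model and swap per-request (A, B) pairs in and out, which is the orchestration problem the entry describes.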

Medusa: Simple LLM inference acceleration framework with multiple decoding heads

T Cai, Y Li, Z Geng, H Peng, JD Lee, D Chen… - arXiv preprint arXiv …, 2024 - arxiv.org
The inference process in Large Language Models (LLMs) is often limited due to the absence
of parallelism in the auto-regressive decoding process, resulting in most operations being …
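
As a toy illustration of the title's "multiple decoding heads": extra linear heads read the same last hidden state and each guesses a token a few positions ahead, and the base model then checks the guesses in a single pass. Everything below (shapes, weights, the hard-coded verification stub) is illustrative, not the paper's code:

    import numpy as np

    hidden, vocab, num_heads = 8, 16, 3
    rng = np.random.default_rng(0)

    # One extra linear head per lookahead offset (head k guesses position +k+1).
    head_weights = [rng.normal(size=(vocab, hidden)) for _ in range(num_heads)]

    def propose(h):
        # All heads read the same hidden state h, so proposals add no extra
        # autoregressive steps.
        return [int(np.argmax(Wk @ h)) for Wk in head_weights]

    def verify(context, candidates):
        # Stand-in for one batched base-model pass that keeps the longest
        # prefix of candidates it would also have produced (stubbed here).
        return candidates[:2]

    h = rng.normal(size=(hidden,))
    print(verify([1, 2, 3], propose(h)))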

Towards efficient generative large language model serving: A survey from algorithms to systems

X Miao, G Oliaro, Z Zhang, X Cheng, H Jin… - arXiv preprint arXiv …, 2023 - arxiv.org
In the rapidly evolving landscape of artificial intelligence (AI), generative large language
models (LLMs) stand at the forefront, revolutionizing how we interact with our data. However …