A survey of large language models
Language is essentially a complex, intricate system of human expressions governed by
grammatical rules. It poses a significant challenge to develop capable AI algorithms for …
Understanding LLMs: A comprehensive overview from training to inference
The introduction of ChatGPT has led to a significant increase in the utilization of Large
Language Models (LLMs) for addressing downstream tasks. There's an increasing focus on …
PowerInfer: Fast large language model serving with a consumer-grade GPU
This paper introduces PowerInfer, a high-speed Large Language Model (LLM) inference
engine on a personal computer (PC) equipped with a single consumer-grade GPU. The key …
Efficient large language models: A survey
Large Language Models (LLMs) have demonstrated remarkable capabilities in important
tasks such as natural language understanding and language generation, and thus have the …
Unlocking efficiency in large language model inference: A comprehensive survey of speculative decoding
To mitigate the high inference latency stemming from autoregressive decoding in Large
Language Models (LLMs), Speculative Decoding has emerged as a novel decoding …
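The draft-then-verify idea behind speculative decoding can be sketched as follows. This is a minimal greedy variant under toy assumptions: `draft_next` and `target_next` are hypothetical single-token model calls standing in for real LLM forward passes, and the verification loop that a real system would run as one batched forward pass is written as sequential calls here.

```python
def speculative_decode(prefix, draft_next, target_next, k=4, max_len=16):
    """Greedy speculative decoding with toy single-token model calls."""
    out = list(prefix)
    while len(out) < max_len:
        # Cheap draft model proposes k tokens autoregressively.
        proposal = []
        for _ in range(k):
            proposal.append(draft_next(out + proposal))
        # Target model checks each proposed position (in a real system this
        # is one parallel forward pass); keep the longest agreeing prefix.
        n_accept = 0
        for i, tok in enumerate(proposal):
            if target_next(out + proposal[:i]) == tok:
                n_accept += 1
            else:
                break
        out.extend(proposal[:n_accept])
        # Always emit one token from the target so each loop makes progress.
        if len(out) < max_len:
            out.append(target_next(out))
    return out[:max_len]
```

When the draft agrees with the target, several tokens are accepted per target call; when it disagrees, the loop still advances by one target token, so output quality matches target-only decoding.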
Fairness in serving large language models
High-demand LLM inference services (e.g., ChatGPT and Bard) support a wide range of
requests from short chat conversations to long document reading. To ensure that all client …
Skeleton-of-thought: Large language models can do parallel decoding
This work aims at decreasing the end-to-end generation latency of large language models
(LLMs). One of the major causes of the high generation latency is the sequential decoding …
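The skeleton-first strategy named in the title can be sketched as a two-stage pipeline: a short sequential call produces an outline, then each point is expanded independently and in parallel. The `make_skeleton` and `expand_point` callables below are toy stand-ins for LLM calls, not any real API.

```python
from concurrent.futures import ThreadPoolExecutor

def skeleton_of_thought(question, make_skeleton, expand_point):
    """Outline first, then expand every point concurrently."""
    points = make_skeleton(question)             # short sequential stage
    with ThreadPoolExecutor() as pool:           # parallel expansion stage
        bodies = list(pool.map(lambda p: expand_point(question, p), points))
    return "\n".join(f"{p}: {b}" for p, b in zip(points, bodies))
```

Because the expansions do not depend on each other, end-to-end latency is roughly the skeleton stage plus the slowest single expansion, rather than the sum of all of them.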
dLoRA: Dynamically orchestrating requests and adapters for LoRA LLM serving
Low-rank adaptation (LoRA) is a popular approach to finetune pre-trained large language
models (LLMs) to specific domains. This paper introduces dLoRA, an inference serving …
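The low-rank adaptation that dLoRA serves can be sketched in a few lines: the frozen base weights W are augmented by a trainable rank-r update, so the layer computes W x + s · B(A x). The tiny matrix sizes and the plain-Python `matvec` helper here are illustrative; real deployments apply this to large tensors on accelerators.

```python
def matvec(m, v):
    """Plain dense matrix-vector product over nested lists."""
    return [sum(row[j] * v[j] for j in range(len(v))) for row in m]

def lora_forward(W, A, B, x, s=1.0):
    """Frozen base weights W plus the low-rank update s * B @ A, applied to x."""
    base = matvec(W, x)
    delta = matvec(B, matvec(A, x))  # passes through the rank-r bottleneck
    return [b + s * d for b, d in zip(base, delta)]
```

Only A and B are trained per domain, which is why a serving system can keep one copy of W resident and swap many small adapters across requests.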
Medusa: Simple LLM inference acceleration framework with multiple decoding heads
The inference process in Large Language Models (LLMs) is often limited due to the absence
of parallelism in the auto-regressive decoding process, resulting in most operations being …
Towards efficient generative large language model serving: A survey from algorithms to systems
In the rapidly evolving landscape of artificial intelligence (AI), generative large language
models (LLMs) stand at the forefront, revolutionizing how we interact with our data. However …