Mobile edge intelligence for large language models: A contemporary survey
On-device large language models (LLMs), referring to running LLMs on edge devices, have
raised considerable interest since they are more cost-effective, latency-efficient, and privacy …
Efficient large language models: A survey
Large Language Models (LLMs) have demonstrated remarkable capabilities in important
tasks such as natural language understanding and language generation, and thus have the …
InfiniGen: Efficient generative inference of large language models with dynamic KV cache management
Transformer-based large language models (LLMs) demonstrate impressive performance
across various natural language processing tasks. Serving LLM inference for generating …
Cost-Efficient large language model serving for multi-turn conversations with CachedAttention
Interacting with humans through multi-turn conversations is a fundamental feature of large
language models (LLMs). However, existing LLM serving engines executing multi-turn …
Efficiently Programming Large Language Models using SGLang
Large language models (LLMs) are increasingly used for complex tasks that require multiple
generation calls, advanced prompting techniques, control flow, and structured …
Towards efficient generative large language model serving: A survey from algorithms to systems
In the rapidly evolving landscape of artificial intelligence (AI), generative large language
models (LLMs) stand at the forefront, revolutionizing how we interact with our data. However …
MInference 1.0: Accelerating pre-filling for long-context LLMs via dynamic sparse attention
The computational challenges of Large Language Model (LLM) inference remain a
significant barrier to their widespread deployment, especially as prompt lengths continue to …
Transformers are multi-state RNNs
Transformers are considered conceptually different from the previous generation of state-of-the-art NLP models, recurrent neural networks (RNNs). In this work, we demonstrate that …
Personal llm agents: Insights and survey about the capability, efficiency and security
Since the advent of personal computing devices, intelligent personal assistants (IPAs) have
been one of the key technologies that researchers and engineers have focused on, aiming …
SGLang: Efficient execution of structured language model programs
Large language models (LLMs) are increasingly used for complex tasks that require multiple
generation calls, advanced prompting techniques, control flow, and structured …