Mobile edge intelligence for large language models: A contemporary survey
On-device large language models (LLMs), referring to running LLMs on edge devices, have
raised considerable interest since they are more cost-effective, latency-efficient, and privacy …
Efficient large language models: A survey
Large Language Models (LLMs) have demonstrated remarkable capabilities in important
tasks such as natural language understanding and language generation, and thus have the …
InfiniGen: Efficient generative inference of large language models with dynamic KV cache management
Transformer-based large language models (LLMs) demonstrate impressive performance
across various natural language processing tasks. Serving LLM inference for generating …
Cost-Efficient large language model serving for multi-turn conversations with CachedAttention
Interacting with humans through multi-turn conversations is a fundamental feature of large
language models (LLMs). However, existing LLM serving engines executing multi-turn …
Efficiently Programming Large Language Models using SGLang
Large language models (LLMs) are increasingly used for complex tasks that require multiple
generation calls, advanced prompting techniques, control flow, and structured …
Towards efficient generative large language model serving: A survey from algorithms to systems
In the rapidly evolving landscape of artificial intelligence (AI), generative large language
models (LLMs) stand at the forefront, revolutionizing how we interact with our data. However …
MInference 1.0: Accelerating pre-filling for long-context LLMs via dynamic sparse attention
The computational challenges of Large Language Model (LLM) inference remain a
significant barrier to their widespread deployment, especially as prompt lengths continue to …
Transformers are multi-state RNNs
Transformers are considered conceptually different from the previous generation of state-of-the-art NLP models, recurrent neural networks (RNNs). In this work, we demonstrate that …
Personal llm agents: Insights and survey about the capability, efficiency and security
Since the advent of personal computing devices, intelligent personal assistants (IPAs) have
been one of the key technologies that researchers and engineers have focused on, aiming …
SGLang: Efficient execution of structured language model programs
Large language models (LLMs) are increasingly used for complex tasks that require multiple
generation calls, advanced prompting techniques, control flow, and structured …