On-device language models: A comprehensive review

J Xu, Z Li, W Chen, Q Wang, X Gao, Q Cai… - arXiv preprint arXiv …, 2024 - arxiv.org
The advent of large language models (LLMs) revolutionized natural language processing
applications, and running LLMs on edge devices has become increasingly attractive for …

A comprehensive survey of small language models in the era of large language models: Techniques, enhancements, applications, collaboration with LLMs, and …

F Wang, Z Zhang, X Zhang, Z Wu, T Mo, Q Lu… - arXiv preprint arXiv …, 2024 - ai.radensa.ru
Large language models (LLMs) have demonstrated emergent abilities in text generation,
question answering, and reasoning, facilitating various tasks and domains. Despite their …

D-LLM: A token adaptive computing resource allocation strategy for large language models

Y Jiang, H Wang, L Xie, H Zhao… - Advances in Neural …, 2025 - proceedings.neurips.cc
Large language models have shown impressive societal impact owing to their excellent
understanding and logical reasoning skills. However, such strong ability relies on a huge …

InstInfer: In-storage attention offloading for cost-effective long-context LLM inference

X Pan, E Li, Q Li, S Liang, Y Shan, K Zhou… - arXiv preprint arXiv …, 2024 - arxiv.org
The widespread adoption of Large Language Models (LLMs) marks a significant milestone in
generative AI. Nevertheless, the increasing context length and batch size in offline LLM …

WiP: Efficient LLM prefilling with mobile NPU

D Xu, H Zhang, L Yang, R Liu, M Xu, X Liu - Proceedings of the …, 2024 - dl.acm.org
Large language models (LLMs) play a crucial role in various Natural Language Processing
(NLP) tasks, prompting their deployment on mobile devices for inference. However, a …

HexGen-2: Disaggregated Generative Inference of LLMs in Heterogeneous Environment

Y Jiang, R Yan, B Yuan - arXiv preprint arXiv:2502.07903, 2025 - arxiv.org
Disaggregating the prefill and decoding phases represents an effective new paradigm for
generative inference of large language models (LLMs), which eliminates prefill-decoding …

Hybrid Heterogeneous Clusters Can Lower the Energy Consumption of LLM Inference Workloads

G Wilkins, S Keshav, R Mortier - Proceedings of the 15th ACM …, 2024 - dl.acm.org
Both the training and use of Large Language Models (LLMs) require large amounts of
energy. Their increasing popularity, therefore, raises critical concerns regarding the energy …

Online Workload Allocation and Energy Optimization in Large Language Model Inference Systems

G Wilkins - 2024 - grantwilkins.github.io
The rapid adoption of Large Language Models (LLMs) has advanced natural language
processing, enabling text generation, question answering, and sentiment analysis …

Smart QoS-Aware Resource Management For Edge Intelligence Systems

M Hosseinzadeh - uknowledge.uky.edu
There are several definitions of smart cities. One key point common to these definitions is
that smart cities are technologically advanced cities that connect everything in a complex …