On-device language models: A comprehensive review

J Xu, Z Li, W Chen, Q Wang, X Gao, Q Cai… - arXiv preprint arXiv …, 2024 - arxiv.org
The advent of large language models (LLMs) has revolutionized natural language processing
applications, and running LLMs on edge devices has become increasingly attractive for …

Offline energy-optimal LLM serving: Workload-based energy models for LLM inference on heterogeneous systems

G Wilkins, S Keshav, R Mortier - arXiv preprint arXiv:2407.04014, 2024 - arxiv.org
The rapid adoption of large language models (LLMs) has led to significant advances in
natural language processing and text generation. However, the energy consumed through …

Queue Management for SLO-Oriented Large Language Model Serving

A Patke, D Reddy, S Jha, H Qiu, C Pinto… - Proceedings of the …, 2024 - dl.acm.org
Large language model (LLM) serving is becoming an increasingly critical workload for cloud
providers. Existing LLM serving systems focus on interactive requests, such as chatbots and …

Deferred continuous batching in resource-efficient large language model serving

Y He, Y Lu, G Alonso - Proceedings of the 4th Workshop on Machine …, 2024 - dl.acm.org
Although prior work on batched inference and parameter-efficient fine-tuning techniques
has reduced the resource requirements of large language models (LLMs), challenges …

Watermarking Large Language Models and the Generated Content: Opportunities and Challenges

R Zhang, F Koushanfar - arXiv preprint arXiv:2410.19096, 2024 - arxiv.org
The widely adopted and powerful generative large language models (LLMs) have raised
concerns about intellectual property rights violations and the spread of machine-generated …

On the Cost of Model-Serving Frameworks: An Experimental Evaluation

P De Rosa, YD Bromberg, P Felber… - 2024 IEEE …, 2024 - ieeexplore.ieee.org
In machine learning (ML), the inference phase is the process of applying pre-trained models
to new, unseen data with the objective of making predictions. During the inference phase …

Efficient LLM Scheduling by Learning to Rank

Y Fu, S Zhu, R Su, A Qiao, I Stoica, H Zhang - arXiv preprint arXiv …, 2024 - arxiv.org
In Large Language Model (LLM) inference, the output length of an LLM request is typically
not known a priori. Consequently, most LLM serving systems employ a simple …

Software Performance Engineering for Foundation Model-Powered Software (FMware)

H Zhang, S Chang, A Leung, K Thangarajah… - arXiv preprint arXiv …, 2024 - arxiv.org
The rise of Foundation Models (FMs) like Large Language Models (LLMs) is revolutionizing
software development. Despite the impressive prototypes, transforming FMware into …

IMI: In-memory Multi-job Inference Acceleration for Large Language Models

B Gao, Z Wang, Z He, T Luo, WF Wong… - Proceedings of the 53rd …, 2024 - dl.acm.org
Large Language Models (LLMs) are increasingly used in various applications but are
computationally complex and energy-consuming due to the high volume of off-chip memory …

Hybrid Heterogeneous Clusters Can Lower the Energy Consumption of LLM Inference Workloads

G Wilkins, S Keshav, R Mortier - Proceedings of the 15th ACM …, 2024 - dl.acm.org
Both the training and use of Large Language Models (LLMs) require large amounts of
energy. Their increasing popularity, therefore, raises critical concerns regarding the energy …