On-device language models: A comprehensive review
The advent of large language models (LLMs) has revolutionized natural language processing
applications, and running LLMs on edge devices has become increasingly attractive for …
Offline energy-optimal LLM serving: Workload-based energy models for LLM inference on heterogeneous systems
The rapid adoption of large language models (LLMs) has led to significant advances in
natural language processing and text generation. However, the energy consumed through …
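The entry above builds energy models from workload features. A minimal sketch of that idea in Python, with hypothetical per-device coefficients (the paper's fitted values are not reproduced here): a request's energy is estimated as a prefill term scaled by prompt tokens, a decode term scaled by generated tokens, and an idle-power term over its residency time, which is enough to compare device choices offline.

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class DeviceProfile:
        name: str
        j_per_prefill_token: float  # hypothetical fitted coefficient (J/token)
        j_per_decode_token: float   # hypothetical fitted coefficient (J/token)
        idle_watts: float           # baseline draw while a request occupies the device

    def request_energy(dev: DeviceProfile, prompt_tokens: int,
                       output_tokens: int, duration_s: float) -> float:
        """Estimated joules for one request on one device."""
        return (dev.j_per_prefill_token * prompt_tokens
                + dev.j_per_decode_token * output_tokens
                + dev.idle_watts * duration_s)

    # Offline placement on a heterogeneous pool: route the request to the
    # device with the lowest estimated energy for this workload.
    gpu = DeviceProfile("gpu", 0.02, 0.15, 60.0)
    cpu = DeviceProfile("cpu", 0.01, 0.40, 25.0)
    candidates = [(gpu, 3.0), (cpu, 12.0)]  # (device, estimated duration in s)
    best, _ = min(candidates, key=lambda c: request_energy(c[0], 512, 128, c[1]))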
Queue Management for SLO-Oriented Large Language Model Serving
Large language model (LLM) serving is becoming an increasingly critical workload for cloud
providers. Existing LLM serving systems focus on interactive requests, such as chatbots and …
Deferred continuous batching in resource-efficient large language model serving
Although prior work on batched inference and parameter-efficient fine-tuning has
reduced the resource requirements of large language models (LLMs), challenges …
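For background, the sketch below shows plain continuous (iteration-level) batching, the mechanism such systems extend; the deferral policy itself is this paper's contribution and is not reproduced here. Finished sequences free their batch slots and waiting requests are admitted at every decode iteration, rather than after the whole batch drains.

    from collections import deque

    MAX_BATCH = 8
    waiting: deque = deque()   # requests not yet admitted
    running: list = []         # sequences currently decoding

    def step(decode_one) -> None:
        """One scheduler iteration; decode_one(seq) returns True when seq finishes."""
        # Admit waiting requests into free slots at the iteration boundary.
        while waiting and len(running) < MAX_BATCH:
            running.append(waiting.popleft())
        # One decode step for every sequence in the batch; evict finished ones.
        running[:] = [seq for seq in running if not decode_one(seq)]

    # Toy usage: sequences finish after a preset number of steps.
    waiting.extend({"steps_left": n} for n in (2, 5, 1))
    def decode_one(seq):
        seq["steps_left"] -= 1
        return seq["steps_left"] == 0
    while waiting or running:
        step(decode_one)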
Watermarking Large Language Models and the Generated Content: Opportunities and Challenges
R Zhang, F Koushanfar - arXiv preprint arXiv:2410.19096, 2024 - arxiv.org
The widely adopted and powerful generative large language models (LLMs) have raised
concerns about intellectual property rights violations and the spread of machine-generated …
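As a concrete example of the schemes such surveys cover, here is a toy "green list" token watermark in Python (a generic illustration, not a method from this paper): the previous token seeds a hash-based partition of the vocabulary, generation softly boosts the logits of the green half, and a detector later tests the text for an implausible excess of green tokens.

    import hashlib
    import random

    VOCAB_SIZE = 50_000   # hypothetical vocabulary size
    GREEN_FRACTION = 0.5  # share of the vocabulary marked green per context
    BIAS = 2.0            # logit boost applied to green tokens

    def green_list(prev_token: int) -> set:
        """Deterministically derive this context's green tokens from the previous token."""
        seed = int.from_bytes(hashlib.sha256(str(prev_token).encode()).digest()[:8], "big")
        rng = random.Random(seed)
        return set(rng.sample(range(VOCAB_SIZE), int(VOCAB_SIZE * GREEN_FRACTION)))

    def bias_logits(logits: list, prev_token: int) -> list:
        """Generation side: nudge sampling toward the green list."""
        greens = green_list(prev_token)
        return [x + BIAS if i in greens else x for i, x in enumerate(logits)]

    def green_count(tokens: list) -> int:
        """Detection side: count tokens that fall in their context's green list."""
        return sum(t in green_list(p) for p, t in zip(tokens, tokens[1:]))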
On the Cost of Model-Serving Frameworks: An Experimental Evaluation
In machine learning (ML), the inference phase applies pre-trained models to new,
unseen data to make predictions. During the inference phase …
Efficient LLM Scheduling by Learning to Rank
In Large Language Model (LLM) inference, the output length of a request is typically
not known a priori. Consequently, most LLM serving systems employ a simple …
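A minimal sketch of the scheduling idea the title points to, assuming a hypothetical learned ranker: instead of predicting exact output lengths, score each queued request by its expected relative length and serve shorter-looking requests first, which approximates shortest-job-first and avoids head-of-line blocking behind long generations.

    import heapq
    from dataclasses import dataclass, field

    @dataclass(order=True)
    class Request:
        predicted_rank: float               # lower score = expected shorter output
        prompt: str = field(compare=False)  # payload, excluded from ordering

    def predict_rank(prompt: str) -> float:
        # Stand-in for a learned length ranker; any monotone score works here.
        return float(len(prompt))

    queue: list = []

    def submit(prompt: str) -> None:
        heapq.heappush(queue, Request(predict_rank(prompt), prompt))

    def next_batch(max_size: int) -> list:
        """Pop the requests predicted to finish soonest."""
        return [heapq.heappop(queue) for _ in range(min(max_size, len(queue)))]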
Software Performance Engineering for Foundation Model-Powered Software (FMware)
H Zhang, S Chang, A Leung, K Thangarajah… - arXiv preprint arXiv …, 2024 - arxiv.org
The rise of Foundation Models (FMs) like Large Language Models (LLMs) is revolutionizing
software development. Despite the impressive prototypes, transforming FMware into …
IMI: In-memory Multi-job Inference Acceleration for Large Language Models
Large Language Models (LLMs) are increasingly used in various applications but are
computationally complex and energy-consuming due to the high volume of off-chip memory …
Hybrid Heterogeneous Clusters Can Lower the Energy Consumption of LLM Inference Workloads
Both the training and use of Large Language Models (LLMs) require large amounts of
energy. Their increasing popularity, therefore, raises critical concerns regarding the energy …