Power-aware Deep Learning Model Serving with μ-Serve

H Qiu, W Mao, A Patke, S Cui, S Jha, C Wang… - 2024 USENIX Annual …, 2024 - usenix.org
With the increasing popularity of large deep learning model-serving workloads, there is a
pressing need to reduce the energy consumption of a model-serving cluster while …

DynamoLLM: Designing LLM Inference Clusters for Performance and Energy Efficiency

J Stojkovic, C Zhang, Í Goiri, J Torrellas… - arXiv preprint arXiv …, 2024 - arxiv.org
The rapid evolution and widespread adoption of generative large language models (LLMs)
have made them a pivotal workload in various applications. Today, LLM inference clusters …

Queue Management for SLO-Oriented Large Language Model Serving

A Patke, D Reddy, S Jha, H Qiu, C Pinto… - Proceedings of the …, 2024 - dl.acm.org
Large language model (LLM) serving is becoming an increasingly critical workload for cloud
providers. Existing LLM serving systems focus on interactive requests, such as chatbots and …

Concise thoughts: Impact of output length on LLM reasoning and cost

S Nayab, G Rossolini, M Simoni, A Saracino… - arXiv preprint arXiv …, 2024 - arxiv.org
Today's large language models (LLMs) can solve challenging question-answering tasks,
and prompt engineering techniques, such as chain-of-thought (CoT), have gained attention …

TAPAS: Thermal- and Power-Aware Scheduling for LLM Inference in Cloud Platforms

J Stojkovic, C Zhang, Í Goiri, E Choukse, H Qiu… - arXiv preprint arXiv …, 2025 - arxiv.org
The rising demand for generative large language models (LLMs) poses challenges for
thermal and power management in cloud datacenters. Traditional techniques often are …

LayerKV: Optimizing large language model serving with layer-wise KV cache management

Y Xiong, H Wu, C Shao, Z Wang, R Zhang… - arXiv preprint arXiv …, 2024 - arxiv.org
The expanding context windows in large language models (LLMs) have greatly enhanced
their capabilities in various applications, but they also introduce significant challenges in …

Intelligent router for LLM workloads: Improving performance through workload-aware scheduling

K Jain, A Parayil, A Mallick, E Choukse, X Qin… - arXiv preprint arXiv …, 2024 - arxiv.org
Large Language Model (LLM) workloads have distinct prefill and decode phases with
different compute and memory requirements which should ideally be accounted for when …

Don't Stop Me Now: Embedding Based Scheduling for LLMs

R Shahout, E Malach, C Liu, W Jiang, M Yu… - arXiv preprint arXiv …, 2024 - arxiv.org
Efficient scheduling is crucial for interactive Large Language Model (LLM) applications,
where low request completion time directly impacts user engagement. Size-based …

Fast inference for augmented large language models

R Shahout, C Liang, S Xin, Q Lao, Y Cui, M Yu… - arXiv preprint arXiv …, 2024 - arxiv.org
Augmented Large Language Models (LLMs) enhance the capabilities of standalone LLMs
by integrating external data sources through API calls. In interactive LLM applications …

Multi-Bin Batching for Increasing LLM Inference Throughput

O Guldogan, J Kunde, K Lee, R Pedarsani - arXiv preprint arXiv …, 2024 - arxiv.org
As large language models (LLMs) grow in popularity for their diverse capabilities, improving
the efficiency of their inference systems has become increasingly critical. Batching LLM …