Power-aware Deep Learning Model Serving with μ-Serve
With the increasing popularity of large deep learning model-serving workloads, there is a
pressing need to reduce the energy consumption of a model-serving cluster while …
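The snippet names the goal but not the mechanism. As a rough, hypothetical illustration of the power-latency trade-off such power-aware serving exploits, the Python sketch below picks the lowest GPU clock whose predicted per-token latency still meets the SLO; the frequency-to-latency profile is invented for the example and is not from the paper.

```python
# Hypothetical SLO-aware GPU frequency selection (an illustration, not
# μ-Serve's algorithm). Assumed profile: core frequency (MHz) -> predicted
# per-token latency (ms); all numbers are made up.
LATENCY_PROFILE = {1980: 12.0, 1600: 14.5, 1300: 18.0, 1000: 24.0}

def pick_frequency(slo_ms_per_token: float) -> int:
    """Lowest clock whose predicted latency still meets the SLO."""
    feasible = [f for f, lat in LATENCY_PROFILE.items() if lat <= slo_ms_per_token]
    # Fall back to the highest clock if no setting can meet the SLO.
    return min(feasible) if feasible else max(LATENCY_PROFILE)

print(pick_frequency(20.0))  # -> 1300: slowest clock within a 20 ms/token SLO
```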
DynamoLLM: Designing LLM Inference Clusters for Performance and Energy Efficiency
The rapid evolution and widespread adoption of generative large language models (LLMs)
have made them a pivotal workload in various applications. Today, LLM inference clusters …
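As a toy illustration of energy-aware cluster configuration in this spirit (not DynamoLLM's actual optimizer), the sketch below enumerates candidate instance-count/frequency pairs and picks the cheapest one that sustains the offered load; every number is fabricated.

```python
# Toy energy-aware configuration search (not DynamoLLM's algorithm).
# Each candidate: (instances, gpu_freq_mhz, capacity_tokens_per_s, watts);
# all values are invented for the sketch.
CONFIGS = [
    (4, 1980, 8000, 2800),
    (4, 1300, 5200, 1700),
    (8, 1300, 10400, 3400),
    (8, 1980, 16000, 5600),
]

def cheapest_config(load_tps: float) -> tuple:
    feasible = [c for c in CONFIGS if c[2] >= load_tps]
    # Prefer the lowest-power feasible config; if none can carry the load,
    # fall back to the highest-capacity one.
    return min(feasible, key=lambda c: c[3]) if feasible else max(CONFIGS, key=lambda c: c[2])

print(cheapest_config(6000))  # -> (4, 1980, 8000, 2800): lowest power at this load
```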
Queue Management for SLO-Oriented Large Language Model Serving
Large language model (LLM) serving is becoming an increasingly critical workload for cloud
providers. Existing LLM serving systems focus on interactive requests, such as chatbots and …
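As a generic illustration of SLO-oriented queueing (not the paper's queue-management algorithm), the sketch below orders waiting requests earliest-deadline-first, with deadlines derived from assumed per-class SLOs.

```python
# Earliest-deadline-first request queue; SLO classes and values are assumed.
import heapq
import time

SLO_SECONDS = {"interactive": 1.0, "batch": 300.0}

class SLOQueue:
    def __init__(self):
        self._heap = []   # (deadline, seq, request); seq breaks ties FIFO
        self._seq = 0

    def push(self, request: str, slo_class: str) -> None:
        deadline = time.monotonic() + SLO_SECONDS[slo_class]
        heapq.heappush(self._heap, (deadline, self._seq, request))
        self._seq += 1

    def pop(self) -> str:
        return heapq.heappop(self._heap)[2]

q = SLOQueue()
q.push("nightly summarization job", "batch")
q.push("chatbot turn", "interactive")
print(q.pop())  # -> "chatbot turn": the tighter SLO is served first
```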
Concise Thoughts: Impact of Output Length on LLM Reasoning and Cost
S Nayab, G Rossolini, M Simoni, A Saracino… - arXiv preprint arXiv…, 2024 - arxiv.org
Today's large language models (LLMs) can solve challenging question-answering tasks,
and prompt engineering techniques, such as chain-of-thought (CoT), have gained attention …
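The paper studies how output length drives reasoning cost; a minimal sketch of a length-constrained chain-of-thought prompt in that spirit is below. The wording and word limit are illustrative, not the authors' template.

```python
# Illustrative length-constrained CoT prompt builder; the phrasing and the
# default limit are assumptions, not the paper's exact template.
def constrained_cot(question: str, word_limit: int = 45) -> str:
    return (
        f"{question}\n"
        "Let's think step by step, but keep the reasoning and the final "
        f"answer within {word_limit} words."
    )

print(constrained_cot("A train covers 60 km in 40 minutes. What is its speed in km/h?"))
```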
TAPAS: Thermal- and Power-Aware Scheduling for LLM Inference in Cloud Platforms
The rising demand for generative large language models (LLMs) poses challenges for
thermal and power management in cloud datacenters. Traditional techniques often are …
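As a toy illustration of jointly thermal- and power-aware placement (not TAPAS itself), the sketch below sends a new inference instance to the server with the largest combined headroom; the sensor readings and the headroom formula are invented.

```python
# Toy thermal- and power-aware placement; server readings are fabricated.
SERVERS = {
    "s1": {"temp_c": 78, "temp_limit": 85, "watts": 5200, "power_cap": 6000},
    "s2": {"temp_c": 70, "temp_limit": 85, "watts": 5600, "power_cap": 6000},
}

def place_instance() -> str:
    def headroom(name: str) -> float:
        s = SERVERS[name]
        # Combine thermal headroom (deg C) and power headroom into one score;
        # the 100 W-per-point scaling is an arbitrary modeling choice.
        return min(s["temp_limit"] - s["temp_c"], (s["power_cap"] - s["watts"]) / 100)
    return max(SERVERS, key=headroom)

print(place_instance())  # -> "s1": s2 runs cooler but is nearly at its power cap
```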
LayerKV: Optimizing Large Language Model Serving with Layer-wise KV Cache Management
Y Xiong, H Wu, C Shao, Z Wang, R Zhang… - arXiv preprint arXiv…, 2024 - arxiv.org
The expanding context windows in large language models (LLMs) have greatly enhanced
their capabilities in various applications, but they also introduce significant challenges in …
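As a simplified picture of layer-wise KV placement (a sketch, not LayerKV's system), the snippet below keeps the first few layers' KV blocks on the GPU and spills the rest to host memory; the budget and the residency rule are assumptions.

```python
# Simplified layer-wise KV placement plan; `gpu_budget` and the
# keep-early-layers rule are illustrative assumptions.
def plan_kv_placement(num_layers: int, gpu_budget: int) -> dict[int, str]:
    # Each forward pass visits layers in order, so keeping early layers
    # resident lets later layers' KV be prefetched from host memory while
    # the early layers compute.
    return {
        layer: ("gpu" if layer < gpu_budget else "cpu")
        for layer in range(num_layers)
    }

print(plan_kv_placement(num_layers=8, gpu_budget=3))
# -> {0: 'gpu', 1: 'gpu', 2: 'gpu', 3: 'cpu', ..., 7: 'cpu'}
```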
Intelligent Router for LLM Workloads: Improving Performance Through Workload-Aware Scheduling
Large Language Model (LLM) workloads have distinct prefill and decode phases with
different compute and memory requirements which should ideally be accounted for when …
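The snippet's point, that prefill and decode stress compute and memory differently and a router should account for both, can be made concrete with the sketch below, which scores replicas by weighted pending prefill and decode work. The replica stats and weights are invented; this is not the paper's policy.

```python
# Workload-aware routing sketch: pick the replica with the lowest weighted
# prefill + decode load. Stats and weights are assumptions for illustration.
REPLICAS = {
    "r1": {"prefill_tokens": 12000, "decode_seqs": 20},
    "r2": {"prefill_tokens": 3000, "decode_seqs": 48},
}
PREFILL_WEIGHT = 1.0    # cost per pending prefill token (compute-bound)
DECODE_WEIGHT = 150.0   # cost per active decode sequence (memory-bound)

def route(prompt_tokens: int) -> str:
    def cost(name: str) -> float:
        r = REPLICAS[name]
        return ((r["prefill_tokens"] + prompt_tokens) * PREFILL_WEIGHT
                + r["decode_seqs"] * DECODE_WEIGHT)
    return min(REPLICAS, key=cost)

print(route(prompt_tokens=2000))  # -> "r2": less pending prefill work
```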
Don't Stop Me Now: Embedding Based Scheduling for LLMs
Efficient scheduling is crucial for interactive Large Language Model (LLM) applications,
where low request completion time directly impacts user engagement. Size-based …
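A minimal sketch of size-based scheduling with predicted output lengths is below; the paper derives its predictions from request embeddings, whereas the placeholder predictor here is a crude stand-in.

```python
# Shortest-predicted-job-first scheduling; the length predictor is a toy
# stand-in for the paper's embedding-based model.
import heapq

def predicted_len(request: str) -> int:
    # Placeholder heuristic; a real system would use a learned predictor.
    return 32 if request.endswith("?") else 256

def schedule(requests: list[str]) -> list[str]:
    heap = [(predicted_len(r), i, r) for i, r in enumerate(requests)]
    heapq.heapify(heap)
    return [heapq.heappop(heap)[2] for _ in range(len(heap))]

print(schedule(["Summarize this 10-page report", "What is 2 + 2?"]))
# -> the short predicted request runs first
```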
Fast Inference for Augmented Large Language Models
Augmented Large Language Models (LLMs) enhance the capabilities of standalone LLMs
by integrating external data sources through API calls. In interactive LLM applications …
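One concrete question such systems face is what to do with a request's KV cache while it blocks on an external API call. The sketch below illustrates a preserve/swap/discard decision driven by expected call latency versus PCIe transfer time; the thresholds and the policy itself are assumptions, not the paper's method.

```python
# Toy KV-cache policy for requests blocked on external API calls; thresholds
# and the decision rule are invented for illustration.
def kv_policy(expected_api_ms: float, kv_bytes: int, pcie_gbps: float = 16.0) -> str:
    # Round-trip time to swap the KV cache out to host memory and back.
    swap_ms = kv_bytes / (pcie_gbps * 1e9) * 1e3 * 2
    if expected_api_ms <= swap_ms:
        return "preserve"   # swapping costs more than just holding GPU memory
    if expected_api_ms <= 10_000:
        return "swap"       # free GPU memory now, restore before resuming
    return "discard"        # very long stall: recompute the prefill later

print(kv_policy(expected_api_ms=50.0, kv_bytes=2 * 1024**3))  # -> "preserve"
```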
Multi-Bin Batching for Increasing LLM Inference Throughput
As large language models (LLMs) grow in popularity for their diverse capabilities, improving
the efficiency of their inference systems has become increasingly critical. Batching LLM …
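A minimal sketch of the binning idea is below: group requests by predicted output length so one long sequence does not hold an entire batch back. The bin edges, batch size, and predicted lengths are assumptions, not the paper's configuration.

```python
# Multi-bin batching sketch: requests are binned by predicted output length,
# then batched within each bin. All parameters are illustrative.
BIN_EDGES = [64, 256, 1024]  # token-count boundaries between bins

def bin_index(pred_len: int) -> int:
    return sum(pred_len > edge for edge in BIN_EDGES)

def make_batches(reqs: list[tuple[str, int]], batch_size: int = 4):
    bins: dict[int, list[str]] = {}
    for req, pred_len in reqs:
        bins.setdefault(bin_index(pred_len), []).append(req)
    for b in sorted(bins):
        queue = bins[b]
        for i in range(0, len(queue), batch_size):
            yield queue[i : i + batch_size]

for batch in make_batches([("a", 50), ("b", 300), ("c", 40), ("d", 900)]):
    print(batch)  # -> ['a', 'c'] then ['b', 'd']
```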