Power-aware Deep Learning Model Serving with μ-Serve

H Qiu, W Mao, A Patke, S Cui, S Jha, C Wang… - 2024 USENIX Annual …, 2024 - usenix.org
With the increasing popularity of large deep learning model-serving workloads, there is a
pressing need to reduce the energy consumption of a model-serving cluster while …

DynamoLLM: Designing LLM Inference Clusters for Performance and Energy Efficiency

J Stojkovic, C Zhang, Í Goiri, J Torrellas… - arXiv preprint arXiv …, 2024 - arxiv.org
The rapid evolution and widespread adoption of generative large language models (LLMs)
have made them a pivotal workload in various applications. Today, LLM inference clusters …

Queue Management for SLO-Oriented Large Language Model Serving

A Patke, D Reddy, S Jha, H Qiu, C Pinto… - Proceedings of the …, 2024 - dl.acm.org
Large language model (LLM) serving is becoming an increasingly critical workload for cloud
providers. Existing LLM serving systems focus on interactive requests, such as chatbots and …

Concise thoughts: Impact of output length on LLM reasoning and cost

S Nayab, G Rossolini, M Simoni, A Saracino… - arXiv preprint arXiv …, 2024 - arxiv.org
Today's large language models (LLMs) can solve challenging question-answering tasks,
and prompt engineering techniques, such as chain-of-thought (CoT), have gained attention …

TAPAS: Thermal- and Power-Aware Scheduling for LLM Inference in Cloud Platforms

J Stojkovic, C Zhang, Í Goiri, E Choukse, H Qiu… - arXiv preprint arXiv …, 2025 - arxiv.org
The rising demand for generative large language models (LLMs) poses challenges for
thermal and power management in cloud datacenters. Traditional techniques often are …

LayerKV: Optimizing large language model serving with layer-wise KV cache management

Y Xiong, H Wu, C Shao, Z Wang, R Zhang… - arXiv preprint arXiv …, 2024 - arxiv.org
The expanding context windows in large language models (LLMs) have greatly enhanced
their capabilities in various applications, but they also introduce significant challenges in …

Intelligent router for LLM workloads: Improving performance through workload-aware scheduling

K Jain, A Parayil, A Mallick, E Choukse, X Qin… - arXiv preprint arXiv …, 2024 - arxiv.org
Large Language Model (LLM) workloads have distinct prefill and decode phases with
different compute and memory requirements which should ideally be accounted for when …

Don't Stop Me Now: Embedding Based Scheduling for LLMs

R Shahout, E Malach, C Liu, W Jiang, M Yu… - arXiv preprint arXiv …, 2024 - arxiv.org
Efficient scheduling is crucial for interactive Large Language Model (LLM) applications,
where low request completion time directly impacts user engagement. Size-based …

Fast inference for augmented large language models

R Shahout, C Liang, S Xin, Q Lao, Y Cui, M Yu… - arXiv preprint arXiv …, 2024 - arxiv.org
Augmented Large Language Models (LLMs) enhance the capabilities of standalone LLMs
by integrating external data sources through API calls. In interactive LLM applications …

Multi-Bin Batching for Increasing LLM Inference Throughput

O Guldogan, J Kunde, K Lee, R Pedarsani - arXiv preprint arXiv …, 2024 - arxiv.org
As large language models (LLMs) grow in popularity for their diverse capabilities, improving
the efficiency of their inference systems has become increasingly critical. Batching LLM …