Splitwise: Efficient generative LLM inference using phase splitting

P Patel, E Choukse, C Zhang, A Shah… - 2024 ACM/IEEE 51st …, 2024 - ieeexplore.ieee.org
Generative large language model (LLM) applications are growing rapidly, leading to large-
scale deployments of expensive and power-hungry GPUs. Our characterization of LLM …

LLM inference serving: Survey of recent advances and opportunities

B Li, Y Jiang, V Gadepally, D Tiwari - arXiv preprint arXiv:2407.12391, 2024 - arxiv.org
This survey offers a comprehensive overview of recent advancements in Large Language
Model (LLM) serving systems, focusing on research since the year 2023. We specifically …

DynamoLLM: Designing LLM inference clusters for performance and energy efficiency

J Stojkovic, C Zhang, Í Goiri, J Torrellas… - arXiv preprint arXiv …, 2024 - arxiv.org
The rapid evolution and widespread adoption of generative large language models (LLMs)
have made them a pivotal workload in various applications. Today, LLM inference clusters …

Offline energy-optimal LLM serving: Workload-based energy models for LLM inference on heterogeneous systems

G Wilkins, S Keshav, R Mortier - arXiv preprint arXiv:2407.04014, 2024 - arxiv.org
The rapid adoption of large language models (LLMs) has led to significant advances in
natural language processing and text generation. However, the energy consumed through …

Reconciling the contrasting narratives on the environmental impact of large language models

S Ren, B Tomlinson, RW Black, AW Torrance - Scientific Reports, 2024 - nature.com
The recent proliferation of large language models (LLMs) has led to divergent narratives
about their environmental impacts. Some studies highlight the substantial carbon footprint of …

PerLLM: Personalized inference scheduling with edge-cloud collaboration for diverse LLM services

Z Yang, Y Yang, C Zhao, Q Guo, W He, W Ji - arXiv preprint arXiv …, 2024 - arxiv.org
With the rapid growth in the number of large language model (LLM) users, it is difficult for
bandwidth-constrained cloud servers to simultaneously process massive LLM services in …

TAPAS: Thermal- and Power-Aware Scheduling for LLM Inference in Cloud Platforms

J Stojkovic, C Zhang, Í Goiri, E Choukse, H Qiu… - arXiv preprint arXiv …, 2025 - arxiv.org
The rising demand for generative large language models (LLMs) poses challenges for thermal and power management in cloud datacenters. Traditional techniques are often …

A survey of small language models

C Van Nguyen, X Shen, R Aponte, Y Xia… - arXiv preprint arXiv …, 2024 - arxiv.org
Small Language Models (SLMs) have become increasingly important due to their efficiency and their ability to perform various language tasks with minimal computational resources …

The unseen AI disruptions for power grids: LLM-induced transients

Y Li, M Mughees, Y Chen, YR Li - arXiv preprint arXiv:2409.11416, 2024 - arxiv.org
Recent breakthroughs in large language models (LLMs) have exhibited superior capability across major industries and stimulated multi-hundred-billion-dollar investment in AI-centric …

Datacenter power and energy management: past, present, and future

R Bianchini, C Belady, A Sivasubramaniam - IEEE Micro, 2024 - ieeexplore.ieee.org
This article surveys key past developments in cloud data center power and energy management, describes where we are today, and considers what the future could be. This topic is gaining …