Splitwise: Efficient generative LLM inference using phase splitting

P Patel, E Choukse, C Zhang, A Shah… - 2024 ACM/IEEE 51st …, 2024 - ieeexplore.ieee.org
Generative large language model (LLM) applications are growing rapidly, leading to
large-scale deployments of expensive and power-hungry GPUs. Our characterization of LLM …

Power-aware Deep Learning Model Serving with μ-Serve

H Qiu, W Mao, A Patke, S Cui, S Jha, C Wang… - 2024 USENIX Annual …, 2024 - usenix.org
With the increasing popularity of large deep learning model-serving workloads, there is a
pressing need to reduce the energy consumption of a model-serving cluster while …

Designing cloud servers for lower carbon

J Wang, DS Berger, F Kazhamiaka… - 2024 ACM/IEEE 51st …, 2024 - ieeexplore.ieee.org
To mitigate climate change, we must reduce carbon emissions from hyperscale cloud
computing. We find that cloud compute servers cause the majority of emissions in a general …

Reducing energy bloat in large model training

JW Chung, Y Gu, I Jang, L Meng, N Bansal… - Proceedings of the …, 2024 - dl.acm.org
Training large AI models on numerous GPUs consumes a massive amount of energy,
making power delivery one of the largest limiting factors in building and operating …

POLCA: Power oversubscription in LLM cloud providers

P Patel, E Choukse, C Zhang, Í Goiri, B Warrier… - arXiv preprint arXiv …, 2023 - arxiv.org
Recent innovations in large language models (LLMs) and their myriad use cases have
rapidly driven up the compute capacity demand for datacenter GPUs. Several cloud …

Characterizing Power Management Opportunities for LLMs in the Cloud

P Patel, E Choukse, C Zhang, Í Goiri, B Warrier… - Proceedings of the 29th …, 2024 - dl.acm.org
Recent innovations in large language models (LLMs) and their myriad use cases have
rapidly driven up the compute demand for datacenter GPUs. Several cloud providers and …

Towards Greener LLMs: Bringing Energy-Efficiency to the Forefront of LLM Inference

J Stojkovic, E Choukse, C Zhang, I Goiri… - arXiv preprint arXiv …, 2024 - arxiv.org
With the ubiquitous use of modern large language models (LLMs) across industries, the
inference serving for these models is ever expanding. Given the high compute and memory …

An agile pathway towards carbon-aware clouds

P Patel, T Gregersen, T Anderson - ACM SIGENERGY Energy …, 2024 - dl.acm.org
Climate change is a pressing threat to planetary well-being that can be addressed only by
rapid near-term actions across all sectors. Yet, the cloud computing sector, with its …

Guser: A GPGPU Power Stressmark Generator

Y Shan, Y Yang, X Qian, Z Yu - 2024 IEEE International …, 2024 - ieeexplore.ieee.org
A power stressmark is crucial for estimating the Thermal Design Power (TDP) of GPGPUs to
ensure efficient power control. This paper proposes Guser, the first systematic methodology …

Input-Dependent Power Usage in GPUs

T Gregersen, P Patel, E Choukse - SC24-W: Workshops of the …, 2024 - ieeexplore.ieee.org
GPUs are known to be power-hungry, and due to the boom in artificial intelligence, they are
currently the major contributors to the high power demands of upcoming datacenters. Most …