Large language model inference acceleration: A comprehensive hardware perspective

J Li, J Xu, S Huang, Y Chen, W Li, J Liu, Y Lian… - arXiv preprint arXiv …, 2024 - arxiv.org
Large Language Models (LLMs) have demonstrated remarkable capabilities across various
fields, from natural language understanding to text generation. Compared to non-generative …

Memory-Centric Computing: Recent Advances in Processing-in-DRAM

O Mutlu, A Olgun, GF Oliveira… - 2024 IEEE International …, 2024 - ieeexplore.ieee.org
Memory-centric computing aims to enable computation capability in and near all places
where data is generated and stored. As such, it can greatly reduce the large negative …

Duplex: A Device for Large Language Models with Mixture of Experts, Grouped Query Attention, and Continuous Batching

S Yun, K Kyung, J Cho, J Choi, J Kim… - 2024 57th IEEE/ACM …, 2024 - ieeexplore.ieee.org
Large language models (LLMs) have emerged due to their capability to generate high-
quality content across diverse contexts. To reduce their explosively increasing demands for …

PyGim: An Efficient Graph Neural Network Library for Real Processing-In-Memory Architectures

C Giannoula, P Yang, I Fernandez, J Yang… - Proceedings of the …, 2024 - dl.acm.org
Graph Neural Networks (GNNs) are emerging models for analyzing graph-structured data. GNN
execution involves both compute-intensive and memory-intensive kernels. The latter kernels …

A Survey: Collaborative Hardware and Software Design in the Era of Large Language Models

C Guo, F Cheng, Z Du, J Kiessling, J Ku… - IEEE Circuits and …, 2025 - ieeexplore.ieee.org
The rapid development of large language models (LLMs) has significantly transformed the
field of artificial intelligence, demonstrating remarkable capabilities in natural language …

LLMServingSim: A HW/SW Co-Simulation Infrastructure for LLM Inference Serving at Scale

J Cho, M Kim, H Choi, G Heo… - 2024 IEEE International …, 2024 - ieeexplore.ieee.org
Recently, there has been an extensive research effort in building efficient large language
model (LLM) inference serving systems. These efforts not only include innovations in the …

INF^2: High-Throughput Generative Inference of Large Language Models using Near-Storage Processing

H Jang, S Noh, C Shin, J Jung, J Song… - arXiv preprint arXiv …, 2025 - arxiv.org
The growing memory and computational demands of large language models (LLMs) for
generative inference present significant challenges for practical deployment. One promising …

PAPI: Exploiting Dynamic Parallelism in Large Language Model Decoding with a Processing-In-Memory-Enabled Computing System

Y He, H Mao, C Giannoula, M Sadrosadati… - arXiv preprint arXiv …, 2025 - arxiv.org
Large language models (LLMs) are widely used for natural language understanding and
text generation. An LLM relies on a time-consuming step called LLM decoding to …

PIM Is All You Need: A CXL-Enabled GPU-Free System for Large Language Model Inference

Y Gu, A Khadem, S Umesh, N Liang, X Servot… - arXiv preprint arXiv …, 2025 - arxiv.org
Large Language Model (LLM) inference generates tokens autoregressively, one at a
time, which exhibits notably lower operational intensity compared to earlier …

GenAI at the Edge: Comprehensive Survey on Empowering Edge Devices

M Navardi, R Aalishah, Y Fu, Y Lin, H Li… - arXiv preprint arXiv …, 2025 - arxiv.org
Generative Artificial Intelligence (GenAI) applies models and algorithms such as Large
Language Models (LLMs) and Foundation Models (FMs) to generate new data. GenAI, as a …