Toward High-Performance LLM Serving: A Simulation-Based Approach for Identifying Optimal Parallelism

YC Lin, W Kwon, R Pineda, FN Paravecino - arXiv preprint arXiv …, 2024 - arxiv.org
Serving Large Language Models (LLMs) efficiently has become crucial. LLMs are often
served on multiple devices using techniques like data, pipeline, and tensor parallelism …
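
The snippet names data, pipeline, and tensor parallelism without detail. As background, a minimal NumPy sketch of the tensor-parallel idea, splitting a linear layer's weight across devices; all names and shapes here are illustrative, not taken from the paper:

    import numpy as np

    def tensor_parallel_linear(x, weight, num_devices=2):
        # Split the weight matrix column-wise, one shard per "device"
        # (plain array shards here stand in for GPUs).
        shards = np.split(weight, num_devices, axis=1)
        # Each device computes its slice of the output independently.
        partial_outputs = [x @ w for w in shards]
        # Concatenating the partials plays the role of the all-gather step.
        return np.concatenate(partial_outputs, axis=-1)

    x = np.random.randn(4, 8)   # batch of 4 activations, hidden size 8
    w = np.random.randn(8, 16)  # full weight matrix
    y = tensor_parallel_linear(x, w)
    assert np.allclose(y, x @ w)  # sharded result matches the unsharded one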

Make LLM Inference Affordable to Everyone: Augmenting GPU Memory with NDP-DIMM

L Liu, S Zhao, B Li, H Ren, Z Xu, M Wang, X Li… - arXiv preprint arXiv …, 2025 - arxiv.org
Billion-scale Large Language Models (LLMs) require deployment on expensive server-
grade GPUs with large-capacity HBM and abundant compute capability. As LLM …
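
The claim that billion-scale models outgrow a single GPU's HBM follows from simple weight-memory arithmetic. A back-of-the-envelope sketch, assuming FP16 weights and an 80 GB HBM GPU as a reference point (illustrative public figures, not numbers from the paper):

    # Weights alone: params * bytes-per-param; KV cache and activations add more.
    def weight_memory_gb(num_params_billion, bytes_per_param=2):  # 2 bytes = FP16
        return num_params_billion * 1e9 * bytes_per_param / 1e9

    for params in (7, 13, 70):
        need = weight_memory_gb(params)
        fits = "fits on" if need <= 80 else "exceeds"
        print(f"{params}B params in FP16: ~{need:.0f} GB of weights "
              f"({fits} one 80 GB HBM GPU)")

A 70B-parameter model already needs about 140 GB for weights alone, which is the gap that memory-augmentation approaches like NDP-DIMM aim to cover.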

[PDF] Advancements in Quasi-Newton Methods for Large-Scale Optimization

V Choudhary, K Mehta, S Desai, A Nair, R Iyer… - researchgate.net
Large-scale optimization problems pose significant challenges, particularly because
traditional gradient methods struggle to remain efficient in high-dimensional spaces. Quasi-Newton …
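
The snippet cuts off before any method details. For context, the textbook BFGS secant update, a standard quasi-Newton formula and not a claim about this paper's specific contribution, maintains a Hessian approximation B_k from gradient differences:

    s_k = x_{k+1} - x_k, \qquad y_k = \nabla f(x_{k+1}) - \nabla f(x_k)

    B_{k+1} = B_k - \frac{B_k s_k s_k^{\top} B_k}{s_k^{\top} B_k s_k} + \frac{y_k y_k^{\top}}{y_k^{\top} s_k}

Storing only the last few (s_k, y_k) pairs instead of B_k itself gives the limited-memory variant L-BFGS, which is what makes quasi-Newton methods practical in the high-dimensional settings the abstract mentions.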