A comprehensive overview of large language models

H Naveed, AU Khan, S Qiu, M Saqib, S Anwar… - arXiv preprint arXiv …, 2023 - arxiv.org
Large Language Models (LLMs) have recently demonstrated remarkable capabilities in
natural language processing tasks and beyond. This success of LLMs has led to a large …

FlexGen: High-throughput generative inference of large language models with a single GPU

Y Sheng, L Zheng, B Yuan, Z Li… - International …, 2023 - proceedings.mlr.press
The high computational and memory requirements of large language model (LLM) inference
make it feasible only with multiple high-end accelerators. Motivated by the emerging …

Memory-efficient fine-tuning of compressed large language models via sub-4-bit integer quantization

J Kim, JH Lee, S Kim, J Park, KM Yoo… - Advances in Neural …, 2024 - proceedings.neurips.cc
Large language models (LLMs) face the challenges in fine-tuning and deployment due to
their high memory demands and computational costs. While parameter-efficient fine-tuning …

LUT-GEMM: Quantized matrix multiplication based on LUTs for efficient inference in large-scale generative language models

G Park, B Park, M Kim, S Lee, J Kim, B Kwon… - arXiv preprint arXiv …, 2022 - arxiv.org
The recent advancements in self-supervised learning, combined with the Transformer
architecture, have enabled natural language processing (NLP) to achieve remarkably low …

LQ-LoRA: Low-rank plus quantized matrix decomposition for efficient language model finetuning

H Guo, P Greengard, EP Xing, Y Kim - arXiv preprint arXiv:2311.12023, 2023 - arxiv.org
We propose a simple approach for memory-efficient adaptation of pretrained language
models. Our approach uses an iterative algorithm to decompose each pretrained matrix into …

A comprehensive survey of compression algorithms for language models

S Park, J Choi, S Lee, U Kang - arXiv preprint arXiv:2401.15347, 2024 - arxiv.org
How can we compress language models without sacrificing accuracy? The number of
compression algorithms for language models is rapidly growing to benefit from remarkable …

NOLA: Networks as linear combination of low rank random basis

SA Koohpayegani, KL Navaneet… - UMBC Faculty …, 2023 - klnavaneet.github.io
ABSTRACT Large Language Models (LLMs) have recently gained popularity due to their
impressive few-shot performance across various downstream tasks. However, fine-tuning all …

LLM-Commentator: Novel fine-tuning strategies of large language models for automatic commentary generation using football event data

A Cook, O Karakuş - Knowledge-Based Systems, 2024 - Elsevier
Real-time commentary on football matches is a challenging task that requires precise and
coherent descriptions of events as they unfold. Traditional methods often fall short in …

Model compression and efficient inference for large language models: A survey

W Wang, W Chen, Y Luo, Y Long, Z Lin… - arXiv preprint arXiv …, 2024 - arxiv.org
Transformer based large language models have achieved tremendous success. However,
the significant memory and computational costs incurred during the inference process make …

NEO: Saving GPU memory crisis with CPU offloading for online LLM inference

X Jiang, Y Zhou, S Cao, I Stoica, M Yu - arXiv preprint arXiv:2411.01142, 2024 - arxiv.org
Online LLM inference powers many exciting applications such as intelligent chatbots and
autonomous agents. Modern LLM inference engines widely rely on request batching to …