A comprehensive overview of large language models

H Naveed, AU Khan, S Qiu, M Saqib, S Anwar… - arXiv preprint arXiv …, 2023 - arxiv.org
Large Language Models (LLMs) have recently demonstrated remarkable capabilities in
natural language processing tasks and beyond. This success of LLMs has led to a large …

FlexGen: High-throughput generative inference of large language models with a single GPU

Y Sheng, L Zheng, B Yuan, Z Li… - International …, 2023 - proceedings.mlr.press
The high computational and memory requirements of large language model (LLM) inference
make it feasible only with multiple high-end accelerators. Motivated by the emerging …

Memory-efficient fine-tuning of compressed large language models via sub-4-bit integer quantization

J Kim, JH Lee, S Kim, J Park, KM Yoo… - Advances in Neural …, 2024 - proceedings.neurips.cc
Large language models (LLMs) face the challenges in fine-tuning and deployment due to
their high memory demands and computational costs. While parameter-efficient fine-tuning …

LUT-GEMM: Quantized matrix multiplication based on LUTs for efficient inference in large-scale generative language models

G Park, B Park, M Kim, S Lee, J Kim, B Kwon… - arXiv preprint arXiv …, 2022 - arxiv.org
The recent advancements in self-supervised learning, combined with the Transformer
architecture, have enabled natural language processing (NLP) to achieve remarkably low …

LQ-LoRA: Low-rank plus quantized matrix decomposition for efficient language model finetuning

H Guo, P Greengard, EP Xing, Y Kim - arXiv preprint arXiv:2311.12023, 2023 - arxiv.org
We propose a simple approach for memory-efficient adaptation of pretrained language
models. Our approach uses an iterative algorithm to decompose each pretrained matrix into …

A comprehensive survey of compression algorithms for language models

S Park, J Choi, S Lee, U Kang - arXiv preprint arXiv:2401.15347, 2024 - arxiv.org
How can we compress language models without sacrificing accuracy? The number of
compression algorithms for language models is rapidly growing to benefit from remarkable …

NOLA: Networks as linear combination of low rank random basis

SA Koohpayegani, KL Navaneet… - UMBC Faculty …, 2023 - klnavaneet.github.io
ABSTRACT Large Language Models (LLMs) have recently gained popularity due to their
impressive few-shot performance across various downstream tasks. However, fine-tuning all …

LLM-Commentator: Novel fine-tuning strategies of large language models for automatic commentary generation using football event data

A Cook, O Karakuş - Knowledge-Based Systems, 2024 - Elsevier
Real-time commentary on football matches is a challenging task that requires precise and
coherent descriptions of events as they unfold. Traditional methods often fall short in …

Model compression and efficient inference for large language models: A survey

W Wang, W Chen, Y Luo, Y Long, Z Lin… - arXiv preprint arXiv …, 2024 - arxiv.org
Transformer based large language models have achieved tremendous success. However,
the significant memory and computational costs incurred during the inference process make …

NEO: Saving GPU memory crisis with CPU offloading for online LLM inference

X Jiang, Y Zhou, S Cao, I Stoica, M Yu - arXiv preprint arXiv:2411.01142, 2024 - arxiv.org
Online LLM inference powers many exciting applications such as intelligent chatbots and
autonomous agents. Modern LLM inference engines widely rely on request batching to …