- Academic Search

A survey on deep neural network pruning: Taxonomy, comparison, analysis, and recommendations

H Cheng, M Zhang, JQ Shi - IEEE Transactions on Pattern …, 2024 - ieeexplore.ieee.org

Modern deep neural networks, particularly recent large language models, come with
massive model sizes that require significant computational and storage resources. To …

Cited by 126

[PDF] arxiv.org

A survey of techniques for optimizing transformer inference

KT Chitty-Venkata, S Mittal, M Emani… - Journal of Systems …, 2023 - Elsevier

Recent years have seen a phenomenal rise in the performance and applications of
transformer neural networks. The family of transformer networks, including Bidirectional …

Cited by 68

[PDF] neurips.cc

LLM-Pruner: On the structural pruning of large language models

X Ma, G Fang, X Wang - Advances in neural information …, 2023 - proceedings.neurips.cc

Large language models (LLMs) have shown remarkable capabilities in language
understanding and generation. However, such impressive capability typically comes with a …

Cited by 492

[PDF] mlr.press

SparseGPT: Massive language models can be accurately pruned in one-shot

E Frantar, D Alistarh - International Conference on Machine …, 2023 - proceedings.mlr.press

We show for the first time that large-scale generative pretrained transformer (GPT) family
models can be pruned to at least 50% sparsity in one-shot, without any retraining, at minimal …

Cited by 527

[PDF] arxiv.org

A simple and effective pruning approach for large language models

M Sun, Z Liu, A Bair, JZ Kolter - arXiv preprint arXiv:2306.11695, 2023 - arxiv.org

As their size increases, Large Language Models (LLMs) are natural candidates for network
pruning methods: approaches that drop a subset of network weights while striving to …

Cited by 444

[PDF] usenix.org

InfiniGen: Efficient generative inference of large language models with dynamic KV cache management

W Lee, J Lee, J Seo, J Sim - 18th USENIX Symposium on Operating …, 2024 - usenix.org

Transformer-based large language models (LLMs) demonstrate impressive performance
across various natural language processing tasks. Serving LLM inference for generating …

Cited by 38

[PDF] arxiv.org

SqueezeLLM: Dense-and-sparse quantization

S Kim, C Hooper, A Gholami, Z Dong, X Li… - arXiv preprint arXiv …, 2023 - arxiv.org

Generative Large Language Models (LLMs) have demonstrated remarkable results for a
wide range of tasks. However, deploying these models for inference has been a significant …

Cited by 168

[PDF] neurips.cc

Speculative decoding with big little decoder

S Kim, K Mangalam, S Moon, J Malik… - Advances in …, 2024 - proceedings.neurips.cc

The recent emergence of Large Language Models based on the Transformer architecture
has enabled dramatic advancements in the field of Natural Language Processing. However …

Cited by 69

[PDF] arxiv.org

Full stack optimization of transformer inference: a survey

S Kim, C Hooper, T Wattanawong, M Kang… - arXiv preprint arXiv …, 2023 - arxiv.org

Recent advances in state-of-the-art DNN architecture design have been moving toward
Transformer models. These models achieve superior accuracy across a wide range of …

Cited by 94

[PDF] openreview.net

Shortened LLaMA: A simple depth pruning for large language models

BK Kim, G Kim, TH Kim, T Castells, S Choi… - arXiv preprint arXiv …, 2024 - openreview.net

Structured pruning of modern large language models (LLMs) has emerged as a way of
decreasing their high computational needs. Width pruning reduces the size of projection …

Cited by 41

A fast post-training pruning framework for transformers

A survey on deep neural network pruning: Taxonomy, comparison, analysis, and recommendations