A survey on deep neural network pruning: Taxonomy, comparison, analysis, and recommendations
Modern deep neural networks, particularly recent large language models, come with
massive model sizes that require significant computational and storage resources. To …
GPT3.int8(): 8-bit matrix multiplication for transformers at scale
Large language models have been widely adopted but require significant GPU memory for
inference. We develop a procedure for Int8 matrix multiplication for feed-forward and …
A simple and effective pruning approach for large language models
Outlier suppression: Pushing the limit of low-bit transformer language models
The Transformer architecture has become a fundamental element of widely used natural
language processing (NLP) models. With the trend toward large NLP models, the increasing …
Eliciting latent predictions from transformers with the tuned lens
We analyze transformers from the perspective of iterative inference, seeking to understand
how model predictions are refined layer by layer. To do so, we train an affine probe for each …
SqueezeLLM: Dense-and-sparse quantization
Generative Large Language Models (LLMs) have demonstrated remarkable results for a
wide range of tasks. However, deploying these models for inference has been a significant …
Outlier suppression+: Accurate quantization of large language models by equivalent and optimal shifting and scaling
Post-training quantization (PTQ) of transformer language models faces significant
challenges due to the existence of detrimental outliers in activations. We observe that these …
The optimal BERT surgeon: Scalable and accurate second-order pruning for large language models
Transformer-based language models have become a key building block for natural
language processing. While these models are extremely accurate, they can be too large and …