A survey on deep neural network pruning: Taxonomy, comparison, analysis, and recommendations
Modern deep neural networks, particularly recent large language models, come with
massive model sizes that require significant computational and storage resources. To …
GPT3.int8(): 8-bit matrix multiplication for transformers at scale
Large language models have been widely adopted but require significant GPU memory for
inference. We develop a procedure for Int8 matrix multiplication for feed-forward and …
A simple and effective pruning approach for large language models
Outlier suppression: Pushing the limit of low-bit transformer language models
The Transformer architecture has become a fundamental element of widely used natural
language processing (NLP) models. With the trend toward large NLP models, the increasing …
Eliciting latent predictions from transformers with the tuned lens
We analyze transformers from the perspective of iterative inference, seeking to understand
how model predictions are refined layer by layer. To do so, we train an affine probe for each …
SqueezeLLM: Dense-and-sparse quantization
Generative Large Language Models (LLMs) have demonstrated remarkable results for a
wide range of tasks. However, deploying these models for inference has been a significant …
Outlier suppression+: Accurate quantization of large language models by equivalent and optimal shifting and scaling
Post-training quantization (PTQ) of transformer language models faces significant
challenges due to the existence of detrimental outliers in activations. We observe that these …
The optimal BERT surgeon: Scalable and accurate second-order pruning for large language models
Transformer-based language models have become a key building block for natural
language processing. While these models are extremely accurate, they can be too large and …