A survey on deep neural network pruning: Taxonomy, comparison, analysis, and recommendations
Modern deep neural networks, particularly recent large language models, come with
massive model sizes that require significant computational and storage resources. To …
LLM.int8(): 8-bit matrix multiplication for transformers at scale
Large language models have been widely adopted but require significant GPU memory for
inference. We develop a procedure for Int8 matrix multiplication for feed-forward and …
A simple and effective pruning approach for large language models
The optimal BERT surgeon: Scalable and accurate second-order pruning for large language models
Transformer-based language models have become a key building block for natural
language processing. While these models are extremely accurate, they can be too large and …