A survey on deep neural network pruning: Taxonomy, comparison, analysis, and recommendations
Modern deep neural networks, particularly recent large language models, come with
massive model sizes that require significant computational and storage resources. To …
Client selection in federated learning: Principles, challenges, and opportunities
As a privacy-preserving paradigm for training machine learning (ML) models, federated
learning (FL) has received tremendous attention from both industry and academia. In a …
DepGraph: Towards any structural pruning
Structural pruning enables model acceleration by removing structurally-grouped parameters
from neural networks. However, the parameter-grouping patterns vary widely across …
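The grouping problem described above can be illustrated with a toy example: removing output neurons from one layer forces the matching input weights of the next layer to be removed as well. Below is a minimal PyTorch sketch of that coupling, using a simple L2-norm criterion and a hand-tracked dependency rather than DepGraph's automatic dependency graph; layer names and sizes are made up for illustration.

```python
import torch
import torch.nn as nn

# Two coupled layers: pruning fc1's output features forces pruning fc2's input features.
fc1 = nn.Linear(in_features=16, out_features=8)
fc2 = nn.Linear(in_features=8, out_features=4)

# Score each of fc1's output neurons by the L2 norm of its weight row (one simple criterion).
scores = fc1.weight.detach().norm(p=2, dim=1)               # shape: (8,)
keep = torch.argsort(scores, descending=True)[:4]           # keep the 4 strongest neurons
keep, _ = torch.sort(keep)

# Rebuild both layers with the structurally-grouped parameters removed together.
pruned_fc1 = nn.Linear(16, len(keep))
pruned_fc1.weight.data = fc1.weight.data[keep].clone()      # drop rows of fc1
pruned_fc1.bias.data = fc1.bias.data[keep].clone()

pruned_fc2 = nn.Linear(len(keep), 4)
pruned_fc2.weight.data = fc2.weight.data[:, keep].clone()   # drop the matching columns of fc2
pruned_fc2.bias.data = fc2.bias.data.clone()

x = torch.randn(2, 16)
print(pruned_fc2(pruned_fc1(x)).shape)                      # torch.Size([2, 4])
```

DepGraph's contribution is discovering such coupled groups automatically across arbitrary architectures; the sketch only shows why grouped parameters in adjacent layers must be pruned together.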
SparseGPT: Massive language models can be accurately pruned in one-shot
We show for the first time that large-scale generative pretrained transformer (GPT) family
models can be pruned to at least 50% sparsity in one-shot, without any retraining, at minimal …
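For orientation, the claim above concerns one-shot pruning to 50% unstructured sparsity without retraining. SparseGPT itself relies on an approximate layer-wise weight reconstruction; the sketch below shows only the much simpler magnitude baseline such methods are typically compared against (the layer size and in-place helper are illustrative, not from the paper).

```python
import torch
import torch.nn as nn

@torch.no_grad()
def magnitude_prune_(linear: nn.Linear, sparsity: float = 0.5) -> None:
    """One-shot baseline: zero out the smallest-magnitude weights of a layer in place."""
    w = linear.weight
    k = int(w.numel() * sparsity)                       # number of weights to remove
    threshold = w.abs().flatten().kthvalue(k).values    # k-th smallest magnitude
    mask = (w.abs() > threshold).to(w.dtype)
    w.mul_(mask)                                        # keep roughly (1 - sparsity) of the weights

layer = nn.Linear(1024, 1024)
magnitude_prune_(layer, sparsity=0.5)
print((layer.weight == 0).float().mean())               # ≈ 0.5
```

At GPT scale, pure magnitude pruning at this sparsity costs noticeable accuracy, which is the gap the paper's reconstruction step targets.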
Towards automated circuit discovery for mechanistic interpretability
Through considerable effort and intuition, several recent works have reverse-engineered
nontrivial behaviors of transformer models. This paper systematizes the mechanistic …
Taxonomy of risks posed by language models
Responsible innovation on large-scale Language Models (LMs) requires foresight into and
in-depth understanding of the risks these models may pose. This paper develops a …
FlashAttention: Fast and memory-efficient exact attention with IO-awareness
Transformers are slow and memory-hungry on long sequences, since the time and memory
complexity of self-attention are quadratic in sequence length. Approximate attention …
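The quadratic time and memory cost mentioned above comes from materializing the full N x N attention matrix. A minimal PyTorch sketch of standard exact attention makes that cost explicit (shapes are illustrative).

```python
import math
import torch

def naive_attention(q, k, v):
    """Exact attention that materializes the full (N x N) score matrix."""
    d = q.shape[-1]
    scores = q @ k.transpose(-2, -1) / math.sqrt(d)   # (batch, N, N): quadratic in sequence length
    probs = torch.softmax(scores, dim=-1)
    return probs @ v                                  # (batch, N, d)

q = k = v = torch.randn(1, 4096, 64)
out = naive_attention(q, k, v)                        # the 4096 x 4096 score matrix dominates memory
print(out.shape)                                      # torch.Size([1, 4096, 64])
```

FlashAttention computes the same exact output but tiles the computation so the full score matrix is never written to GPU high-bandwidth memory.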
A simple and effective pruning approach for large language models
As their size increases, Large Language Models (LLMs) are natural candidates for network
pruning methods: approaches that drop a subset of network weights while striving to …
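As commonly described, the criterion proposed in this paper (Wanda) scores each weight by its magnitude multiplied by the norm of the corresponding input activation, measured on a small calibration set. A simplified per-layer sketch of that idea follows, with a made-up calibration batch and comparison groups reduced to one rule per output row.

```python
import torch
import torch.nn as nn

@torch.no_grad()
def prune_by_weight_times_activation(linear: nn.Linear, calib_x: torch.Tensor, sparsity: float = 0.5):
    """Score = |weight| * L2 norm of the matching input feature over a calibration batch."""
    act_norm = calib_x.norm(p=2, dim=0)                         # (in_features,)
    score = linear.weight.abs() * act_norm                      # (out_features, in_features)
    # Remove the lowest-scoring weights within each output row.
    k = int(linear.in_features * sparsity)
    pruned_idx = torch.topk(score, k, dim=1, largest=False).indices
    mask = torch.ones_like(linear.weight)
    mask.scatter_(1, pruned_idx, 0.0)
    linear.weight.mul_(mask)

layer = nn.Linear(512, 512)
calib_x = torch.randn(128, 512)                                 # hypothetical calibration activations
prune_by_weight_times_activation(layer, calib_x, sparsity=0.5)
print((layer.weight == 0).float().mean())                        # ≈ 0.5
```

Unlike magnitude-only pruning, the activation term lets the score reflect which input features actually carry large values at inference time.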
Patch diffusion: Faster and more data-efficient training of diffusion models
Diffusion models are powerful, but they require a lot of time and data to train. We propose
Patch Diffusion, a generic patch-wise training framework, to significantly reduce the training …
Sheared LLaMA: Accelerating language model pre-training via structured pruning
The popularity of LLaMA (Touvron et al., 2023a; b) and other recently emerged moderate-
sized large language models (LLMs) highlights the potential of building smaller yet powerful …