Deja Vu: Contextual sparsity for efficient LLMs at inference time

Z Liu, J Wang, T Dao, T Zhou, B Yuan… - International …, 2023 - proceedings.mlr.press
Large language models (LLMs) with hundreds of billions of parameters have sparked a new
wave of exciting AI applications. However, they are computationally expensive at inference …
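
To make the title's "contextual sparsity" concrete, here is a minimal sketch of the idea: a lightweight predictor guesses, per input, which MLP neurons will matter and only those rows and columns are touched. All names, shapes, and the top-k predictor below are illustrative assumptions, not the paper's actual implementation.

import numpy as np

def contextual_sparse_mlp(x, W1, b1, W2, predictor, k=32):
    # Hypothetical sketch of contextual sparsity, not the Deja Vu code:
    # a predictor scores each hidden neuron's relevance for this input x,
    # and only the top-k neurons are computed.
    scores = predictor(x)                                 # (d_ff,) relevance per neuron
    active = np.argsort(scores)[-k:]                      # indices of the k predicted-active neurons
    h = np.maximum(x @ W1[:, active] + b1[active], 0.0)   # compute only the active neurons
    return h @ W2[active, :]                              # project back with matching rows

# toy usage with made-up dimensions
rng = np.random.default_rng(0)
d_model, d_ff = 64, 256
W1, b1, W2 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff), rng.normal(size=(d_ff, d_model))
x = rng.normal(size=d_model)
y = contextual_sparse_mlp(x, W1, b1, W2, predictor=lambda v: np.abs(v @ W1), k=32)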

Scatterbrain: Unifying sparse and low-rank attention

B Chen, T Dao, E Winsor, Z Song… - Advances in Neural …, 2021 - proceedings.neurips.cc
Recent advances in efficient Transformers have exploited either the sparsity or low-rank
properties of attention matrices to reduce the computational and memory bottlenecks of …
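
As a hedged illustration of combining the two structures the snippet mentions, the sketch below approximates an attention matrix as a low-rank term plus a sparse correction on the largest residual entries. The rank r and sparsity budget s are made-up parameters, and Scatterbrain itself builds the two parts from kernel features and locality-sensitive hashing rather than an explicit SVD.

import numpy as np

def sparse_plus_lowrank(A, r=8, s=256):
    # Illustrative sparse + low-rank approximation of a dense attention matrix A.
    U, S, Vt = np.linalg.svd(A, full_matrices=False)
    low_rank = (U[:, :r] * S[:r]) @ Vt[:r]           # rank-r component
    residual = A - low_rank
    idx = np.argpartition(np.abs(residual).ravel(), -s)[-s:]
    sparse = np.zeros_like(A)
    sparse.ravel()[idx] = residual.ravel()[idx]      # keep the s largest residual entries
    return low_rank + sparse

A = np.random.rand(64, 64)
A /= A.sum(axis=1, keepdims=True)                    # row-stochastic "attention" matrix
rel_err = np.linalg.norm(A - sparse_plus_lowrank(A)) / np.linalg.norm(A)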

The lazy neuron phenomenon: On emergence of activation sparsity in transformers

Z Li, C You, S Bhojanapalli, D Li, AS Rawat… - arXiv preprint arXiv …, 2022 - arxiv.org
This paper studies the curious phenomenon that machine learning models with Transformer
architectures have sparse activation maps. By activation map we refer to the …
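
A quick way to see the kind of sparsity being measured is to count the fraction of post-activation MLP outputs that are exactly zero over a batch of inputs. The ReLU toy setting below is an assumption for illustration; the paper studies trained Transformer MLP blocks.

import numpy as np

def activation_sparsity(X, W, b):
    # Fraction of post-ReLU activations that are exactly zero
    # (toy illustration of the "lazy neuron" measurement, not the paper's code).
    H = np.maximum(X @ W + b, 0.0)       # (batch, d_ff) activation map
    return float((H == 0).mean())

rng = np.random.default_rng(0)
X = rng.normal(size=(32, 128))           # batch of token representations
W = rng.normal(size=(128, 512))
b = rng.normal(size=512)
print(f"zero fraction: {activation_sparsity(X, W, b):.2%}")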

Pixelated butterfly: Simple and efficient sparse training for neural network models

T Dao, B Chen, K Liang, J Yang, Z Song… - arXiv preprint arXiv …, 2021 - arxiv.org
Overparameterized neural networks generalize well but are expensive to train. Ideally, one
would like to reduce their computational cost while retaining their generalization benefits …

Sparse spiking gradient descent

N Perez-Nieves, D Goodman - Advances in Neural …, 2021 - proceedings.neurips.cc
There is increasing interest in emulating Spiking Neural Networks (SNNs) on
neuromorphic computing devices due to their low energy consumption. Recent advances …

Bypass exponential time preprocessing: Fast neural network training via weight-data correlation preprocessing

J Alman, Z Song, R Zhang… - Advances in Neural …, 2024 - proceedings.neurips.cc
Over the last decade, deep neural networks have transformed our society, and they are
already widely applied in various machine learning applications. State-of-the-art deep …

Does preprocessing help training over-parameterized neural networks?

Z Song, S Yang, R Zhang - Advances in Neural Information …, 2021 - proceedings.neurips.cc
Deep neural networks have achieved impressive performance in many areas. Designing a
fast and provable method for training neural networks is a fundamental question in machine …

A survey on large-scale machine learning

M Wang, W Fu, X He, S Hao… - IEEE Transactions on …, 2020 - ieeexplore.ieee.org
Machine learning can provide deep insights into data, allowing machines to make high-
quality predictions, and has been widely used in real-world applications, such as text …

Training multi-layer over-parametrized neural network in subquadratic time

Z Song, L Zhang, R Zhang - arXiv preprint arXiv:2112.07628, 2021 - arxiv.org
We consider the problem of training a multi-layer over-parametrized neural network to
minimize the empirical risk induced by a loss function. In the typical setting of over …
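
For concreteness, the empirical risk referred to here is the standard average loss over the n training pairs, which the network weights W are trained to minimize; the notation below is assumed rather than taken from the paper:

\min_{W} \; L(W) \;=\; \frac{1}{n} \sum_{i=1}^{n} \ell\big(f_{W}(x_i),\, y_i\big)

In the over-parametrized regime the paper targets, the network width is much larger than n, which is what makes per-iteration cost, rather than convergence, the bottleneck.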

An efficient statistical-based gradient compression technique for distributed training systems

AM Abdelmoniem, A Elzanaty… - Proceedings of …, 2021 - proceedings.mlsys.org
The recent many-fold increase in the size of deep neural networks makes efficient
distributed training challenging. Many proposals exploit the compressibility of the gradients …
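
As a generic illustration of exploiting gradient compressibility, the sketch below performs top-k gradient sparsification with local error feedback: only the largest-magnitude entries are communicated, and the dropped mass is folded into the next step. The specific statistical threshold estimation this paper proposes is not reproduced here, and the function name and k_frac parameter are made up.

import numpy as np

def topk_compress(grad, residual, k_frac=0.01):
    # Generic top-k sparsification sketch, not the paper's estimator:
    # keep only the k largest-magnitude entries, carry the rest as residual.
    g = grad + residual                          # fold in previously dropped gradient mass
    k = max(1, int(k_frac * g.size))
    idx = np.argpartition(np.abs(g), -k)[-k:]    # indices of the k largest entries
    sparse = np.zeros_like(g)
    sparse[idx] = g[idx]                         # values actually communicated
    return sparse, g - sparse                    # (compressed gradient, new residual)

grad = np.random.randn(10_000)
residual = np.zeros_like(grad)
sparse, residual = topk_compress(grad, residual, k_frac=0.01)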