Deja vu: Contextual sparsity for efficient LLMs at inference time
Large language models (LLMs) with hundreds of billions of parameters have sparked a new
wave of exciting AI applications. However, they are computationally expensive at inference …
Scatterbrain: Unifying sparse and low-rank attention
Recent advances in efficient Transformers have exploited either the sparsity or low-rank
properties of attention matrices to reduce the computational and memory bottlenecks of …
The lazy neuron phenomenon: On emergence of activation sparsity in transformers
This paper studies the curious phenomenon that the activation maps of machine learning models
with Transformer architectures are sparse. By activation map we refer to the …
Pixelated butterfly: Simple and efficient sparse training for neural network models
Overparameterized neural networks generalize well but are expensive to train. Ideally, one
would like to reduce their computational cost while retaining their generalization benefits …
Sparse spiking gradient descent
There is an increasing interest in emulating Spiking Neural Networks (SNNs) on
neuromorphic computing devices due to their low energy consumption. Recent advances …
Bypass exponential time preprocessing: Fast neural network training via weight-data correlation preprocessing
Over the last decade, deep neural networks have transformed our society, and they are
already widely applied in various machine learning applications. State-of-the-art deep …
Does preprocessing help training over-parameterized neural networks?
Deep neural networks have achieved impressive performance in many areas. Designing a
fast and provable method for training neural networks is a fundamental question in machine …
A survey on large-scale machine learning
Machine learning can provide deep insights into data, allowing machines to make high-
quality predictions, and it has been widely used in real-world applications, such as text …
Training multi-layer over-parametrized neural network in subquadratic time
We consider the problem of training a multi-layer over-parametrized neural network to
minimize the empirical risk induced by a loss function. In the typical setting of over …
An efficient statistical-based gradient compression technique for distributed training systems
The recent many-fold increase in the size of deep neural networks makes efficient
distributed training challenging. Many proposals exploit the compressibility of the gradients …