Efficiently scaling transformer inference

R Pope, S Douglas, A Chowdhery… - Proceedings of …, 2023 - proceedings.mlsys.org
We study the problem of efficient generative inference for Transformer models, in one of its
most challenging settings: large deep models, with tight latency targets and long sequence …

Machine translation systems based on classical-statistical-deep-learning approaches

S Sharma, M Diwakar, P Singh, V Singh, S Kadry… - Electronics, 2023 - mdpi.com
Over recent years, machine translation has achieved astounding accomplishments. The importance of machine translation has become more evident with the need to understand the information available …

Sparse is enough in scaling transformers

S Jaszczur, A Chowdhery… - Advances in …, 2021 - proceedings.neurips.cc
Large Transformer models yield impressive results on many tasks, but are expensive to train or even fine-tune, and so slow at decoding that their use and study become out of …

Exploring lottery ticket hypothesis in spiking neural networks

Y Kim, Y Li, H Park, Y Venkatesha, R Yin… - European Conference on …, 2022 - Springer
Spiking Neural Networks (SNNs) have recently emerged as a new generation of low-power deep neural networks, which are suitable for implementation on low-power …

Losing Heads in the Lottery: Pruning Transformer Attention in Neural Machine Translation

M Behnke, K Heafield - The 2020 Conference on Empirical …, 2020 - research.ed.ac.uk
The attention mechanism is the crucial component of the transformer architecture. Recent
research shows that most attention heads are not confident in their decisions and can be …
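
The snippet gestures at the mechanism: heads whose attention distributions are unconfident can be removed. As a rough sketch of that idea (not the paper's exact criterion), one can score each head by how peaked its attention weights are and keep only the top-scoring heads; the `head_confidence` proxy, `keep_ratio`, and the toy shapes below are all illustrative assumptions.

```python
import numpy as np

def head_confidence(attn):
    """Mean of the per-query max attention weight for one head.

    attn: (batch, query_len, key_len) softmax weights of a single head.
    A head spreading mass uniformly scores near 1/key_len; a head that
    commits to one key scores near 1.0.
    """
    return attn.max(axis=-1).mean()

def prune_mask(per_head_attn, keep_ratio=0.5):
    """Keep the most 'confident' heads, mask the rest.

    per_head_attn: list of (batch, q, k) arrays, one per head.
    Returns a 0/1 mask over heads; multiplying each head's output by
    its mask entry removes pruned heads from the mixture.
    """
    scores = np.array([head_confidence(a) for a in per_head_attn])
    k = max(1, int(len(scores) * keep_ratio))
    keep = np.argsort(scores)[-k:]          # indices of the top-k heads
    mask = np.zeros(len(scores))
    mask[keep] = 1.0
    return mask

# Toy example: 8 heads with attention of varying peakedness.
rng = np.random.default_rng(0)
heads = [rng.dirichlet(np.ones(16) * c, size=(4, 10))
         for c in np.linspace(0.1, 5.0, 8)]
print(prune_mask(heads, keep_ratio=0.5))
```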

Gradient flow in sparse neural networks and how lottery tickets win

U Evci, Y Ioannou, C Keskin, Y Dauphin - Proceedings of the AAAI …, 2022 - ojs.aaai.org
Sparse Neural Networks (NNs) can match the generalization of dense NNs using a
fraction of the compute/storage for inference, and have the potential to enable efficient …

Super tickets in pre-trained language models: From model compression to improving generalization

C Liang, S Zuo, M Chen, H Jiang, X Liu, P He… - arXiv preprint arXiv …, 2021 - arxiv.org
The Lottery Ticket Hypothesis suggests that an over-parametrized network consists of "lottery tickets", and training a certain collection of them (i.e., a subnetwork) can match the …
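
For readers unfamiliar with how such tickets are found, the standard recipe is one or more rounds of magnitude pruning after training, followed by resetting the surviving weights to their initial values. Below is a minimal sketch of the global magnitude-masking step; the layer names, sparsity level, and dict-of-arrays interface are assumptions for illustration, not this paper's code.

```python
import numpy as np

def magnitude_masks(weights, sparsity=0.8):
    """One round of global magnitude pruning.

    weights: dict of name -> weight array from a trained network.
    Keeps the largest-|w| fraction (1 - sparsity) of all weights,
    returning a binary mask per layer. Iterating train -> mask ->
    reset-to-init is the classic way 'winning tickets' are found.
    """
    all_w = np.concatenate([np.abs(w).ravel() for w in weights.values()])
    threshold = np.quantile(all_w, sparsity)   # single global cutoff
    return {name: (np.abs(w) >= threshold).astype(w.dtype)
            for name, w in weights.items()}

# Toy usage: two "layers" of a trained model.
rng = np.random.default_rng(1)
trained = {"fc1": rng.normal(size=(64, 64)),
           "fc2": rng.normal(size=(64, 10))}
masks = magnitude_masks(trained, sparsity=0.8)
kept = sum(m.sum() for m in masks.values())
total = sum(m.size for m in masks.values())
print(f"kept {kept:.0f}/{total} weights ({kept/total:.0%})")
```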

Differentiable subset pruning of transformer heads

J Li, R Cotterell, M Sachan - Transactions of the Association for …, 2021 - direct.mit.edu
Multi-head attention, a collection of several attention mechanisms that independently attend
to different parts of the input, is the key ingredient in the Transformer. Recent work has …
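
Since the snippet defines multi-head attention, a compact sketch may make the pruning target concrete. The per-head scalar `gates` below stand in for the differentiable relaxation this line of work relies on: a soft gate keeps head selection trainable, and rounding it to 0/1 prunes heads outright. All shapes and names are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def gated_multi_head_attention(X, Wq, Wk, Wv, Wo, gates):
    """Multi-head attention with a scalar gate per head.

    X: (seq, d_model); Wq/Wk/Wv: (heads, d_model, d_head);
    Wo: (heads * d_head, d_model); gates: (heads,) in [0, 1].
    Each head attends independently; gates scale head outputs, so a
    hard 0/1 gate removes heads and a soft gate keeps pruning
    differentiable.
    """
    outs = []
    for h in range(Wq.shape[0]):
        Q, K, V = X @ Wq[h], X @ Wk[h], X @ Wv[h]
        A = softmax(Q @ K.T / np.sqrt(K.shape[-1]))  # (seq, seq) weights
        outs.append(gates[h] * (A @ V))              # gated head output
    return np.concatenate(outs, axis=-1) @ Wo

# Toy shapes: 4 heads, model width 32, head width 8.
rng = np.random.default_rng(2)
X = rng.normal(size=(5, 32))
Wq, Wk, Wv = (rng.normal(size=(4, 32, 8)) * 0.1 for _ in range(3))
Wo = rng.normal(size=(32, 32)) * 0.1
out = gated_multi_head_attention(X, Wq, Wk, Wv, Wo,
                                 gates=np.array([1., 1., 0., 1.]))
print(out.shape)  # (5, 32); the third head contributes nothing
```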

The lottery ticket hypothesis for object recognition

S Girish, SR Maiya, K Gupta, H Chen… - Proceedings of the …, 2021 - openaccess.thecvf.com
Recognition tasks, such as object recognition and keypoint estimation, have seen
widespread adoption in recent years. Most state-of-the-art methods for these tasks use deep …

Small pre-trained language models can be fine-tuned as large models via over-parameterization

ZF Gao, K Zhou, P Liu, WX Zhao… - Proceedings of the 61st …, 2023 - aclanthology.org
By scaling the model size, large pre-trained language models (PLMs) have shown
remarkable performance in various natural language processing tasks, mostly outperforming …