Cramming: Training a Language Model on a Single GPU in One Day

J Geiping, T Goldstein - International Conference on …, 2023 - proceedings.mlr.press
Recent trends in language modeling have focused on increasing performance through
scaling, and have resulted in an environment where training language models is out of …

Improving transformers with dynamically composable multi-head attention

D Xiao, Q Meng, S Li, X Yuan - arXiv preprint arXiv:2405.08553, 2024 - arxiv.org
Multi-Head Attention (MHA) is a key component of the Transformer. In MHA, attention heads
work independently, causing problems such as the low-rank bottleneck of attention score …
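
For reference alongside this entry, below is a minimal NumPy sketch of standard multi-head attention that makes the "heads work independently" point concrete: each head's attention matrix is computed from its own query/key slice only, with no cross-head interaction before the output projection. The function name, shapes, and weights are illustrative assumptions for this sketch, not the paper's implementation.

    import numpy as np

    def softmax(x, axis=-1):
        x = x - x.max(axis=axis, keepdims=True)
        e = np.exp(x)
        return e / e.sum(axis=axis, keepdims=True)

    def multi_head_attention(X, W_q, W_k, W_v, W_o, n_heads):
        """Standard MHA sketch. X: (seq_len, d_model); all W_*: (d_model, d_model)."""
        seq_len, d_model = X.shape
        d_head = d_model // n_heads
        Q, K, V = X @ W_q, X @ W_k, X @ W_v
        outputs = []
        for h in range(n_heads):
            s = slice(h * d_head, (h + 1) * d_head)
            # Each head's scores depend only on its own Q/K slice: heads are
            # independent until the concatenation and final projection W_o.
            scores = softmax(Q[:, s] @ K[:, s].T / np.sqrt(d_head))
            outputs.append(scores @ V[:, s])
        return np.concatenate(outputs, axis=-1) @ W_o

    if __name__ == "__main__":
        rng = np.random.default_rng(0)
        d_model, n_heads, seq_len = 16, 4, 5
        X = rng.normal(size=(seq_len, d_model))
        Ws = [rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(4)]
        print(multi_head_attention(X, *Ws, n_heads=n_heads).shape)  # (5, 16)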

T3SRS: Tensor Train Transformer for compressing sequential recommender systems

H Li, J Zhao, H Huo, S Fang, J Chen, L Yao… - Expert Systems with …, 2024 - Elsevier
In recent years, attention mechanisms have gained popularity in sequential recommender
systems (SRSs) due to their efficiency in capturing dynamic user preferences. However, over …

Geometry is All You Need: A Unified Taxonomy of Matrix and Tensor Factorization for Compression of Generative Language Models

M Xu, S Sharmin, DP Mandic - arXiv preprint arXiv:2410.03040, 2024 - arxiv.org
Matrix- and tensor-guided parametrization for Natural Language Processing (NLP) models is
fundamentally useful for improving the model's efficiency. However, the …

Vision Transformer with Irregular Attention

D Ermilov, N Kozyrskiy, I Vorona, AH Phan… - openreview.net
Compression of Transformers is a natural request that arose in the computer vision
community. Apart from quantization, which relies heavily on hardware, sparsification is another …