Transformers are SSMs: Generalized models and efficient algorithms through structured state space duality

T Dao, A Gu - arXiv preprint arXiv:2405.21060, 2024 - arxiv.org
While Transformers have been the main architecture behind deep learning's success in
language modeling, state-space models (SSMs) such as Mamba have recently been shown …
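
The "duality" in the title can be made concrete with a small sketch: a scalar-gated linear SSM computed once as a sequential scan and once as a masked, attention-like matrix multiplication that yields the same outputs. The shapes, the scalar per-step gate, and the absence of any normalization below are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

# Illustrative sketch only (not the paper's code): a scalar-gated linear SSM
# computed two ways, as a sequential scan and as one masked matrix multiply.
T, N = 6, 4                       # sequence length, state dimension (assumed)
rng = np.random.default_rng(0)
a = rng.uniform(0.5, 1.0, T)      # per-step decay a_t
B = rng.standard_normal((T, N))   # input projections B_t
C = rng.standard_normal((T, N))   # output projections C_t
x = rng.standard_normal(T)        # scalar input stream

# Recurrent (SSM) form: h_t = a_t h_{t-1} + B_t x_t,  y_t = C_t^T h_t.
h = np.zeros(N)
y_rec = np.empty(T)
for t in range(T):
    h = a[t] * h + B[t] * x[t]
    y_rec[t] = C[t] @ h

# Dual (attention-like) form: y = ((C B^T) * L) x, with a semiseparable mask
# L[t, s] = a_{s+1} * ... * a_t for s <= t, and 0 otherwise.
L = np.zeros((T, T))
for t in range(T):
    for s in range(t + 1):
        L[t, s] = np.prod(a[s + 1 : t + 1])
y_dual = ((C @ B.T) * L) @ x

assert np.allclose(y_rec, y_dual)
```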

Gated linear attention transformers with hardware-efficient training

S Yang, B Wang, Y Shen, R Panda, Y Kim - arXiv preprint arXiv …, 2023 - arxiv.org
Transformers with linear attention allow for efficient parallel training but can simultaneously
be formulated as an RNN with 2D (matrix-valued) hidden states, thus enjoying linear-time …
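
The RNN view mentioned in the snippet can be sketched directly: causal, unnormalized linear attention computed once in parallel and once as a recurrence over a matrix-valued state. The data-dependent gating that gives GLA its name is omitted, so treat this as a generic linear-attention sketch under those assumptions.

```python
import numpy as np

# Generic sketch of the parallel/recurrent equivalence of linear attention;
# the gating of GLA itself is omitted (assumption-level illustration).
T, d = 5, 3
rng = np.random.default_rng(1)
Q = rng.standard_normal((T, d))
K = rng.standard_normal((T, d))
V = rng.standard_normal((T, d))

# Parallel form: causal, unnormalized linear attention, O(T^2) in sequence length.
mask = np.tril(np.ones((T, T)))
y_parallel = (mask * (Q @ K.T)) @ V

# Recurrent form: a 2D (matrix-valued) hidden state S, updated in O(T).
S = np.zeros((d, d))
y_recurrent = np.empty((T, d))
for t in range(T):
    S = S + np.outer(K[t], V[t])   # S_t = S_{t-1} + k_t v_t^T
    y_recurrent[t] = S.T @ Q[t]    # o_t = S_t^T q_t

assert np.allclose(y_parallel, y_recurrent)
```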

xLSTM: Extended long short-term memory

M Beck, K Pöppel, M Spanring, A Auer… - arXiv preprint arXiv …, 2024 - arxiv.org
In the 1990s, the constant error carousel and gating were introduced as the central ideas of
the Long Short-Term Memory (LSTM). Since then, LSTMs have stood the test of time and …

Learning to (learn at test time): RNNs with expressive hidden states

Y Sun, X Li, K Dalal, J Xu, A Vikram, G Zhang… - arXiv preprint arXiv …, 2024 - arxiv.org
Self-attention performs well in long context but has quadratic complexity. Existing RNN
layers have linear complexity, but their performance in long context is limited by the …
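
One way to read "expressive hidden states" is sketched below, as an assumption rather than the paper's exact layer: the hidden state is the weight matrix of a tiny model, updated by one gradient step per token on a self-supervised reconstruction loss at test time.

```python
import numpy as np

# Assumption-level sketch, not the paper's exact layer: the hidden state is the
# weight matrix W of a tiny linear model, and each incoming token triggers one
# gradient step on a self-supervised reconstruction loss 0.5 * ||W x - x||^2.
T, d, lr = 8, 4, 0.1
rng = np.random.default_rng(2)
X = rng.standard_normal((T, d))   # token stream

W = np.zeros((d, d))              # "hidden state" = model weights
outputs = np.empty((T, d))
for t in range(T):
    x = X[t]
    grad = np.outer(W @ x - x, x) # d/dW of 0.5 * ||W x - x||^2
    W = W - lr * grad             # test-time ("inner loop") update
    outputs[t] = W @ x            # layer output for token t
```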

ZigMa: A DiT-style zigzag Mamba diffusion model

VT Hu, SA Baumann, M Gui, O Grebenkova… - … on Computer Vision, 2024 - Springer
Diffusion models have long been plagued by scalability and quadratic complexity issues,
especially within Transformer-based structures. In this study, we aim to leverage the long …

Scaling laws for precision

T Kumar, Z Ankner, BF Spector, B Bordelon… - arXiv preprint arXiv …, 2024 - arxiv.org
Low-precision training and inference affect both the quality and cost of language models, but
current scaling laws do not account for this. In this work, we devise "precision-aware" scaling …

LinFusion: 1 GPU, 1 minute, 16K image

S Liu, W Yu, Z Tan, X Wang - arXiv preprint arXiv:2409.02097, 2024 - arxiv.org
Modern diffusion models, particularly those utilizing a Transformer-based UNet for
denoising, rely heavily on self-attention operations to manage complex spatial relationships …

Mamba or RWKV: Exploring high-quality and high-efficiency Segment Anything Model

H Yuan, X Li, L Qi, T Zhang, MH Yang, S Yan… - arXiv preprint arXiv …, 2024 - arxiv.org
Transformer-based segmentation methods face the challenge of efficient inference when
dealing with high-resolution images. Recently, several linear attention architectures, such as …

Autoregressive pretraining with Mamba in vision

S Ren, X Li, H Tu, F Wang, F Shu, L Zhang… - arXiv preprint arXiv …, 2024 - arxiv.org
The vision community has started to adopt the recently developed state space model,
Mamba, as a new backbone for a range of tasks. This paper shows that Mamba's visual …

PointRWKV: Efficient RWKV-like model for hierarchical point cloud learning

Q He, J Zhang, J Peng, H He, X Li, Y Wang… - arXiv preprint arXiv …, 2024 - arxiv.org
Transformers have revolutionized the point cloud learning task, but their quadratic complexity
hinders extension to long sequences and imposes a burden on limited computational …