Transformers are SSMs: Generalized models and efficient algorithms through structured state space duality

T Dao, A Gu - arXiv preprint arXiv:2405.21060, 2024 - arxiv.org
While Transformers have been the main architecture behind deep learning's success in
language modeling, state-space models (SSMs) such as Mamba have recently been shown …
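
The "duality" in the title refers to viewing one computation two ways: as a linear state-space recurrence and as multiplication by a structured (1-semiseparable) lower-triangular matrix. Below is a minimal numpy sketch of that correspondence for a single scalar channel; the variable names and the scalar per-step decay are illustrative simplifications, not the paper's implementation.

```python
# Minimal sketch (not the paper's algorithm): a 1-D selective SSM computed two ways,
# (1) as a linear recurrence and (2) as multiplication by a lower-triangular matrix M
# with M[t, s] = (a[s+1] * ... * a[t]) * (C[t] . B[s]).  Names are illustrative.
import numpy as np

T, N = 6, 4                      # sequence length, state size
rng = np.random.default_rng(0)
x = rng.standard_normal(T)       # scalar input channel
a = rng.uniform(0.5, 1.0, T)     # per-step scalar decay A_t (simplification)
B = rng.standard_normal((T, N))  # input projections B_t
C = rng.standard_normal((T, N))  # output projections C_t

# (1) recurrent form: h_t = a_t * h_{t-1} + B_t * x_t,  y_t = C_t . h_t
h = np.zeros(N)
y_rec = np.zeros(T)
for t in range(T):
    h = a[t] * h + B[t] * x[t]
    y_rec[t] = C[t] @ h

# (2) matrix ("dual") form: y = M @ x with a lower-triangular, 1-semiseparable M
M = np.zeros((T, T))
for t in range(T):
    for s in range(t + 1):
        decay = np.prod(a[s + 1:t + 1])   # a_{s+1} * ... * a_t (empty product = 1)
        M[t, s] = decay * (C[t] @ B[s])
y_mat = M @ x

assert np.allclose(y_rec, y_mat)          # both views produce the same output
```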

Gated linear attention transformers with hardware-efficient training

S Yang, B Wang, Y Shen, R Panda, Y Kim - arXiv preprint arXiv …, 2023 - arxiv.org
Transformers with linear attention allow for efficient parallel training but can simultaneously
be formulated as an RNN with 2D (matrix-valued) hidden states, thus enjoying linear-time …
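
The snippet's point, that linear attention can be read as an RNN with a matrix-valued hidden state, is easy to see in code. The sketch below uses a simplified scalar forget gate per step to stand in for the gating; it illustrates the recurrence only, not the paper's hardware-efficient training kernel.

```python
# Illustrative sketch: linear attention as an RNN whose hidden state S is a d x d matrix,
# updated once per token; a gated variant adds a data-dependent decay g_t on the state.
# Per-token cost is O(d^2), independent of sequence length.
import numpy as np

T, d = 8, 16
rng = np.random.default_rng(1)
q = rng.standard_normal((T, d))
k = rng.standard_normal((T, d))
v = rng.standard_normal((T, d))
g = rng.uniform(0.9, 1.0, T)     # scalar forget gate per step (simplification)

S = np.zeros((d, d))             # matrix-valued hidden state
out = np.zeros((T, d))
for t in range(T):
    S = g[t] * S + np.outer(k[t], v[t])   # state update
    out[t] = q[t] @ S                     # read-out: o_t = q_t^T S_t

# With g == 1 this is exactly causal linear attention:
# o_t = sum_{s <= t} (q_t . k_s) v_s, i.e. attention without the softmax.
```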

The Mamba in the Llama: Distilling and accelerating hybrid models

J Wang, D Paliotta, A May, A Rush… - Advances in Neural …, 2025 - proceedings.neurips.cc
Linear RNN architectures, like Mamba, can be competitive with Transformer models in
language modeling while having advantageous deployment characteristics. Given the focus …

Hydra: Bidirectional state space models through generalized matrix mixers

S Hwang, AS Lahoti, R Puduppully… - Advances in Neural …, 2025 - proceedings.neurips.cc
A wide array of sequence models are built on a framework modeled after Transformers,
comprising alternating sequence mixer and channel mixer layers. This paper studies a …
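
For reference, the "alternating sequence mixer and channel mixer" layout the snippet describes can be sketched as follows; both mixers here are deliberately generic placeholders rather than the matrix mixers studied in the paper.

```python
# Skeletal sketch of the alternating sequence-mixer / channel-mixer block layout
# (placeholder components only).
import numpy as np

def sequence_mixer(x):
    # Mixes information across the length (token) axis; a stand-in averaging step.
    return x + x.mean(axis=0, keepdims=True)

def channel_mixer(x, W1, W2):
    # Mixes information across the feature axis, position-wise (an MLP stand-in).
    return x + np.maximum(x @ W1, 0.0) @ W2

T, d = 10, 32
rng = np.random.default_rng(2)
x = rng.standard_normal((T, d))
W1 = rng.standard_normal((d, 4 * d)) / d ** 0.5
W2 = rng.standard_normal((4 * d, d)) / (4 * d) ** 0.5

for _ in range(2):               # a stack of blocks, each alternating the two mixers
    x = sequence_mixer(x)
    x = channel_mixer(x, W1, W2)
```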

Samba: Simple hybrid state space models for efficient unlimited context language modeling

L Ren, Y Liu, Y Lu, Y Shen, C Liang… - arXiv preprint arXiv …, 2024 - arxiv.org
Efficiently modeling sequences with infinite context length has been a long-standing
problem. Past works suffer from either the quadratic computation complexity or the limited …

MetaLA: Unified optimal linear approximation to softmax attention map

Y Chou, M Yao, K Wang, Y Pan… - Advances in …, 2025 - proceedings.neurips.cc
Various linear complexity models, such as Linear Transformer (LinFormer), State Space
Model (SSM), and Linear RNN (LinRNN), have been proposed to replace the conventional …

How to train long-context language models (effectively)

T Gao, A Wettig, H Yen, D Chen - arXiv preprint arXiv:2410.02660, 2024 - arxiv.org
We study continued training and supervised fine-tuning (SFT) of a language model (LM) to
make effective use of long-context information. We first establish a reliable evaluation …

Venturing into uncharted waters: The navigation compass from Transformer to Mamba

Y Zou, Y Chen, Z Li, L Zhang, H Zhao - arXiv preprint arXiv:2406.16722, 2024 - arxiv.org
Transformer, a deep neural network architecture, has long dominated the field of natural
language processing and beyond. Nevertheless, the recent introduction of Mamba …

Orchid: Flexible and data-dependent convolution for sequence modeling

M Karami, A Ghodsi - Advances in Neural Information …, 2025 - proceedings.neurips.cc
In the rapidly evolving field of deep learning, the demand for models that are both
expressive and computationally efficient has never been more critical. This paper introduces …

Just read twice: closing the recall gap for recurrent language models

S Arora, A Timalsina, A Singhal, B Spector… - arXiv preprint arXiv …, 2024 - arxiv.org
Recurrent large language models that compete with Transformers in language modeling
perplexity are emerging at a rapid rate (e.g., Mamba, RWKV). Excitingly, these architectures …