Mamba-360: Survey of state space models as transformer alternative for long sequence modelling: Methods, applications, and challenges

BN Patro, VS Agneeswaran - arXiv preprint arXiv:2404.16112, 2024 - arxiv.org
Sequence modeling is a crucial area across various domains, including Natural Language
Processing (NLP), speech recognition, time series forecasting, music generation, and …

From large language models to large multimodal models: A literature review

D Huang, C Yan, Q Li, X Peng - Applied Sciences, 2024 - mdpi.com
With the deepening of research on Large Language Models (LLMs), significant progress has
been made in recent years on the development of Large Multimodal Models (LMMs), which …

Transformers are SSMs: Generalized models and efficient algorithms through structured state space duality

T Dao, A Gu - arXiv preprint arXiv:2405.21060, 2024 - arxiv.org
While Transformers have been the main architecture behind deep learning's success in
language modeling, state-space models (SSMs) such as Mamba have recently been shown …
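
As background for this entry, a minimal sketch of a discrete linear state-space recurrence, the basic form behind models like Mamba (Python with NumPy; an illustration only, not the paper's structured state space duality algorithm):

import numpy as np

def ssm_scan(A, B, C, x):
    """Minimal diagonal linear SSM: h_t = A*h_{t-1} + B*x_t, y_t = C . h_t.

    A, B, C: (d_state,) per-channel parameters (illustrative shapes).
    x: (T,) scalar input sequence. Returns y: (T,) scalar outputs.
    """
    h = np.zeros(A.shape[0])
    y = np.empty(x.shape[0])
    for t, x_t in enumerate(x):
        h = A * h + B * x_t   # linear recurrent state update
        y[t] = C @ h          # read a scalar out of the hidden state
    return y

# Toy usage: 4-dimensional state, length-8 input.
rng = np.random.default_rng(0)
A = np.full(4, 0.9)           # stable decay per state dimension
B, C = rng.normal(size=4), rng.normal(size=4)
print(ssm_scan(A, B, C, rng.normal(size=8)))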

xLSTM: Extended long short-term memory

M Beck, K Pöppel, M Spanring, A Auer… - arXiv preprint arXiv …, 2024 - arxiv.org
In the 1990s, the constant error carousel and gating were introduced as the central ideas of
the Long Short-Term Memory (LSTM). Since then, LSTMs have stood the test of time and …
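
For context, a minimal sketch of the standard LSTM cell step that this snippet alludes to; the additive cell-state path is the constant error carousel and the sigmoids are the gates (illustration only; xLSTM's own extensions are not shown):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_cell(x, h_prev, c_prev, W, U, b):
    """One step of a classic LSTM. W: (4d, d_in), U: (4d, d), b: (4d,),
    stacked for the input (i), forget (f), output (o) gates and candidate (g)."""
    z = W @ x + U @ h_prev + b
    i, f, o, g = np.split(z, 4)
    i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)
    g = np.tanh(g)
    c = f * c_prev + i * g     # additive cell-state update: the constant error carousel
    h = o * np.tanh(c)         # gated output / new hidden state
    return h, c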

Learning to (learn at test time): RNNs with expressive hidden states

Y Sun, X Li, K Dalal, J Xu, A Vikram, G Zhang… - arXiv preprint arXiv …, 2024 - arxiv.org
Self-attention performs well in long context but has quadratic complexity. Existing RNN
layers have linear complexity, but their performance in long context is limited by the …
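
The quadratic-versus-linear contrast in this snippet can be made concrete with a rough per-layer cost count (a hypothetical back-of-the-envelope helper, not code from the paper):

def attention_cost(seq_len, d_model):
    """Rough FLOP count for one self-attention layer: every token attends to every token."""
    return 2 * seq_len * seq_len * d_model   # QK^T scores plus attention-weighted values

def recurrent_cost(seq_len, d_state):
    """Rough FLOP count for one linear recurrent layer: a fixed-size state update per token."""
    return seq_len * d_state

for T in (1_000, 10_000, 100_000):
    print(T, attention_cost(T, 1024), recurrent_cost(T, 1024))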

An empirical study of Mamba-based language models

R Waleffe, W Byeon, D Riach, B Norick… - arXiv preprint arXiv …, 2024 - arxiv.org
Selective state-space models (SSMs) like Mamba overcome some of the shortcomings of
Transformers, such as quadratic computational complexity with sequence length and large …

The Mamba in the Llama: Distilling and accelerating hybrid models

J Wang, D Paliotta, A May, A Rush… - Advances in Neural …, 2025 - proceedings.neurips.cc
Linear RNN architectures, like Mamba, can be competitive with Transformer models in
language modeling while having advantageous deployment characteristics. Given the focus …
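
"Distilling" here refers broadly to training a student model to match a teacher's output distribution; a generic sketch of such a loss (standard knowledge distillation in NumPy, not the paper's specific Transformer-to-Mamba procedure):

import numpy as np

def softmax(logits, temperature=1.0):
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Mean KL(teacher || student) over the vocabulary axis."""
    p_t = softmax(teacher_logits, temperature)
    p_s = softmax(student_logits, temperature)
    return float(np.mean(np.sum(p_t * (np.log(p_t + 1e-9) - np.log(p_s + 1e-9)), axis=-1)))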

Recurrent neural networks: vanishing and exploding gradients are not the end of the story

N Zucchet, A Orvieto - Advances in Neural Information …, 2025 - proceedings.neurips.cc
Recurrent neural networks (RNNs) notoriously struggle to learn long-term memories,
primarily due to vanishing and exploding gradients. The recent success of state-space …
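
The mechanism named here fits in one line: backpropagation through time multiplies per-step Jacobians, so for hidden states $h_t = f(h_{t-1}, x_t)$,

\[
\frac{\partial \mathcal{L}}{\partial h_1}
  = \frac{\partial \mathcal{L}}{\partial h_T}\,
    \prod_{t=2}^{T} \frac{\partial h_t}{\partial h_{t-1}},
\]

and when the Jacobian norms sit consistently below (above) one, the product vanishes (explodes) as T grows.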

MambaMixer: Efficient selective state space models with dual token and channel selection

A Behrouz, M Santacatterina, R Zabih - arXiv preprint arXiv:2403.19888, 2024 - arxiv.org
Recent advances in deep learning have mainly relied on Transformers due to their data
dependency and ability to learn at scale. The attention module in these architectures …

Zamba: A compact 7B SSM hybrid model

P Glorioso, Q Anthony, Y Tokpanov… - arXiv preprint arXiv …, 2024 - arxiv.org
In this technical report, we present Zamba, a novel 7B SSM-transformer hybrid model which
achieves competitive performance against leading open-weight models at a comparable …