Parallelizing linear transformers with the delta rule over sequence length

S Yang, B Wang, Y Zhang, Y Shen, Y Kim - arXiv preprint arXiv …, 2024 - arxiv.org
Transformers with linear attention (i.e., linear transformers) and state-space models have
recently been suggested as a viable linear-time alternative to transformers with softmax …
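
The snippet refers to the delta-rule update for linear attention. As a reading aid, here is a naive sequential PyTorch sketch of that recurrence, S_t = S_{t-1}(I - beta_t k_t k_t^T) + beta_t v_t k_t^T with output o_t = S_t q_t; the paper itself concerns a chunkwise-parallel algorithm over sequence length, which this loop does not reproduce, and the shapes and key normalization below are assumptions for illustration.

```python
import torch

def delta_rule_recurrence(q, k, v, beta):
    """Naive sequential reference for the delta-rule recurrence
    S_t = S_{t-1}(I - beta_t k_t k_t^T) + beta_t v_t k_t^T,  o_t = S_t q_t.
    Shapes (assumed): q, k: (T, d_k); v: (T, d_v); beta: (T,).
    The paper's contribution is a chunkwise-parallel algorithm that avoids
    this O(T) sequential loop; this version is only for readability."""
    d_v, d_k = v.shape[-1], k.shape[-1]
    S = torch.zeros(d_v, d_k, dtype=q.dtype)      # recurrent "fast weight" state
    outputs = []
    for t in range(q.shape[0]):
        k_t = k[t] / k[t].norm().clamp(min=1e-6)  # L2-normalized key (common choice)
        v_pred = S @ k_t                          # value currently stored under k_t
        # delta rule: move the stored value for k_t toward the target v_t
        S = S + beta[t] * torch.outer(v[t] - v_pred, k_t)
        outputs.append(S @ q[t])                  # read out with the query
    return torch.stack(outputs)                   # (T, d_v)
```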

ARFlow: Autoregressive Flow with Hybrid Linear Attention

M Hui, RJ Zhu, S Yang, Y Zhang, Z Wang… - arXiv preprint arXiv …, 2025 - arxiv.org
Flow models are effective at progressively generating realistic images, but they generally
struggle to capture long-range dependencies during the generation process as they …

CLEAR: Conv-Like Linearization Revs Pre-Trained Diffusion Transformers Up

S Liu, Z Tan, X Wang - arXiv preprint arXiv:2412.16112, 2024 - arxiv.org
Diffusion Transformers (DiT) have become a leading architecture in image generation.
However, the quadratic complexity of attention mechanisms, which are responsible for …
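
The abstract points to replacing quadratic attention with a convolution-like local form. The generic sliding-window attention sketch below illustrates why restricting each query to a fixed-radius neighborhood makes the cost scale linearly with sequence length; the window shape, any 2-D image-token layout, and how the pre-trained DiT is adapted are not taken from the paper.

```python
import torch
import torch.nn.functional as F

def local_window_attention(q, k, v, radius=8):
    """Generic sliding-window ("conv-like") attention: each query attends only
    to keys within a fixed 1-D radius, so the work per query is constant and
    total cost is linear in sequence length.  Shapes (assumed): q, k, v: (T, d).
    The dense (T, T) score matrix below is kept only for clarity; a real
    linear-time kernel would compute just the banded scores."""
    T, d = q.shape
    scores = q @ k.T / d ** 0.5
    idx = torch.arange(T)
    outside = (idx[None, :] - idx[:, None]).abs() > radius   # True outside the band
    scores = scores.masked_fill(outside, float("-inf"))
    return F.softmax(scores, dim=-1) @ v                     # (T, d)
```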

Forgetting Transformer: Softmax Attention with a Forget Gate

Z Lin, E Nikishin, X He, A Courville - The Thirteenth International … - openreview.net
An essential component of modern recurrent sequence models is the *forget gate*. While
Transformers do not have an explicit recurrent form, we show that a forget gate can be …
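
As a reading aid for the forget-gate idea, here is a minimal PyTorch sketch that folds a scalar, data-dependent forget gate into causal softmax attention as an additive bias of accumulated log-forget values. The sigmoid parameterization and the single-head, unbatched shapes are assumptions for illustration, not necessarily the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def forgetting_attention(q, k, v, forget_logits):
    """Causal softmax attention with a scalar forget gate folded in as an
    additive bias: with f_t = sigmoid(forget_logits[t]), the score of query i
    for key j (j <= i) gains sum_{t=j+1..i} log f_t, so older positions are
    down-weighted by every gate in between.  Shapes (assumed): q, k, v: (T, d);
    forget_logits: (T,)."""
    T, d = q.shape
    log_f = F.logsigmoid(forget_logits)          # log f_t, each <= 0
    c = torch.cumsum(log_f, dim=0)               # c_i = sum_{t <= i} log f_t
    bias = c[:, None] - c[None, :]               # bias[i, j] = sum_{t=j+1..i} log f_t
    scores = q @ k.T / d ** 0.5 + bias
    future = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(future, float("-inf"))
    return F.softmax(scores, dim=-1) @ v         # (T, d)
```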

FlashSampling: Fast and Memory-Efficient Exact Sampling with Group-Gumbel-Max

Z Qin, X Shen, Y Zhang, Y Zhong - openreview.net
Sampling operations in discrete space are widely used in different fields such as language
models, reinforcement learning, VAE, GAN, and neural architecture search. Current …
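
The method name points to the Gumbel-Max trick for exact categorical sampling. The sketch below shows only the standard trick, argmax_i (logits_i + g_i) with g_i ~ Gumbel(0, 1), which draws an exact sample from softmax(logits); the grouped variant and the memory-efficiency machinery suggested by the title are not reproduced here.

```python
import torch

def gumbel_max_sample(logits):
    """Standard Gumbel-Max trick: argmax_i (logits_i + g_i) with
    g_i ~ Gumbel(0, 1) is an exact sample from the categorical distribution
    softmax(logits).  Only the classic trick is shown here."""
    u = torch.rand_like(logits).clamp(min=1e-12, max=1 - 1e-12)
    gumbel = -torch.log(-torch.log(u))           # Gumbel(0, 1) noise
    return torch.argmax(logits + gumbel, dim=-1)
```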