On statistical rates and provably efficient criteria of latent diffusion transformers (DiTs)

JYC Hu, W Wu, Z Li, S Pi, Z Song… - Advances in Neural …, 2025 - proceedings.neurips.cc
We investigate the statistical and computational limits of latent Diffusion Transformers (DiTs)
under the low-dimensional linear latent space assumption. Statistically, we study the …

Outlier-efficient Hopfield layers for large transformer-based models

JYC Hu, PH Chang, R Luo, HY Chen, W Li… - arXiv preprint arXiv …, 2024 - arxiv.org
We introduce an Outlier-Efficient Modern Hopfield Model (termed $\mathrm{OutEffHop}$)
and use it to address the outlier inefficiency problem of training gigantic transformer-based …

Tensor attention training: Provably efficient learning of higher-order transformers

Y Liang, Z Shi, Z Song, Y Zhou - arXiv preprint arXiv:2405.16411, 2024 - arxiv.org
Tensor Attention, a multi-view attention that is able to capture high-order correlations among
multiple modalities, can overcome the representational limitations of classical matrix …
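
(For orientation, and not necessarily the paper's exact notation: one common formulation of tensor attention in this line of work replaces the $n \times n$ score matrix of standard attention with an $n \times n^2$ score matrix built from two key matrices via a column-wise Kronecker product $\oslash$, i.e., up to scaling, $\mathrm{TensorAttn}(Q, K_1, K_2, V_1, V_2) = D^{-1} \exp\big(Q (K_1 \oslash K_2)^\top\big) (V_1 \oslash V_2)$ with $D = \mathrm{diag}\big(\exp(Q (K_1 \oslash K_2)^\top)\, \mathbf{1}_{n^2}\big)$, so that each query attends to pairs of key positions rather than single positions.)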

Uniform memory retrieval with larger capacity for modern Hopfield models

D Wu, JYC Hu, TY Hsiao, H Liu - arXiv preprint arXiv:2404.03827, 2024 - arxiv.org
We propose a two-stage memory retrieval dynamics for modern Hopfield models, termed
$\mathtt{U\text{-}Hop}$, with enhanced memory capacity. Our key contribution is a …

On computational limits of modern Hopfield models: A fine-grained complexity analysis

JYC Hu, T Lin, Z Song, H Liu - arXiv preprint arXiv:2402.04520, 2024 - arxiv.org
We investigate the computational limits of the memory retrieval dynamics of modern Hopfield
models through fine-grained complexity analysis. Our key contribution is the …
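
(Background for the two Hopfield entries above, following the standard modern Hopfield formulation rather than either paper's exact variant: with memory patterns $\Xi = [\xi_1, \dots, \xi_M] \in \mathbb{R}^{d \times M}$, inverse temperature $\beta > 0$, and query $x \in \mathbb{R}^d$, one retrieval step is $x^{\mathrm{new}} = \Xi\, \mathrm{softmax}(\beta\, \Xi^\top x)$; evaluating this map for many queries against many stored patterns has the same bilinear structure, and hence essentially the same quadratic cost, as softmax attention, which is the kind of cost the fine-grained analysis concerns.)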

Multi-layer transformers gradient can be approximated in almost linear time

Y Liang, Z Sha, Z Shi, Z Song, Y Zhou - arXiv preprint arXiv:2408.13233, 2024 - arxiv.org
The computational complexity of the self-attention mechanism in popular transformer
architectures poses significant challenges for training and inference, and becomes the …
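
(For context on the complexity being referenced, using the standard formulation rather than the paper's notation: a single self-attention layer computes $\mathrm{Attn}(X) = D^{-1} \exp\big(Q K^\top / \sqrt{d}\big) V$ with $Q = X W_Q$, $K = X W_K$, $V = X W_V$, and $D = \mathrm{diag}\big(\exp(Q K^\top/\sqrt{d})\, \mathbf{1}_n\big)$; the $n \times n$ matrix $\exp(Q K^\top/\sqrt{d})$ makes both the forward pass and exact backpropagation cost $O(n^2 d)$ in the sequence length $n$, which is the barrier the almost-linear-time gradient approximation in this entry is aimed at.)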

HSR-enhanced sparse attention acceleration

B Chen, Y Liang, Z Sha, Z Shi, Z Song - arXiv preprint arXiv:2410.10165, 2024 - arxiv.org
Large Language Models (LLMs) have demonstrated remarkable capabilities across various
applications, but their performance on long-context tasks is often limited by the …

Out-of-distribution generalization via composition: a lens through induction heads in transformers

J Song, Z Xu, Y Zhong - Proceedings of the National Academy of Sciences, 2025 - pnas.org
Large language models (LLMs) such as GPT-4 sometimes appear to be creative, solving
novel tasks often with a few demonstrations in the prompt. These tasks require the models to …

The closeness of in-context learning and weight shifting for softmax regression

S Li, Z Song, Y Xia, T Yu, T Zhou - arXiv preprint arXiv:2304.13276, 2023 - arxiv.org
Large language models (LLMs) are known for their exceptional performance in natural
language processing, making them highly effective in many human life-related or even job …
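
(Softmax regression, as typically formulated in this line of work, though the paper's exact setup may differ: given $A \in \mathbb{R}^{n \times d}$ and a target $b \in \mathbb{R}^n$, solve $\min_{x \in \mathbb{R}^d} \big\| \langle \exp(Ax), \mathbf{1}_n \rangle^{-1} \exp(Ax) - b \big\|_2$, i.e., fit a normalized exponential of a linear map to $b$; this single-query analogue of one attention row is the model for which the entry compares in-context learning against gradient-style weight shifting.)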