Massive activations in large language models

M Sun, X Chen, JZ Kolter, Z Liu - arXiv preprint arXiv:2402.17762, 2024 - arxiv.org
We observe an empirical phenomenon in Large Language Models (LLMs): very few
activations exhibit significantly larger values than others (e.g., 100,000 times larger). We call …
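The phenomenon the abstract describes can be probed with a simple magnitude check: compare each activation's absolute value against the median magnitude of the tensor and flag extreme outliers. The function below is a minimal illustrative sketch, not the paper's method; the `ratio_threshold` and the captured `hidden_states` tensor are assumptions for the example (the paper reports outliers up to roughly 100,000x typical values).

```python
import numpy as np

def find_massive_activations(hidden_states, ratio_threshold=1000.0):
    """Flag activations whose magnitude dwarfs the typical (median) magnitude.

    hidden_states: hypothetical captured tensor of shape (seq_len, hidden_dim).
    ratio_threshold: illustrative cutoff; an activation is "massive" if its
    absolute value exceeds ratio_threshold * median absolute value.
    Returns (positions, median_magnitude).
    """
    mags = np.abs(hidden_states)
    median_mag = np.median(mags)
    positions = np.argwhere(mags > ratio_threshold * median_mag)
    return positions, median_mag

# Toy example: mostly unit-scale values with one planted outlier.
rng = np.random.default_rng(0)
h = rng.normal(0.0, 1.0, size=(8, 16))
h[3, 5] = 50_000.0  # planted "massive" activation
positions, med = find_massive_activations(h)
print(positions)
```

In practice such a check would be run on hidden states captured via forward hooks at each layer; the toy tensor here just makes the outlier detection concrete.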

Unveil benign overfitting for transformer in vision: Training dynamics, convergence, and generalization

J Jiang, W Huang, M Zhang… - Advances in Neural …, 2025 - proceedings.neurips.cc
Transformers have demonstrated great power in the recent development of large
foundational models. In particular, the Vision Transformer (ViT) has brought revolutionary …