Optimizing Attention with Mirror Descent: Generalized Max-Margin Token Selection

AAK Julistiono, DA Tarzanagh, N Azizan - arXiv preprint arXiv:2410.14581, 2024 - arxiv.org
Attention mechanisms have revolutionized several domains of artificial intelligence, such as
natural language processing and computer vision, by enabling models to selectively focus …

Training Dynamics of In-Context Learning in Linear Attention

Y Zhang, AK Singh, PE Latham, A Saxe - arXiv preprint arXiv:2501.16265, 2025 - arxiv.org
While attention-based models have demonstrated a remarkable ability for in-context
learning, the theoretical understanding of how these models acquire this ability through …

On the Learn-to-Optimize Capabilities of Transformers in In-Context Sparse Recovery

R Liu, R Zhou, C Shen, J Yang - arXiv preprint arXiv:2410.13981, 2024 - arxiv.org
An intriguing property of the Transformer is its ability to perform in-context learning (ICL),
where the Transformer can solve different inference tasks without parameter updates, based …