A mathematical perspective on transformers

B Geshkovski, C Letrouit, Y Polyanskiy… - arXiv preprint arXiv …, 2023 - arxiv.org
Transformers play a central role in the inner workings of large language models. We
develop a mathematical framework for analyzing Transformers based on their interpretation …

From local structures to size generalization in graph neural networks

G Yehudai, E Fetaya, E Meirom… - International …, 2021 - proceedings.mlr.press
Graph neural networks (GNNs) can process graphs of different sizes, but their ability to
generalize across sizes, specifically from small to large graphs, is still not well understood. In …
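A minimal NumPy sketch of the point that one shared set of message-passing parameters applies to graphs of any size; the sum aggregation, layer sizes, and random graphs below are illustrative assumptions, not the construction studied in the paper.

```python
import numpy as np

def gnn_layer(A, H, W):
    """One message-passing layer: sum-aggregate neighbor features, then
    apply a shared linear map and a nonlinearity. The parameters W are
    independent of the graph size."""
    return np.tanh((A @ H) @ W)

def readout(H):
    """Graph-level readout: mean-pool node features."""
    return H.mean(axis=0)

rng = np.random.default_rng(0)
d, h = 3, 4
W = rng.normal(size=(d, h))

# The same layer processes a small graph and a larger graph.
for n in (4, 10):
    A = (rng.uniform(size=(n, n)) < 0.4).astype(float)
    A = np.maximum(A, A.T)                        # undirected adjacency
    H = rng.normal(size=(n, d))
    print(n, readout(gnn_layer(A, H, W)).shape)   # graph embedding of size (4,)
```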

Sinkformers: Transformers with doubly stochastic attention

ME Sander, P Ablin, M Blondel… - … Conference on Artificial …, 2022 - proceedings.mlr.press
Attention-based models such as Transformers involve pairwise interactions between data
points, modeled with a learnable attention matrix. Importantly, this attention matrix is …
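A minimal NumPy sketch of the idea behind doubly stochastic attention: the usual row-wise softmax is replaced by a few Sinkhorn normalization steps that push the attention matrix toward being doubly stochastic. Function names, dimensions, and the number of iterations are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def sinkhorn_attention(X, Wq, Wk, Wv, n_iter=3):
    """Self-attention whose attention matrix is pushed toward a doubly
    stochastic matrix by alternating row/column normalizations (Sinkhorn)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    logits = Q @ K.T / np.sqrt(K.shape[1])
    A = np.exp(logits - logits.max())          # positive kernel matrix
    for _ in range(n_iter):                    # Sinkhorn iterations
        A /= A.sum(axis=1, keepdims=True)      # normalize rows
        A /= A.sum(axis=0, keepdims=True)      # normalize columns
    return A @ V

rng = np.random.default_rng(0)
n, d = 5, 4
X = rng.normal(size=(n, d))
W = [rng.normal(size=(d, d)) for _ in range(3)]
print(sinkhorn_attention(X, *W).shape)         # (5, 4)
```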

The exact sample complexity gain from invariances for kernel regression

B Tahmasebi, S Jegelka - Advances in Neural Information …, 2023 - proceedings.neurips.cc
In practice, encoding invariances into models improves sample complexity. In this work, we
study this phenomenon from a theoretical perspective. In particular, we provide minimax …

Learning with norm constrained, over-parameterized, two-layer neural networks

F Liu, L Dadi, V Cevher - Journal of Machine Learning Research, 2024 - jmlr.org
Recent studies show that a reproducing kernel Hilbert space (RKHS) is not a suitable space
for modeling functions learned by neural networks, as the curse of dimensionality (CoD) cannot be …

How smooth is attention?

V Castin, P Ablin, G Peyré - arXiv preprint arXiv:2312.14820, 2023 - arxiv.org
Self-attention and masked self-attention are at the heart of Transformers' outstanding
success. Still, our mathematical understanding of attention, in particular of its Lipschitz …
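One hedged way to make the Lipschitz question concrete is a numerical probe: perturb the input to a plain self-attention map and record the largest observed stretch ratio, which lower-bounds the local Lipschitz constant. The sketch below is such a probe under arbitrary random weights; it is not the paper's analysis.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Plain row-softmax self-attention."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    return softmax(Q @ K.T / np.sqrt(K.shape[1])) @ V

def local_lipschitz_estimate(X, Ws, n_probes=200, eps=1e-4, seed=0):
    """Crude lower bound on the local Lipschitz constant near X:
    max ||f(X+d) - f(X)|| / ||d|| over random small perturbations d."""
    rng = np.random.default_rng(seed)
    base, best = self_attention(X, *Ws), 0.0
    for _ in range(n_probes):
        d = rng.normal(size=X.shape)
        d *= eps / np.linalg.norm(d)
        best = max(best, np.linalg.norm(self_attention(X + d, *Ws) - base) / eps)
    return best

rng = np.random.default_rng(1)
n, dim = 6, 4
X = rng.normal(size=(n, dim))
Ws = [rng.normal(size=(dim, dim)) for _ in range(3)]
print(local_lipschitz_estimate(X, Ws))
```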

Universal approximation of symmetric and anti-symmetric functions

J Han, Y Li, L Lin, J Lu, J Zhang, L Zhang - arXiv preprint arXiv …, 2019 - arxiv.org
We consider universal approximations of symmetric and anti-symmetric functions, which are
important for applications in quantum physics, as well as other scientific and engineering …
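For intuition, two standard building blocks in this line of work can be written in a few lines: a sum-pooled network that is permutation-symmetric by construction, and a determinant-based (Slater-style) ansatz that is anti-symmetric by construction. The sketch below is a toy illustration with assumed layer shapes, not the constructions analyzed in the paper.

```python
import numpy as np

def phi(X, W1, W2):
    """Per-particle feature map: a small two-layer network applied row-wise."""
    return np.tanh(X @ W1) @ W2

def symmetric_net(X, W1, W2, w):
    """Permutation-symmetric by construction: sum-pool particle features."""
    return float(np.tanh(phi(X, W1, W2).sum(axis=0)) @ w)

def antisymmetric_net(X, W1, W2):
    """Anti-symmetric by construction: determinant of a square feature matrix."""
    return float(np.linalg.det(phi(X, W1, W2)))

rng = np.random.default_rng(0)
n, d, h = 3, 2, 5
X = rng.normal(size=(n, d))
W1 = rng.normal(size=(d, h))
W2s = rng.normal(size=(h, h))   # hidden -> hidden for the symmetric net
W2a = rng.normal(size=(h, n))   # hidden -> n features so the matrix is square
w = rng.normal(size=h)

# Swapping two rows leaves the symmetric net unchanged and flips the sign
# of the anti-symmetric net.
Xswap = X[[1, 0, 2]]
print(np.isclose(symmetric_net(X, W1, W2s, w), symmetric_net(Xswap, W1, W2s, w)))
print(np.isclose(antisymmetric_net(X, W1, W2a), -antisymmetric_net(Xswap, W1, W2a)))
```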

Deep neural network approximation of invariant functions through dynamical systems

Q Li, T Lin, Z Shen - Journal of Machine Learning Research, 2024 - jmlr.org
We study the approximation of functions which are invariant with respect to certain
permutations of the input indices using flow maps of dynamical systems. Such invariant …
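A hedged sketch of the flow-map idea: integrate a permutation-equivariant vector field (a shared per-index map plus a pooled statistic) and finish with a symmetric readout, so the overall map is permutation-invariant. The specific field, step size, and readout below are illustrative assumptions, not the paper's construction.

```python
import numpy as np

def equivariant_field(X, W, W_pool):
    """Permutation-equivariant vector field: each index is updated by a
    shared map of its own state and a pooled (mean) statistic."""
    pooled = X.mean(axis=0, keepdims=True)
    return np.tanh(X @ W + pooled @ W_pool)

def flow_map(X, W, W_pool, steps=20, dt=0.05):
    """Forward-Euler integration of the equivariant dynamics."""
    for _ in range(steps):
        X = X + dt * equivariant_field(X, W, W_pool)
    return X

def invariant_readout(X, v):
    """Sum-pool after the flow, so the overall map is permutation-invariant."""
    return float(X.sum(axis=0) @ v)

rng = np.random.default_rng(0)
n, d = 5, 3
X = rng.normal(size=(n, d))
W, W_pool = rng.normal(size=(d, d)), rng.normal(size=(d, d))
v = rng.normal(size=d)

perm = rng.permutation(n)
out1 = invariant_readout(flow_map(X, W, W_pool), v)
out2 = invariant_readout(flow_map(X[perm], W, W_pool), v)
print(np.isclose(out1, out2))  # True: invariant to permuting the input indices
```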

Learning theory of distribution regression with neural networks

Z Shi, Z Yu, DX Zhou - arXiv preprint arXiv:2307.03487, 2023 - arxiv.org
In this paper, we aim to establish an approximation theory and a learning theory of
distribution regression via a fully connected neural network (FNN). In contrast to the classical …
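A minimal sketch of the generic two-stage distribution-regression pipeline such theories formalize: each input distribution is observed only through a bag of i.i.d. samples, the samples are embedded and mean-pooled, and a regressor is fit on the pooled embeddings. Here the embedding is a random-feature layer and the second stage is ridge regression; these are assumptions for illustration, not the FNN construction of the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_bag(mu, m=100):
    """One input 'distribution', observed only through m i.i.d. samples."""
    return rng.normal(loc=mu, scale=1.0, size=(m, 1))

def pooled_embedding(bag, W1, b1):
    """Stage 1: embed each sample with a random-feature layer, then
    mean-pool (a Monte Carlo mean embedding of the distribution)."""
    return np.tanh(bag @ W1 + b1).mean(axis=0)

# Training set: distributions indexed by their mean mu, target y = mu^2.
h = 64
W1 = rng.normal(size=(1, h))
b1 = rng.normal(size=h)
mus = rng.uniform(-2, 2, size=200)
Z = np.stack([pooled_embedding(make_bag(mu), W1, b1) for mu in mus])
y = mus ** 2

# Stage 2: ridge regression on the pooled embeddings.
lam = 1e-3
w = np.linalg.solve(Z.T @ Z + lam * np.eye(h), Z.T @ y)

mu_test = 1.5
print(pooled_embedding(make_bag(mu_test), W1, b1) @ w, mu_test ** 2)
```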

Deep learning theory of distribution regression with CNNs

Z Yu, DX Zhou - Advances in Computational Mathematics, 2023 - Springer
We establish a deep learning theory for distribution regression with deep convolutional
neural networks (DCNNs). Deep learning based on structured deep neural networks has …