Max-margin token selection in attention mechanism
The attention mechanism is a central component of the transformer architecture, which led to the
phenomenal success of large language models. However, the theoretical principles …
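
As a point of reference for this entry and the next, here is a minimal numpy sketch of a single-head softmax attention layer; the shapes and variable names are ours and purely illustrative, and the max-margin token-selection analysis itself is not reproduced here.

import numpy as np

# A single-head softmax attention layer (no masking, no multi-head split);
# each output row is a convex combination of value vectors, so the softmax
# weights decide which tokens get "selected".
def attention(X, Wq, Wk, Wv):
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[1])           # pairwise token affinities
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)    # row-wise softmax
    return weights @ V

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))                          # 4 tokens, dimension 8
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
print(attention(X, Wq, Wk, Wv).shape)                # (4, 8)
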
Transformers as support vector machines
Since its inception in" Attention Is All You Need", transformer architecture has led to
revolutionary advancements in NLP. The attention layer within the transformer admits a …
Same pre-training loss, better downstream: Implicit bias matters for language models
Language modeling on large-scale datasets improves the performance of various
downstream tasks. The validation pre-training loss is often used as the evaluation metric for …
Saddle-to-saddle dynamics in diagonal linear networks
In this paper we fully describe the trajectory of gradient flow over $2$-layer diagonal linear
networks for the regression setting in the limit of vanishing initialisation. We show that the …
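
A discrete-time toy of the setting just described, under assumptions of our own choosing (noiseless sparse regression, a common choice of layer initialisation); the paper analyses the exact gradient-flow trajectory in the vanishing-initialisation limit, which this sketch only approximates.

import numpy as np

# Toy of the saddle-to-saddle picture: gradient descent on a 2-layer diagonal
# linear network f(x) = <u * v, x> with a small initialisation scale alpha.
rng = np.random.default_rng(1)
n, d = 40, 20
X = rng.normal(size=(n, d))
w_star = np.zeros(d)
w_star[:3] = [3.0, -2.0, 1.0]              # sparse ground truth
y = X @ w_star                             # noiseless targets

alpha, lr = 1e-3, 1e-2
u = np.sqrt(2) * alpha * np.ones(d)        # so that w = u * v starts at 0
v = np.zeros(d)
for _ in range(20000):
    g = X.T @ (X @ (u * v) - y) / n        # gradient w.r.t. the product w
    u, v = u - lr * g * v, v - lr * g * u  # chain rule through the factorisation
print(np.round(u * v, 2))                  # ends near the sparse w_star
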
On the implicit bias of initialization shape: Beyond infinitesimal mirror descent
Recent work has highlighted the role of initialization scale in determining the structure of the
solutions that gradient methods converge to. In particular, it was shown that large …
A precise high-dimensional asymptotic theory for boosting and minimum-l1-norm interpolated classifiers
The Annals of Statistics, 2022, Vol. 50, No. 3, 1669–1695 …
Implicit bias of gradient descent on reparametrized models: On equivalence to mirror descent
As part of the effort to understand the implicit bias of gradient descent in overparametrized
models, several results have shown how the training trajectory on the overparametrized …
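
One standard instance of such an equivalence, as a sketch under our own toy setup (not the paper's general construction): gradient descent on u with the reparametrisation w = u^2 tracks, to first order in the step size, mirror descent on w with mirror map $\phi(w)=\sum_i (w_i\log w_i - w_i)/4$, whose update is the unnormalised exponentiated-gradient step.

import numpy as np

# Compare GD on u (with w = u**2) against mirror descent on w under
# phi(w) = sum(w*log(w) - w)/4, i.e. the step w <- w * exp(-4*lr*grad(w)).
A = np.diag([1.0, 2.0, 3.0])
b = np.array([1.0, 0.5, 0.2])
grad = lambda w: A @ w - b                 # gradient of L(w) = w'Aw/2 - b'w

lr = 1e-3
u = np.sqrt(np.full(3, 0.5))               # GD iterate, parametrised via w = u**2
w = np.full(3, 0.5)                        # mirror-descent iterate
for _ in range(5000):
    u = u - lr * 2 * u * grad(u ** 2)      # chain rule: dL/du = 2u * grad(w)
    w = w * np.exp(-4 * lr * grad(w))      # mirror step under phi
print(np.max(np.abs(u ** 2 - w)))          # small: trajectories agree to O(lr)
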
Implicit bias of mirror flow on separable data
We examine the continuous-time counterpart of mirror descent, namely mirror flow, on
classification problems which are linearly separable. Such problems are minimised 'at …
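
For reference, mirror flow is the continuous-time limit of mirror descent: for a strictly convex potential $\phi$, the iterate $w(t)$ follows

$$\frac{\mathrm{d}}{\mathrm{d}t}\,\nabla\phi\bigl(w(t)\bigr) = -\nabla L\bigl(w(t)\bigr), \qquad \text{equivalently} \qquad \dot w(t) = -\bigl[\nabla^2\phi(w(t))\bigr]^{-1}\nabla L\bigl(w(t)\bigr).$$

On linearly separable data a logistic-type loss has no finite minimiser, so the iterates diverge and the implicit-bias question is which direction they diverge in.
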
Reparameterizing mirror descent as gradient descent
Most of the recent successful applications of neural networks have been based on training
with gradient descent updates. However, for some small networks, other mirror descent …
Convergence rates of gradient methods for convex optimization in the space of measures
L Chizat - Open Journal of Mathematical Optimization, 2022 - numdam.org
We study the convergence rate of Bregman gradient methods for convex optimization in the
space of measures on a d-dimensional manifold. Under basic regularity assumptions, we …
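
The paper's setting is the space of measures on a manifold; as a finite-dimensional sketch of a Bregman gradient step itself (our toy, not the paper's scheme): with the negative-entropy potential $\phi(x)=\sum_i x_i\log x_i$ on the simplex, the step $x^+ = \arg\min_x \eta\langle\nabla f(x_k), x\rangle + D_\phi(x, x_k)$ has the closed form of normalised exponentiated gradient.

import numpy as np

# One Bregman (mirror) gradient step under phi(x) = sum(x*log(x)) on the
# simplex: multiply by exp(-eta * gradient), then renormalise.
def bregman_step(x, g, eta):
    y = x * np.exp(-eta * g)
    return y / y.sum()

c = np.array([0.3, 0.1, 0.6])              # linear objective f(x) = <c, x>
x = np.full(3, 1 / 3)
for _ in range(200):
    x = bregman_step(x, c, eta=0.5)
print(np.round(x, 3))                      # mass concentrates on argmin(c)
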