Small steps no more: Global convergence of stochastic gradient bandits for arbitrary learning rates
We provide a new understanding of the stochastic gradient bandit algorithm by showing that
it converges to a globally optimal policy almost surely using any constant learning …
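As a point of reference for the claim above, the standard stochastic gradient bandit algorithm maintains action preferences and updates them by softmax policy gradient with a constant step size. A minimal sketch on a toy Gaussian bandit (the arm means, noise scale, learning rate, and horizon here are illustrative assumptions, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
means = np.array([0.1, 0.5, 0.9])  # assumed true arm means
theta = np.zeros(3)                # action preferences (softmax logits)
alpha = 0.5                        # a constant learning rate
baseline, n = 0.0, 0               # running-average reward baseline

for t in range(20000):
    pi = np.exp(theta - theta.max())
    pi /= pi.sum()                 # softmax policy
    a = rng.choice(3, p=pi)
    r = means[a] + rng.normal(0.0, 0.1)
    n += 1
    baseline += (r - baseline) / n
    grad = -pi
    grad[a] += 1.0                 # grad of log pi(a) w.r.t. theta
    theta += alpha * (r - baseline) * grad
```

After enough steps the policy concentrates on the best arm; the paper's contribution is showing this convergence holds almost surely for any constant `alpha`, not just small ones.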
Fast Convergence of Softmax Policy Mirror Ascent
Natural policy gradient (NPG) is a common policy optimization algorithm and can be viewed
as mirror ascent in the space of probabilities. Recently, Vaswani et al. [2021] introduced a …
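The "mirror ascent in the space of probabilities" view of NPG corresponds, for a softmax policy on a bandit, to the exponentiated-gradient update pi'(a) ∝ pi(a) · exp(eta · r(a)). A minimal sketch with an exact (noise-free) reward vector; the rewards and step size `eta` are illustrative assumptions:

```python
import numpy as np

r = np.array([0.2, 0.4, 0.8])  # assumed mean rewards, exact-gradient setting
pi = np.ones(3) / 3            # start from the uniform policy
eta = 0.1                      # mirror ascent step size

for _ in range(500):
    pi = pi * np.exp(eta * r)  # multiplicative (exponentiated-gradient) step
    pi /= pi.sum()             # project back onto the simplex
```

Each iteration multiplies probability mass onto higher-reward arms, so `pi` concentrates geometrically on the best arm.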
Fast Convergence of Softmax Policy Mirror Ascent for Bandits & Tabular MDPs
We analyze the convergence of a novel policy gradient algorithm (referred to as SPMA) for
multi-armed bandits and tabular Markov decision processes (MDPs). SPMA is an …