Small steps no more: Global convergence of stochastic gradient bandits for arbitrary learning rates

J Mei, B Dai, A Agarwal, S Vaswani, A Raj… - arXiv preprint, 2025 - arxiv.org
We provide a new understanding of the stochastic gradient bandit algorithm by showing that
it converges to a globally optimal policy almost surely using any constant learning …
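The abstract refers to the stochastic gradient bandit with a constant learning rate. A minimal sketch of the standard softmax gradient bandit update (the Bernoulli reward model and all names here are illustrative assumptions, not taken from the paper):

```python
import numpy as np

def gradient_bandit(reward_fn, n_arms, steps=5000, alpha=0.2, seed=0):
    """Softmax stochastic gradient bandit with a constant step size alpha.

    theta holds per-arm preferences; actions are sampled from
    pi = softmax(theta), and theta is updated with the stochastic
    policy gradient r * (one_hot(a) - pi).
    """
    rng = np.random.default_rng(seed)
    theta = np.zeros(n_arms)
    for _ in range(steps):
        pi = np.exp(theta - theta.max())
        pi /= pi.sum()                      # softmax policy
        a = rng.choice(n_arms, p=pi)        # sample an arm
        r = reward_fn(a, rng)               # observe a stochastic reward
        grad = -pi * r                      # r * (one_hot(a) - pi)
        grad[a] += r
        theta += alpha * grad               # constant learning rate
    return theta

# Usage: three Bernoulli arms with means [0.2, 0.5, 0.9]; the preference
# for the best arm should dominate after enough steps.
means = np.array([0.2, 0.5, 0.9])
theta = gradient_bandit(lambda a, rng: float(rng.random() < means[a]), n_arms=3)
best = int(np.argmax(theta))
```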

Fast Convergence of Softmax Policy Mirror Ascent

R Asad, R Babanezhad, I Laradji, NL Roux… - arXiv preprint, 2024 - arxiv.org
Natural policy gradient (NPG) is a common policy optimization algorithm and can be viewed
as mirror ascent in the space of probabilities. Recently, Vaswani et al. [2021] introduced a …
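The view of NPG as mirror ascent over probability distributions can be illustrated with the classical negative-entropy (KL) mirror map, under which a mirror-ascent step is exponentiated gradient ascent on the simplex. This is a generic sketch of that standard update, not the SPMA algorithm from the paper:

```python
import numpy as np

def mirror_ascent_step(pi, grad, eta=0.5):
    """One mirror-ascent step on the probability simplex with the
    negative-entropy mirror map: multiply by exp(eta * grad), renormalize."""
    new = pi * np.exp(eta * grad)
    return new / new.sum()

# Usage: maximize the linear objective <pi, r> over the simplex.
# The iterates concentrate on the highest-reward coordinate.
r = np.array([0.2, 0.5, 0.9])
pi = np.ones(3) / 3
for _ in range(200):
    pi = mirror_ascent_step(pi, r, eta=0.5)
```

With a linear objective, each step multiplies the ratio between two coordinates by exp(eta * (r_i - r_j)), so the mass on the best coordinate grows geometrically.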

Fast Convergence of Softmax Policy Mirror Ascent for Bandits & Tabular MDPs

R Asad, RB Harikandeh, IH Laradji, N Le Roux… - OPT 2024: Optimization … - openreview.net
We analyze the convergence of a novel policy gradient algorithm (referred to as SPMA) for
multi-armed bandits and tabular Markov decision processes (MDPs). SPMA is an …