Proportional response: Contextual bandits for simple and cumulative regret minimization

SK Krishnamurthy, R Zhan, S Athey… - Advances in Neural …, 2023 - proceedings.neurips.cc
In many applications, e.g., in healthcare and e-commerce, the goal of a contextual bandit may
be to learn an optimal treatment assignment policy at the end of the experiment. That is, to …

On sample-efficient offline reinforcement learning: Data diversity, posterior sampling and beyond

T Nguyen-Tang, R Arora - Advances in Neural Information …, 2023 - proceedings.neurips.cc
We seek to understand what facilitates sample-efficient learning from historical datasets for
sequential decision-making, a problem that is popularly known as offline reinforcement …

Minimax-optimal reward-agnostic exploration in reinforcement learning

G Li, Y Yan, Y Chen, J Fan - The Thirty Seventh Annual …, 2024 - proceedings.mlr.press
This paper studies reward-agnostic exploration in reinforcement learning (RL)—a scenario
where the learner is unaware of the reward functions during the exploration stage—and …

Positivity-free policy learning with observational data

P Zhao, A Chambaz, J Josse… - … Conference on Artificial …, 2024 - proceedings.mlr.press
Policy learning utilizing observational data is pivotal across various domains, with the
objective of learning the optimal treatment assignment policy while adhering to specific …

Oracle-efficient pessimism: Offline policy optimization in contextual bandits

L Wang, A Krishnamurthy… - … Conference on Artificial …, 2024 - proceedings.mlr.press
We consider offline policy optimization (OPO) in contextual bandits, where one is given a
fixed dataset of logged interactions. While pessimistic regularizers are typically used to …

Dual active learning for reinforcement learning from human feedback

P Liu, C Shi, WW Sun - arXiv preprint arXiv:2410.02504, 2024 - arxiv.org
Aligning large language models (LLMs) with human preferences is critical to recent
advances in generative artificial intelligence. Reinforcement learning from human feedback …

Importance-weighted offline learning done right

G Gabbianelli, G Neu, M Papini - … Conference on Algorithmic …, 2024 - proceedings.mlr.press
We study the problem of offline policy optimization in stochastic contextual bandit problems,
where the goal is to learn a near-optimal policy based on a dataset of decision data …

Off-policy estimation with adaptively collected data: the power of online learning

J Lee, C Ma - Advances in Neural Information Processing …, 2025 - proceedings.neurips.cc
We consider estimation of a linear functional of the treatment effect from adaptively collected
data. This problem finds a variety of applications including off-policy evaluation in contextual …

Individualized policy evaluation and learning under clustered network interference

Y Zhang, K Imai - arXiv preprint arXiv:2311.02467, 2023 - arxiv.org
While there now exists a large literature on policy evaluation and learning, much of prior
work assumes that the treatment assignment of one unit does not affect the outcome of …

Robust Offline Policy Learning with Observational Data from Multiple Sources

AG Carranza, S Athey - arXiv preprint arXiv:2410.08537, 2024 - arxiv.org
We consider the problem of using observational bandit feedback data from multiple
heterogeneous data sources to learn a personalized decision policy that robustly …