Proportional response: Contextual bandits for simple and cumulative regret minimization

SK Krishnamurthy, R Zhan, S Athey… - Advances in Neural …, 2023 - proceedings.neurips.cc
In many applications, e.g., in healthcare and e-commerce, the goal of a contextual bandit may
be to learn an optimal treatment assignment policy at the end of the experiment. That is, to …

On sample-efficient offline reinforcement learning: Data diversity, posterior sampling and beyond

T Nguyen-Tang, R Arora - Advances in Neural Information …, 2023 - proceedings.neurips.cc
We seek to understand what facilitates sample-efficient learning from historical datasets for
sequential decision-making, a problem that is popularly known as offline reinforcement …

Minimax-optimal reward-agnostic exploration in reinforcement learning

G Li, Y Yan, Y Chen, J Fan - The Thirty Seventh Annual …, 2024 - proceedings.mlr.press
This paper studies reward-agnostic exploration in reinforcement learning (RL)—a scenario
where the learner is unaware of the reward functions during the exploration stage—and …

Positivity-free policy learning with observational data

P Zhao, A Chambaz, J Josse… - … Conference on Artificial …, 2024 - proceedings.mlr.press
Policy learning utilizing observational data is pivotal across various domains, with the
objective of learning the optimal treatment assignment policy while adhering to specific …

Oracle-efficient pessimism: Offline policy optimization in contextual bandits

L Wang, A Krishnamurthy… - … Conference on Artificial …, 2024 - proceedings.mlr.press
We consider offline policy optimization (OPO) in contextual bandits, where one is given a
fixed dataset of logged interactions. While pessimistic regularizers are typically used to …

Dual active learning for reinforcement learning from human feedback

P Liu, C Shi, WW Sun - arXiv preprint arXiv:2410.02504, 2024 - arxiv.org
Aligning large language models (LLMs) with human preferences is critical to recent
advances in generative artificial intelligence. Reinforcement learning from human feedback …

Importance-weighted offline learning done right

G Gabbianelli, G Neu, M Papini - … Conference on Algorithmic …, 2024 - proceedings.mlr.press
We study the problem of offline policy optimization in stochastic contextual bandit problems,
where the goal is to learn a near-optimal policy based on a dataset of decision data …

Off-policy estimation with adaptively collected data: the power of online learning

J Lee, C Ma - Advances in Neural Information Processing …, 2025 - proceedings.neurips.cc
We consider estimation of a linear functional of the treatment effect from adaptively collected
data. This problem finds a variety of applications including off-policy evaluation in contextual …

Individualized policy evaluation and learning under clustered network interference

Y Zhang, K Imai - arXiv preprint arXiv:2311.02467, 2023 - arxiv.org
While there now exists a large literature on policy evaluation and learning, much of prior
work assumes that the treatment assignment of one unit does not affect the outcome of …

Robust Offline Policy Learning with Observational Data from Multiple Sources

AG Carranza, S Athey - arXiv preprint arXiv:2410.08537, 2024 - arxiv.org
We consider the problem of using observational bandit feedback data from multiple
heterogeneous data sources to learn a personalized decision policy that robustly …