Proportional response: Contextual bandits for simple and cumulative regret minimization
In many applications, e.g., in healthcare and e-commerce, the goal of a contextual bandit may
be to learn an optimal treatment assignment policy at the end of the experiment. That is, to …
On sample-efficient offline reinforcement learning: Data diversity, posterior sampling and beyond
We seek to understand what facilitates sample-efficient learning from historical datasets for
sequential decision-making, a problem that is popularly known as offline reinforcement …
Minimax-optimal reward-agnostic exploration in reinforcement learning
This paper studies reward-agnostic exploration in reinforcement learning (RL), a scenario
where the learner is unaware of the reward functions during the exploration stage, and …
Positivity-free policy learning with observational data
Policy learning utilizing observational data is pivotal across various domains, with the
objective of learning the optimal treatment assignment policy while adhering to specific …
Oracle-efficient pessimism: Offline policy optimization in contextual bandits
L Wang, A Krishnamurthy… - … Conference on Artificial …, 2024 - proceedings.mlr.press
We consider offline policy optimization (OPO) in contextual bandits, where one is given a
fixed dataset of logged interactions. While pessimistic regularizers are typically used to …
Dual active learning for reinforcement learning from human feedback
Aligning large language models (LLMs) with human preferences is critical to recent
advances in generative artificial intelligence. Reinforcement learning from human feedback …
Importance-weighted offline learning done right
We study the problem of offline policy optimization in stochastic contextual bandit problems,
where the goal is to learn a near-optimal policy based on a dataset of decision data …
Off-policy estimation with adaptively collected data: the power of online learning
J Lee, C Ma - Advances in Neural Information Processing …, 2025 - proceedings.neurips.cc
We consider estimation of a linear functional of the treatment effect from adaptively collected
data. This problem finds a variety of applications including off-policy evaluation in contextual …
Individualized policy evaluation and learning under clustered network interference
While there now exists a large literature on policy evaluation and learning, much of prior
work assumes that the treatment assignment of one unit does not affect the outcome of …
Robust Offline Policy Learning with Observational Data from Multiple Sources
We consider the problem of using observational bandit feedback data from multiple
heterogeneous data sources to learn a personalized decision policy that robustly …