📄 Abstract
This dissertation investigates how reinforcement learning (RL) methods can be
designed to be safe, sample-efficient, and robust. Framed through the unifying
perspective of contextual-bandit RL, the work addresses two major application
domains: ranking and recommendation, and text-to-image diffusion models. The
first part of the thesis develops theory and algorithms for safe deployment in
ranking systems. An exposure-based generalisation bound is derived, leading to
a counterfactual risk-minimisation objective whose solution is guaranteed not
to underperform the logging policy, even with sparse feedback. This guarantee
is extended to doubly robust estimators, enabling safety even under adversarial
or misspecified user models and offering practitioners explicit control over
permissible utility loss. The second part turns to single-action bandits, where
various off-policy estimators are unified within a baseline-correction
framework. A closed-form optimal baseline is proposed and shown to minimise
both evaluation and policy-gradient variance, thereby improving off-policy
learning reliability. The final part examines the trade-offs between efficiency
and effectiveness in generative RL. A systematic study of PPO and REINFORCE
motivates the Leave-One-Out PPO (LOOP) algorithm, which combines multiple
diffusion trajectories with a REINFORCE-style baseline inside PPO's clipped
objective. LOOP achieves PPO-level sample efficiency while producing
generations that align more faithfully with textual attributes.
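As a rough illustration of the LOOP idea sketched above, the snippet below computes a PPO-style clipped surrogate in which the advantage of each trajectory uses a leave-one-out baseline over the other trajectories sampled for the same prompt. The function and variable names (`loop_clipped_loss`, `clip_eps`) and the toy numbers are illustrative assumptions, not the dissertation's implementation; a real diffusion fine-tuning loop would compute per-step log-probabilities under the diffusion policy and backpropagate through them.

```python
# Hypothetical sketch of a LOOP-style loss, assuming k >= 2 trajectories
# sampled per prompt; names and values are illustrative only.
import numpy as np

def loop_clipped_loss(logp_new, logp_old, rewards, clip_eps=0.2):
    """PPO-style clipped surrogate with a leave-one-out REINFORCE baseline.

    logp_new, logp_old: log-probabilities of k trajectories for one prompt
        under the current policy and the sampling (old) policy.
    rewards: one scalar reward per trajectory, e.g. a text-image alignment score.
    """
    k = len(rewards)
    rewards = np.asarray(rewards, dtype=float)
    # Leave-one-out baseline: for trajectory i, average the other k-1 rewards.
    baseline = (rewards.sum() - rewards) / (k - 1)
    advantage = rewards - baseline
    # Importance ratio between current and sampling policy, then PPO clipping.
    ratio = np.exp(np.asarray(logp_new) - np.asarray(logp_old))
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    # Pessimistic minimum of the unclipped and clipped surrogates, averaged over k.
    surrogate = np.minimum(ratio * advantage, clipped * advantage)
    return -surrogate.mean()  # negated so minimising the loss maximises the objective

# Example: four trajectories sampled for a single prompt.
loss = loop_clipped_loss(
    logp_new=[-10.2, -11.0, -9.8, -10.5],
    logp_old=[-10.0, -11.1, -10.0, -10.4],
    rewards=[0.62, 0.48, 0.71, 0.55],
)
print(loss)
```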
Submitted
October 17, 2025
Key Contributions
This dissertation develops theory and algorithms for safe, sample-efficient, and robust RL, framed within contextual bandits. It proposes a counterfactual risk-minimisation objective for safe ranking systems, guaranteed not to underperform the logging policy, and extends this guarantee to doubly robust (DR) estimators. For single-action bandits, it unifies off-policy estimators via a baseline-correction framework with a closed-form optimal baseline that minimises both evaluation and policy-gradient variance. For generative RL, it introduces Leave-One-Out PPO (LOOP), which matches PPO's sample efficiency while improving alignment with textual attributes.
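The baseline-correction idea can be made concrete in the simplest setting: inverse-propensity-scoring (IPS) evaluation of a single-action bandit policy from logged data. The sketch below applies one standard variance-minimising closed form, beta* = E[(w^2 - w) r] / E[w^2 - w], estimated from the log; the function name, toy data, and this particular formula are assumptions for illustration, and the dissertation's exact estimator and optimal baseline may differ in detail.

```python
# Minimal sketch of a baseline-corrected IPS estimator; illustrative only.
import numpy as np

def baseline_corrected_ips(w, r):
    """Baseline-corrected IPS value estimate with a variance-minimising baseline.

    w: importance weights pi_target(a|x) / pi_logging(a|x) for logged actions.
    r: observed rewards for those actions.
    The estimator (1/n) * sum_i w_i * (r_i - beta) + beta is unbiased for any
    constant beta (since E[w] = 1 under the logging policy); beta only affects variance.
    """
    w = np.asarray(w, dtype=float)
    r = np.asarray(r, dtype=float)
    # Plug-in estimate of the variance-minimising baseline
    # beta* = E[(w^2 - w) r] / E[w^2 - w].
    coef = w * w - w
    beta = np.sum(coef * r) / np.sum(coef)
    value = np.mean(w * (r - beta)) + beta
    return value, beta

# Toy logged data: importance weights and binary rewards.
rng = np.random.default_rng(0)
w = rng.lognormal(mean=0.0, sigma=0.5, size=1000)
w /= w.mean()  # rescale so the toy weights average to ~1, like true importance weights
r = rng.binomial(1, 0.3, size=1000).astype(float)
value, beta = baseline_corrected_ips(w, r)
print(f"estimate={value:.3f}, optimal baseline={beta:.3f}")
```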
Business Value
Enables the safe and reliable deployment of RL in high-stakes applications such as recommendation systems and content generation, reducing deployment risk while improving user experience and model performance.