Source: arxiv_ml · Match: 95% · Type: Dissertation / Research Monograph · Audience: Reinforcement learning researchers, AI safety researchers, Machine learning engineers, Researchers in recommender systems · Posted: 2 weeks ago

Safe, Efficient, and Robust Reinforcement Learning for Ranking and Diffusion Models

📄 Abstract

This dissertation investigates how reinforcement learning (RL) methods can be designed to be safe, sample-efficient, and robust. Framed through the unifying perspective of contextual-bandit RL, the work addresses two major application domains: ranking and recommendation, and text-to-image diffusion models. The first part of the thesis develops theory and algorithms for safe deployment in ranking systems. An exposure-based generalisation bound is derived, leading to a counterfactual risk-minimisation objective whose solution is guaranteed not to underperform the logging policy, even with sparse feedback. This guarantee is extended to doubly robust estimators, enabling safety even under adversarial or misspecified user models and offering practitioners explicit control over permissible utility loss. The second part turns to single-action bandits, where various off-policy estimators are unified within a baseline-correction framework. A closed-form optimal baseline is proposed and shown to minimise both evaluation and policy-gradient variance, thereby improving off-policy learning reliability. The final part examines the trade-offs between efficiency and effectiveness in generative RL. A systematic study of PPO and REINFORCE motivates the Leave-One-Out PPO (LOOP) algorithm, which combines multiple diffusion trajectories with a REINFORCE-style baseline inside PPO's clipped objective. LOOP achieves PPO-level sample efficiency while producing generations that align more faithfully with textual attributes.
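The LOOP idea admits a compact illustration. The snippet below is a minimal sketch, not the dissertation's implementation: the function name loop_surrogate, the tensors rewards, logp_new, logp_old, and the clip_eps value are assumptions for this example, and the diffusion-sampler plumbing that produces per-trajectory rewards and log-probabilities is omitted.

import torch

def loop_surrogate(rewards, logp_new, logp_old, clip_eps=0.2):
    """PPO-style clipped objective with a leave-one-out (REINFORCE) baseline.

    rewards, logp_new, logp_old: tensors of shape [B, K], i.e. K sampled
    diffusion trajectories per prompt in a batch of B prompts.
    """
    K = rewards.shape[1]
    # Leave-one-out baseline: mean reward of the other K-1 trajectories per prompt.
    baseline = (rewards.sum(dim=1, keepdim=True) - rewards) / (K - 1)
    advantage = rewards - baseline
    # PPO importance ratio and clipped surrogate.
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * advantage
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantage
    # Maximise the surrogate, so return its negative mean as a loss.
    return -torch.min(unclipped, clipped).mean()

# Toy usage: 4 prompts, 8 trajectories each.
rewards = torch.randn(4, 8)
logp_old = torch.randn(4, 8)
logp_new = (logp_old + 0.05 * torch.randn(4, 8)).requires_grad_()
loss = loop_surrogate(rewards, logp_new, logp_old)
loss.backward()

Because each trajectory's baseline excludes its own reward, the baseline is independent of that trajectory, which is the usual motivation for a leave-one-out correction; LOOP reuses this REINFORCE-style baseline inside PPO's clipped objective rather than a learned value function.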
Authors (1): Shashank Gupta
Submitted: October 17, 2025
arXiv Category: cs.LG

Key Contributions

This dissertation develops theory and algorithms for safe, sample-efficient, and robust RL, framed within contextual bandits. It proposes a counterfactual risk-minimisation objective for safe ranking systems, guaranteed not to underperform the logging policy, and extends this guarantee to doubly robust (DR) estimators. For single-action bandits, it unifies off-policy estimators via a baseline-correction framework with a closed-form optimal baseline that minimises both evaluation and policy-gradient variance. It also introduces Leave-One-Out PPO (LOOP) for RL fine-tuning of text-to-image diffusion models, matching PPO's sample efficiency while improving alignment with textual attributes.
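To make the baseline-correction idea concrete, here is a rough sketch under standard importance-weighting assumptions. The control-variate form of IPS and the plug-in variance-minimising baseline below are one common construction; the dissertation's exact estimator family and closed-form baseline may differ, and all names and the toy data are illustrative.

import numpy as np

def baseline_corrected_ips(rewards, weights, beta):
    """Baseline-corrected IPS: mean(w * (r - beta)) + beta.

    Unbiased for any constant beta whenever E[w] = 1, since the correction
    term beta * (1 - w) then has zero mean.
    """
    return np.mean(weights * (rewards - beta)) + beta

def optimal_baseline(rewards, weights):
    """Plug-in estimate of the variance-minimising baseline for the
    control variate (w - 1): beta* = Cov(w * r, w) / Var(w)."""
    cov = np.cov(weights * rewards, weights, ddof=1)[0, 1]
    var = np.var(weights, ddof=1)
    return cov / var if var > 0 else 0.0

# Toy logged-bandit data with importance weights w = pi_target / pi_logging.
rng = np.random.default_rng(0)
weights = rng.lognormal(mean=0.0, sigma=0.5, size=10_000)
weights /= weights.mean()                  # so E[w] is ~1 in this toy example
rewards = rng.binomial(1, 0.3, size=10_000).astype(float)

beta = optimal_baseline(rewards, weights)
value_estimate = baseline_corrected_ips(rewards, weights, beta)

Setting beta = 0 recovers plain IPS; a variance-minimising beta damps the noise contributed by heavy-tailed importance weights, which is the reliability benefit the baseline-correction framework targets.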

Business Value

Enables safe and reliable deployment of RL in high-stakes applications such as recommendation systems and content generation, reducing the risk of deploying policies that underperform the incumbent system while improving user experience and model performance.