📄 Abstract
Diffusion policies are competitive for offline reinforcement learning (RL)
but are typically guided at sampling time by heuristics that lack a statistical
notion of risk. We introduce LRT-Diffusion, a risk-aware sampling rule that
treats each denoising step as a sequential hypothesis test between the
unconditional prior and the state-conditional policy head. Concretely, we
accumulate a log-likelihood ratio and gate the conditional mean with a logistic
controller whose threshold tau is calibrated once under H0 to meet a
user-specified Type-I level alpha. This turns guidance from a fixed push into
an evidence-driven adjustment with a user-interpretable risk budget.
Importantly, we deliberately leave training vanilla (two heads with standard
epsilon-prediction) within the standard DDPM framework. LRT guidance composes
naturally with Q-gradients: critic-gradient updates can be taken at the
unconditional mean, at the LRT-gated mean, or a blend, exposing a continuum
from exploitation to conservatism. We standardize states and actions
consistently at train and test time and report a state-conditional
out-of-distribution (OOD) metric alongside return. On D4RL MuJoCo tasks,
LRT-Diffusion improves the return-OOD trade-off over strong Q-guided baselines
in our implementation while honoring the desired alpha. Theoretically, we
establish level-alpha calibration, concise stability bounds, and a return
comparison showing when LRT surpasses Q-guidance, especially when off-support
errors dominate. Overall, LRT-Diffusion is a drop-in, inference-time method
that adds principled, calibrated risk control to diffusion policies for offline
RL.
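The gating rule can be pictured with a minimal sketch (not the authors' code), assuming Gaussian epsilon-prediction heads under DDPM, a log-likelihood ratio accumulated across denoising steps, and a logistic gate of the form sigmoid((LLR - tau) / temp); the variable names and the exact gate shape are illustrative assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lrt_gated_step(x_t, t, eps_uncond, eps_cond, llr, tau,
                   alphas_cumprod, betas, temp=1.0, rng=None):
    """One DDPM reverse step with the conditional correction gated by the
    accumulated log-likelihood ratio (llr) tested against threshold tau.
    Illustrative only: the gate form and LLR statistic are assumptions."""
    rng = np.random.default_rng() if rng is None else rng
    a_bar, beta = alphas_cumprod[t], betas[t]
    alpha = 1.0 - beta

    # Posterior means implied by the unconditional and conditional eps heads.
    coef = beta / np.sqrt(1.0 - a_bar)
    mu_uncond = (x_t - coef * eps_uncond) / np.sqrt(alpha)
    mu_cond = (x_t - coef * eps_cond) / np.sqrt(alpha)

    # Logistic controller: g -> 1 when the evidence for the conditional head
    # exceeds tau, g -> 0 otherwise, so guidance scales with evidence.
    g = sigmoid((llr - tau) / temp)
    mu = mu_uncond + g * (mu_cond - mu_uncond)

    # Draw x_{t-1} from the gated reverse kernel (variance beta_t here).
    noise = rng.standard_normal(x_t.shape) if t > 0 else np.zeros_like(x_t)
    x_prev = mu + np.sqrt(beta) * noise

    # Sequential LLR update: how much better the conditional kernel explains
    # the drawn sample than the unconditional one (shared variance beta).
    llr = llr + (np.sum((x_prev - mu_uncond) ** 2)
                 - np.sum((x_prev - mu_cond) ** 2)) / (2.0 * beta)
    return x_prev, llr
```

A Q-guided variant would add a critic-gradient term at mu_uncond, at the gated mean, or at a blend of the two, which is the continuum from exploitation to conservatism described above.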
Submitted
October 28, 2025
Key Contributions
LRT-Diffusion introduces a risk-aware sampling rule for diffusion policies in offline RL that treats each denoising step as a sequential hypothesis test between the unconditional prior and the state-conditional policy head. Guidance is calibrated to a user-specified Type-I error level, turning it into an evidence-driven adjustment with an interpretable risk budget that composes naturally with Q-gradients.
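As a rough sketch of the level-alpha calibration (under the assumption that tau is set from an empirical quantile of H0 statistics), one can roll out the sampler with guidance disabled, record the accumulated LLR per trajectory, and take the (1 - alpha) quantile; simulate_h0_llrs below is a hypothetical helper.

```python
import numpy as np

def calibrate_tau(h0_llrs, alpha=0.05):
    """Set tau so that the gate opens with probability ~alpha under H0.
    h0_llrs: accumulated LLRs from rollouts of the unconditional prior."""
    return float(np.quantile(np.asarray(h0_llrs), 1.0 - alpha))

# Example usage (simulate_h0_llrs is hypothetical):
# tau = calibrate_tau(simulate_h0_llrs(n=1000), alpha=0.05)
```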
Business Value
Enables the development of safer and more reliable autonomous systems by providing a principled way to manage risk during policy learning and execution. This is crucial for applications where safety is paramount, such as robotics and autonomous driving.