Abstract
We present a non-asymptotic convergence analysis of $Q$-learning and
actor-critic algorithms for robust average-reward Markov Decision Processes
(MDPs) under contamination, total-variation (TV) distance, and Wasserstein
uncertainty sets. A key ingredient of our analysis is showing that the optimal
robust $Q$ operator is a strict contraction with respect to a carefully
designed semi-norm (with constant functions quotiented out). This property
enables a stochastic approximation update that learns the optimal robust
$Q$-function using $\tilde{\mathcal{O}}(\epsilon^{-2})$ samples. We also
provide an efficient routine for robust $Q$-function estimation, which in turn
facilitates robust critic estimation. Building on this, we introduce an
actor-critic algorithm that learns an $\epsilon$-optimal robust policy within
$\tilde{\mathcal{O}}(\epsilon^{-2})$ samples. We provide numerical simulations
to evaluate the performance of our algorithms.
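
For intuition only (an illustrative sketch, not the paper's exact construction): a standard semi-norm that quotients out constant functions is the span semi-norm, and the contraction claim above can be read as a bound of the following form. Here $\mathcal{T}$ denotes the optimal robust $Q$ operator and $\beta < 1$ a contraction modulus; the specific choice of semi-norm and the symbols $\mathcal{T}$, $\beta$ are notational assumptions of this sketch.
\[
  \lVert f \rVert_{\mathrm{sp}} \;=\; \max_{x} f(x) - \min_{x} f(x),
  \qquad
  \lVert f + c\,\mathbf{1} \rVert_{\mathrm{sp}} = \lVert f \rVert_{\mathrm{sp}}
  \quad \forall\, c \in \mathbb{R},
\]
\[
  \lVert \mathcal{T} Q_1 - \mathcal{T} Q_2 \rVert_{\mathrm{sp}}
  \;\le\; \beta \, \lVert Q_1 - Q_2 \rVert_{\mathrm{sp}},
  \qquad \beta < 1.
\]
A strict contraction in such a semi-norm pins down the $Q$-function only up to an additive constant, which is the natural notion of uniqueness in the average-reward setting and is what enables the stochastic approximation analysis described above.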