Q-learning with Posterior Sampling

Abstract

Bayesian posterior sampling techniques have demonstrated superior empirical performance in many exploration-exploitation settings. However, their theoretical analysis remains a challenge, especially in complex settings like reinforcement learning. In this paper, we introduce Q-Learning with Posterior Sampling (PSQL), a simple Q-learning-based algorithm that uses Gaussian posteriors on Q-values for exploration, akin to the popular Thompson Sampling algorithm in the multi-armed bandit setting. We show that in the tabular episodic MDP setting, PSQL achieves a regret bound of $\tilde O(H^2\sqrt{SAT})$, closely matching the known lower bound of $\Omega(H\sqrt{SAT})$. Here, $S$ and $A$ denote the number of states and actions in the underlying Markov Decision Process (MDP), and $T = KH$, where $K$ is the number of episodes and $H$ is the planning horizon. Our work provides several new technical insights into the core challenges of combining posterior sampling with dynamic programming and TD-learning-based RL algorithms, along with novel ideas for resolving those difficulties. We hope this will form a starting point for analyzing this efficient and important algorithmic technique in even more complex RL settings.
Authors (3)
Priyank Agrawal
Shipra Agrawal
Azmat Azati
Submitted
June 1, 2025
arXiv Category
cs.LG

Key Contributions

- PSQL, a simple Q-learning-based algorithm that explores by sampling Q-values from Gaussian posteriors, in the spirit of Thompson Sampling for multi-armed bandits.
- A regret bound of $\tilde O(H^2\sqrt{SAT})$ for PSQL in the tabular episodic MDP setting, closely matching the known $\Omega(H\sqrt{SAT})$ lower bound.
- New technical insights into the core challenges of combining posterior sampling with dynamic programming and TD-learning-based RL algorithms, along with ideas for resolving them.
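
To make the exploration mechanism concrete, below is a minimal sketch of the idea the abstract describes: tabular Q-learning where the agent samples Q-values from a per-(step, state, action) Gaussian posterior and acts greedily on the sample. The environment interface, the prior scale `sigma0`, the optimistic initialization, and the $1/n$ step size are illustrative assumptions here, not the paper's exact algorithm or constants.

```python
# Hedged sketch of the posterior-sampling idea from the abstract: Q-learning
# where exploration comes from Gaussian draws around the current Q estimates.
# Names, the prior/variance schedule, and the update rule are assumptions.

import numpy as np

def psql_sketch(env, S, A, H, K, sigma0=1.0):
    """Run K episodes of horizon H on a tabular episodic MDP `env`
    (assumed interface: env.reset() -> s, env.step(h, s, a) -> (r, s_next))."""
    rng = np.random.default_rng(0)
    mu = np.full((H, S, A), float(H))  # optimistic posterior means over Q-values
    n = np.zeros((H, S, A))            # visit counts per (step, state, action)

    for _ in range(K):
        s = env.reset()
        for h in range(H):
            # Posterior width shrinks with visits; sigma0 is an assumed prior scale.
            std = sigma0 / np.sqrt(np.maximum(n[h, s], 1))
            q_sample = rng.normal(mu[h, s], std)  # Thompson-style draw per action
            a = int(np.argmax(q_sample))          # act greedily on the sample
            r, s_next = env.step(h, s, a)

            n[h, s, a] += 1
            alpha = 1.0 / n[h, s, a]              # simple decaying step size
            target = r + (mu[h + 1, s_next].max() if h + 1 < H else 0.0)
            mu[h, s, a] += alpha * (target - mu[h, s, a])  # TD update of the mean
            s = s_next
    return mu
```

The key design choice the abstract highlights is that randomness enters only through the posterior draw used for action selection, while the underlying update remains ordinary TD-learning on the posterior means.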