
Why Policy Gradient Algorithms Work for Undiscounted Total-Reward MDPs

Abstract

The classical policy gradient method is the theoretical and conceptual foundation of modern policy-based reinforcement learning (RL) algorithms. Most rigorous analyses of such methods, particularly those establishing convergence guarantees, assume a discount factor $\gamma < 1$. In contrast, a recent line of work on policy-based RL for large language models uses the undiscounted total-reward setting with $\gamma = 1$, rendering much of the existing theory inapplicable. In this paper, we provide analyses of the policy gradient method for undiscounted expected total-reward infinite-horizon MDPs based on two key insights: (i) the classification of the MDP states into recurrent and transient states is invariant over the set of policies that assign strictly positive probability to every action (as is typical in deep RL models employing a softmax output layer) and (ii) the classical state visitation measure (which may be ill-defined when $\gamma = 1$) can be replaced with a new object that we call the transient visitation measure.
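For background (standard material, not an equation taken from the paper itself): the discounted policy gradient theorem expresses the gradient through a state visitation measure,

$$\nabla_\theta J(\theta) = \frac{1}{1-\gamma}\, \mathbb{E}_{s \sim d^{\pi_\theta}_\gamma,\ a \sim \pi_\theta(\cdot \mid s)}\big[\nabla_\theta \log \pi_\theta(a \mid s)\, Q^{\pi_\theta}(s, a)\big], \qquad d^{\pi_\theta}_\gamma(s) = (1-\gamma) \sum_{t=0}^{\infty} \gamma^t \Pr(s_t = s).$$

Setting $\gamma = 1$ makes the normalizing prefactor vanish while the series $\sum_t \Pr(s_t = s)$ diverges at recurrent states, which is the sense in which the classical visitation measure can be ill-defined in the total-reward setting.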
Authors (2)
Jongmin Lee
Ernest K. Ryu
Submitted
October 21, 2025
arXiv Category
cs.LG

Key Contributions

This paper provides theoretical analyses explaining why policy gradient methods work for undiscounted total-reward MDPs ($\gamma = 1$), the setting used when fine-tuning LLMs with policy-based RL. It rests on two key insights: the recurrent/transient classification of states is invariant across all policies that assign strictly positive probability to every action, and the classical state visitation measure can be replaced with the paper's transient visitation measure (sketched below), enabling convergence guarantees.
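The paper's precise definition is not reproduced here, but a natural reading (an assumption on our part) is that the transient visitation measure counts expected visits restricted to transient states,

$$\nu^{\pi}(s) = \sum_{t=0}^{\infty} \Pr(s_t = s), \qquad s \text{ transient under } \pi,$$

which is finite because a transient state is visited only finitely many times in expectation. Since insight (i) guarantees the recurrent/transient split is the same for every strictly positive (e.g., softmax) policy, this object remains well-defined along the entire optimization path.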

Business Value

Provides a stronger theoretical foundation for using policy gradient methods in applications like LLM fine-tuning, potentially leading to more stable and effective training of large-scale AI models.
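To make the undiscounted objective concrete, here is a minimal REINFORCE-style sketch with $\gamma = 1$ and a softmax policy on a toy episodic chain MDP. The environment, constants, and variable names are illustrative assumptions, not the paper's algorithm:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy chain MDP (hypothetical): states 0..3, state 3 absorbing/terminal.
# Action 1 moves right, action 0 stays; -0.1 living cost per step,
# +1.0 bonus on reaching the terminal state.
N_STATES, N_ACTIONS, TERMINAL = 4, 2, 3

def step(s, a):
    s2 = s + 1 if a == 1 else s
    return s2, -0.1 + (1.0 if s2 == TERMINAL else 0.0)

def softmax(z):
    z = z - z.max()           # stabilized softmax
    p = np.exp(z)
    return p / p.sum()

theta = np.zeros((N_STATES, N_ACTIONS))  # per-state softmax logits
lr = 0.1

for episode in range(2000):
    # Sample one episode under the current policy (capped to avoid
    # unbounded episodes while the policy is still poor).
    s, traj = 0, []
    while s != TERMINAL and len(traj) < 100:
        a = rng.choice(N_ACTIONS, p=softmax(theta[s]))
        s2, r = step(s, a)
        traj.append((s, a, r))
        s = s2

    # Undiscounted returns-to-go: G_t = r_t + r_{t+1} + ...
    # (gamma = 1, no discounting anywhere in the update).
    G, returns = 0.0, []
    for (_, _, r) in reversed(traj):
        G += r
        returns.append(G)
    returns.reverse()

    # REINFORCE: ascend G_t * grad log pi(a_t | s_t); for softmax logits,
    # grad log pi(a) = one_hot(a) - pi.
    for (s_t, a_t, _), G_t in zip(traj, returns):
        grad_logp = -softmax(theta[s_t])
        grad_logp[a_t] += 1.0
        theta[s_t] += lr * G_t * grad_logp

print(softmax(theta[0]))  # the "move right" action should dominate
```

A baseline or advantage estimate would normally be subtracted from $G_t$ to reduce variance; it is omitted here to keep the sketch focused on the absence of a discount factor.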