📄 Abstract
The classical policy gradient method is the theoretical and conceptual
foundation of modern policy-based reinforcement learning (RL) algorithms. Most
rigorous analyses of such methods, particularly those establishing convergence
guarantees, assume a discount factor $\gamma < 1$. However, a
recent line of work on policy-based RL for large language models uses the
undiscounted total-reward setting with $\gamma = 1$, rendering much of the
existing theory inapplicable. In this paper, we provide analyses of the policy
gradient method for undiscounted expected total-reward infinite-horizon MDPs
based on two key insights: (i) the classification of the MDP states into
recurrent and transient states is invariant over the set of policies that
assign strictly positive probability to every action (as is typical in deep RL
models employing a softmax output layer) and (ii) the classical state
visitation measure (which may be ill-defined when $\gamma = 1$) can be replaced
with a new object that we call the transient visitation measure.
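For context, the classical (discounted) state visitation measure referenced above is usually written, in standard notation (not necessarily the paper's), as

$$ d_\mu^{\pi}(s) = (1-\gamma)\sum_{t=0}^{\infty} \gamma^{t}\, \Pr(s_t = s \mid s_0 \sim \mu,\ \pi), $$

which degenerates at $\gamma = 1$: the normalizer $(1-\gamma)$ vanishes and the unnormalized series diverges on recurrent states. On transient states, however, the expected number of visits

$$ \sum_{t=0}^{\infty} \Pr(s_t = s \mid s_0 \sim \mu,\ \pi) < \infty $$

remains finite by standard Markov chain theory, which is what makes a visitation-type object supported on the transient states plausible; the paper's precise definition of the transient visitation measure is given in the full text.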
Authors (2)
Jongmin Lee
Ernest K. Ryu
Submitted
October 21, 2025
Key Contributions
This paper provides theoretical analyses explaining why policy gradient methods work for undiscounted total-reward MDPs ($\gamma = 1$), the setting used in policy-based RL for large language models. It rests on two key insights: the recurrent/transient classification of states is invariant over policies that assign strictly positive probability to every action (as softmax-parameterized policies do), and the classical state visitation measure can be replaced with a new transient visitation measure, which together enable convergence guarantees.
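To see why the first insight is plausible for a finite MDP with a tabular softmax parameterization (an illustrative assumption, not necessarily the paper's exact setting), note that

$$ \pi_\theta(a \mid s) = \frac{\exp(\theta_{s,a})}{\sum_{a'} \exp(\theta_{s,a'})} > 0 \quad \text{for all } (s,a) \text{ and finite } \theta, $$

so the support of the induced Markov chain,

$$ \Big\{ (s, s') : \sum_{a} \pi_\theta(a \mid s)\, P(s' \mid s, a) > 0 \Big\} = \big\{ (s, s') : P(s' \mid s, a) > 0 \text{ for some } a \big\}, $$

is identical for every policy in this class. For a finite Markov chain, whether a state is recurrent or transient depends only on this reachability structure, so the recurrent/transient classification does not change as the policy parameters vary.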
Business Value
Provides a stronger theoretical foundation for using policy gradient methods in applications like LLM fine-tuning, potentially leading to more stable and effective training of large-scale AI models.