Abstract: While conventional wisdom holds that policy gradient methods are better
suited to complex action spaces than action-value methods, foundational work
has shown that the two paradigms are equivalent in small, finite action spaces
(O'Donoghue et al., 2017; Schulman et al., 2017a). This raises the question of
why their computational applicability and performance diverge as the complexity
of the action space increases. We hypothesize that the apparent superiority of
policy gradients in such settings stems not from intrinsic qualities of the
paradigm but from universal principles that can also be applied to action-value
methods, enabling comparable functionality. We identify three such principles and
provide a framework for incorporating them into action-value methods. To
support our hypothesis, we instantiate this framework in what we term QMLE, for
Q-learning with maximum likelihood estimation. Our results show that QMLE can
be applied to complex action spaces at a computational cost comparable to that
of policy gradient methods, all without using policy gradients. Furthermore,
QMLE exhibits strong performance on the DeepMind Control Suite, even when
compared to state-of-the-art methods such as DMPO and D4PG.
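
The abstract does not spell out QMLE's update rules, so the sketch below is only an illustrative guess at the general idea it describes: running Q-learning in a continuous action space while fitting a sampling distribution over actions by maximum likelihood, with no gradient ever flowing through Q (i.e., no policy gradient). The class names (`QNet`, `ProposalPolicy`), the Gaussian proposal, the sample-and-top-K selection, and all hyperparameters are assumptions introduced for this example and are not taken from the paper.

```python
# Illustrative sketch (not the paper's algorithm): continuous-action Q-learning
# where the greedy action is approximated by a Gaussian proposal distribution
# that is trained purely by maximum likelihood on high-Q sampled actions.
import torch
import torch.nn as nn

STATE_DIM, ACTION_DIM = 8, 2          # assumed toy dimensions
GAMMA, N_SAMPLES, TOP_K = 0.99, 32, 4  # assumed hyperparameters


class QNet(nn.Module):
    """State-action value function Q(s, a)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(STATE_DIM + ACTION_DIM, 128), nn.ReLU(), nn.Linear(128, 1))

    def forward(self, s, a):
        return self.net(torch.cat([s, a], dim=-1)).squeeze(-1)


class ProposalPolicy(nn.Module):
    """Gaussian proposal over actions; updated only via maximum likelihood."""
    def __init__(self):
        super().__init__()
        self.net = nn.Linear(STATE_DIM, 2 * ACTION_DIM)

    def forward(self, s):
        mean, log_std = self.net(s).chunk(2, dim=-1)
        return torch.distributions.Normal(mean, log_std.clamp(-5, 2).exp())


def approx_greedy(q, pi, s):
    """Approximate argmax_a Q(s, a) by scoring samples from the proposal."""
    dist = pi(s)
    actions = dist.sample((N_SAMPLES,))                # [N, B, A]
    s_rep = s.unsqueeze(0).expand(N_SAMPLES, -1, -1)   # [N, B, S]
    q_vals = q(s_rep, actions)                         # [N, B]
    best = q_vals.argmax(dim=0)                        # [B]
    idx = best.view(1, -1, 1).expand(1, -1, ACTION_DIM)
    return actions.gather(0, idx).squeeze(0), actions, q_vals


def update(q, pi, q_opt, pi_opt, batch):
    s, a, r, s_next, done = batch

    # Q-learning step: the Bellman target uses the approximate greedy action.
    with torch.no_grad():
        a_next, _, _ = approx_greedy(q, pi, s_next)
        target = r + GAMMA * (1.0 - done) * q(s_next, a_next)
    q_loss = ((q(s, a) - target) ** 2).mean()
    q_opt.zero_grad()
    q_loss.backward()
    q_opt.step()

    # Maximum-likelihood step: fit the proposal to the top-K actions by Q value.
    # The elite actions are detached, so no gradient flows through Q and this
    # is not a policy gradient update.
    with torch.no_grad():
        _, samples, q_vals = approx_greedy(q, pi, s)
        top_idx = q_vals.topk(TOP_K, dim=0).indices                     # [K, B]
        elite = samples.gather(
            0, top_idx.unsqueeze(-1).expand(-1, -1, ACTION_DIM))        # [K, B, A]
    pi_loss = -pi(s).log_prob(elite).sum(-1).mean()
    pi_opt.zero_grad()
    pi_loss.backward()
    pi_opt.step()
    return q_loss.item(), pi_loss.item()


# Minimal usage example with random data.
q, pi = QNet(), ProposalPolicy()
q_opt = torch.optim.Adam(q.parameters(), lr=3e-4)
pi_opt = torch.optim.Adam(pi.parameters(), lr=3e-4)
batch = (torch.randn(16, STATE_DIM), torch.randn(16, ACTION_DIM),
         torch.randn(16), torch.randn(16, STATE_DIM), torch.zeros(16))
print(update(q, pi, q_opt, pi_opt, batch))
```

In this sketch the proposal is only ever trained on the log-likelihood of detached elite actions, so the action-value function remains the sole learned critic and no policy gradient is computed, which is the property the abstract claims for QMLE; the actual principles and framework of the paper may differ.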