Abstract
In reinforcement learning, unsupervised pre-training refers to pre-training a policy without a priori access to the task specification, i.e., the rewards, so that it can later be employed for efficient learning of downstream tasks. In single-agent settings, the problem has been extensively studied and is mostly understood. A popular approach, called task-agnostic exploration, casts the unsupervised objective as maximizing the entropy of the state distribution induced by the agent's policy, from which principles and methods follow.
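As a point of reference (this formula is not quoted from the paper, and the notation below is assumed), the task-agnostic exploration objective described here is standardly written as maximizing the entropy of the induced state distribution:

$$\max_{\pi} \; H\!\left(d^{\pi}\right), \qquad H(d) = -\sum_{s \in \mathcal{S}} d(s)\,\log d(s),$$

where $d^{\pi}$ denotes the state distribution induced by policy $\pi$ over the state space $\mathcal{S}$; the paper may instead work with discounted, finite-horizon, or average-state occupancies.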
In contrast, little is known about unsupervised pre-training in multi-agent settings, which are ubiquitous in the real world. What are the pros and cons of alternative problem formulations in this setting? How hard is the problem in theory, and how can we solve it in practice? In this paper, we address these questions by first characterizing those alternative formulations and highlighting how the problem, even when tractable in theory, is non-trivial in practice. Then, we present a scalable, decentralized, trust-region policy search algorithm to address the problem in practical settings. Finally, we provide numerical validations to both corroborate the theoretical findings and pave the way for unsupervised multi-agent reinforcement learning via task-agnostic exploration in challenging domains, showing that optimizing for a specific objective, namely the mixture entropy, provides an excellent trade-off between tractability and performance.
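The abstract does not define the mixture entropy formally; as a hedged sketch under the standard reading in this literature (the per-agent distributions and the uniform weighting below are assumptions), with $n$ agents inducing per-agent state distributions $d^{\pi_1}, \dots, d^{\pi_n}$, the objective maximizes the entropy of their uniform mixture:

$$\max_{\pi_1, \dots, \pi_n} \; H\!\left( \frac{1}{n} \sum_{i=1}^{n} d^{\pi_i} \right).$$

By concavity of entropy, this upper-bounds the average of the per-agent entropies $\frac{1}{n}\sum_{i} H(d^{\pi_i})$, while remaining a function of marginal (per-agent) distributions rather than the joint state space, a plausible source of the tractability/performance trade-off the authors highlight.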
Authors (3)
Riccardo Zamboni
Mirco Mutti
Marcello Restelli
Submitted
February 12, 2025
Key Contributions
This paper addresses the under-explored problem of unsupervised multi-agent reinforcement learning by characterizing alternative problem formulations and analyzing their theoretical tractability and practical challenges. It aims to provide a principled understanding of how to pre-train policies in multi-agent settings without explicit reward signals, which is crucial for efficient learning of subsequent tasks.
Business Value
Enables more efficient training of multi-agent systems in complex environments where explicit reward design is difficult, potentially leading to more capable autonomous systems in areas like logistics or swarm robotics.