Redirecting to original paper in 30 seconds...
Click below to go immediately or wait for automatic redirect
📄 Abstract
Abstract: Developing Large Language Models (LLMs) to cooperate and compete effectively
within multi-agent systems is a critical step towards more advanced
intelligence. While reinforcement learning (RL) has proven effective for
enhancing reasoning in single-agent tasks, its extension to multi-turn,
multi-agent scenarios remains underexplored due to the challenges of
long-horizon credit assignment and agent-specific advantage estimation. To
address these challenges, we introduce MARS, an end-to-end RL framework that
incentivizes Multi-Agent Reasoning of LLMs through Self-play in both
cooperative and competitive games. MARS features a turn-level advantage
estimator that aligns learning signals with each interaction for credit
assignment, and an agent-specific advantage normalization to stabilize
multi-agent training. By learning with self-play across cooperative and
competitive games, the MARS agent trained from Qwen3-4B develops strong
strategic abilities that generalize to held-out games with up to 28.7%
performance improvements. More importantly, the capability acquired through
self-play generalizes beyond games, yielding consistent performance gains of
multi-agent systems in reasoning benchmarks. When integrated into leading
multi-agent systems, our MARS agent achieves significant performance gains of
10.0% on AIME and 12.5% on GPQA-Diamond. These results establish end-to-end RL
training with self-play in strategic games as a powerful approach for
developing generalizable multi-agent reasoning capabilities in LLMs. Our code
and models are publicly available at https://github.com/thu-nics/MARS.
Authors (13)
Huining Yuan
Zelai Xu
Zheyue Tan
Xiangmin Yi
Mo Guang
Kaiwen Long
+7 more
Submitted
October 17, 2025
Key Contributions
MARS is an end-to-end RL framework that enhances Multi-Agent Reasoning of LLMs through self-play in strategic games. It addresses challenges in MARL with a turn-level advantage estimator for credit assignment and agent-specific advantage normalization for stable training, enabling LLMs to develop strong strategic abilities.
Business Value
Enables the development of more sophisticated multi-agent systems, crucial for applications like autonomous vehicle coordination, complex game AI, and collaborative robotics.