Abstract
Accurate detection of errors in large language model (LLM) responses is central to the success of scalable oversight, i.e., providing effective supervision to superhuman intelligence. Yet self-diagnosis is often unreliable on complex tasks unless aided by reliable external feedback. Multi-agent debate (MAD) seems a natural alternative to external feedback: multiple LLMs provide complementary perspectives and cross-checks for error detection. However, prior MAD protocols frame debate as a zero-sum game, where the debaters compete to win rather than to seek the truth. This leads to debate hacking: debaters tend to mislead the judge by misinterpreting the task or presenting overconfident claims, which introduces more mistakes and causes MAD to underperform single-agent methods. To mitigate this issue, we introduce a new collaborative MAD protocol, termed ColMAD, that reframes MAD as a non-zero-sum game. Specifically, ColMAD encourages multiple agents to criticize each other in a supportive way, so that they can complement each other's missing points. The judge agent can therefore draw a more informed conclusion based on more comprehensive evidence. Empirically, we show that ColMAD significantly outperforms previous competitive MAD by 19% and brings non-trivial improvements over single-agent methods in error detection.
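The abstract describes ColMAD only at a high level. The sketch below illustrates what such a collaborative debate loop might look like in practice; the `call_model` interface, the prompt wording, and the round structure are assumptions for illustration, not the paper's actual protocol or prompts.

```python
from typing import Callable, List


def collaborative_debate(
    call_model: Callable[[str], str],  # hypothetical LLM interface: prompt -> response
    task: str,
    candidate_answer: str,
    n_critics: int = 2,
    n_rounds: int = 2,
) -> str:
    """Critics supportively critique the candidate answer and each other's
    critiques, then a judge aggregates all critiques into a final verdict."""
    critiques: List[str] = []
    for _ in range(n_rounds):
        new_critiques = []
        for _ in range(n_critics):
            # Each critic sees prior critiques and is asked to add what is
            # missing rather than to argue against the other critics.
            prompt = (
                f"Task:\n{task}\n\nCandidate answer:\n{candidate_answer}\n\n"
                "Critiques so far:\n" + "\n".join(critiques or ["(none)"]) + "\n\n"
                "You are a collaborative critic. Do not argue to win; instead, "
                "point out any errors or missing considerations that the other "
                "critiques have not covered yet, and acknowledge valid points."
            )
            new_critiques.append(call_model(prompt))
        critiques.extend(new_critiques)

    # The judge sees the full pool of complementary critiques as evidence.
    judge_prompt = (
        f"Task:\n{task}\n\nCandidate answer:\n{candidate_answer}\n\n"
        "Collected critiques:\n" + "\n\n".join(critiques) + "\n\n"
        "Based on all the evidence above, decide whether the candidate answer "
        "contains an error. Answer 'correct' or 'incorrect' with a brief reason."
    )
    return call_model(judge_prompt)
```

The key design choice this sketch tries to capture is that critics accumulate complementary evidence instead of competing for a winning position, and the judge decides from the pooled critiques rather than from adversarial arguments.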
Authors (5)
Yongqiang Chen
Gang Niu
James Cheng
Bo Han
Masashi Sugiyama
Submitted
October 23, 2025
Key Contributions
Introduces ColMAD, a collaborative multi-agent debate protocol that reframes MAD as a non-zero-sum game to mitigate "debate hacking" in LLM error detection. ColMAD encourages agents to criticize each other constructively, leading to more accurate error detection for scalable oversight.
Business Value
Enables more reliable and scalable methods for evaluating and improving LLMs, which is critical for deploying advanced AI systems safely and effectively. It helps ensure the quality and trustworthiness of AI outputs.