Abstract
Accurate detection of errors in large language model (LLM) responses is central to the success of scalable oversight, i.e., providing effective supervision to superhuman intelligence. Yet self-diagnosis is often unreliable on complex tasks unless aided by reliable external feedback. Multi-agent debate (MAD) seems a natural alternative to external feedback: multiple LLMs provide complementary perspectives and cross-checks for error detection. However, prior MAD protocols frame debate as a zero-sum game, where the debaters compete to win rather than to seek the truth. This leads to debate hacking: debaters tend to mislead the judge by misinterpreting the task or presenting overconfident claims, which introduces more mistakes and causes MAD to underperform single-agent methods. To mitigate this issue, we introduce a new collaborative MAD protocol, termed ColMAD, that reframes MAD as a non-zero-sum game. Specifically, ColMAD encourages multiple agents to criticize each other in a supportive way, so that they can complement each other's missing points. The judge agent can therefore draw a more informed conclusion based on more comprehensive evidence. Empirically, we show that ColMAD significantly outperforms previous competitive MAD by 19% and brings non-trivial improvements over single-agent methods in error detection.
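The abstract describes ColMAD only at a high level. The sketch below illustrates what such a collaborative debate loop might look like in practice; the `call_model` interface, the prompt wording, and the round structure are assumptions for illustration, not the paper's actual protocol or prompts.

```python
from typing import Callable, List


def collaborative_debate(
    call_model: Callable[[str], str],  # hypothetical LLM interface: prompt -> response
    task: str,
    candidate_answer: str,
    n_critics: int = 2,
    n_rounds: int = 2,
) -> str:
    """Critics supportively critique the candidate answer and each other's
    critiques, then a judge aggregates all critiques into a final verdict."""
    critiques: List[str] = []
    for _ in range(n_rounds):
        new_critiques = []
        for _ in range(n_critics):
            # Each critic sees prior critiques and is asked to add what is
            # missing rather than to argue against the other critics.
            prompt = (
                f"Task:\n{task}\n\nCandidate answer:\n{candidate_answer}\n\n"
                "Critiques so far:\n" + "\n".join(critiques or ["(none)"]) + "\n\n"
                "You are a collaborative critic. Do not argue to win; instead, "
                "point out any errors or missing considerations that the other "
                "critiques have not covered yet, and acknowledge valid points."
            )
            new_critiques.append(call_model(prompt))
        critiques.extend(new_critiques)

    # The judge sees the full pool of complementary critiques as evidence.
    judge_prompt = (
        f"Task:\n{task}\n\nCandidate answer:\n{candidate_answer}\n\n"
        "Collected critiques:\n" + "\n\n".join(critiques) + "\n\n"
        "Based on all the evidence above, decide whether the candidate answer "
        "contains an error. Answer 'correct' or 'incorrect' with a brief reason."
    )
    return call_model(judge_prompt)
```

The key design choice this sketch tries to capture is that critics accumulate complementary evidence instead of competing for a winning position, and the judge decides from the pooled critiques rather than from adversarial arguments.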
Authors (5)
Yongqiang Chen
Gang Niu
James Cheng
Bo Han
Masashi Sugiyama
Submitted
October 23, 2025
Key Contributions
Introduces ColMAD, a collaborative multi-agent debate protocol that reframes MAD as a non-zero-sum game to mitigate "debate hacking" in LLM error detection. ColMAD encourages agents to criticize each other constructively, leading to more accurate error detection for scalable oversight.
Business Value
Enables more reliable and scalable methods for evaluating and improving LLMs, which is critical for deploying advanced AI systems safely and effectively. It helps ensure the quality and trustworthiness of AI outputs.