📄 Abstract
Retrieval-Augmented Generation (RAG) systems and web agents are increasingly
evaluated on multi-hop deep search tasks, yet current practice suffers from two
major limitations. First, most benchmarks leak the reasoning path in the
question text, allowing models to follow surface cues rather than discover
reasoning chains autonomously. Second, evaluation is typically reduced to a
single pass rate, which collapses diverse behaviours into one score and
obscures whether failures stem from inadequate search, poor knowledge use, or
inappropriate refusal. To address these issues, we present WebDetective, a
benchmark of hint-free multi-hop questions paired with a controlled Wikipedia
sandbox that ensures full traceability of model actions, and a holistic
evaluation framework that separates search sufficiency, knowledge utilisation,
and refusal behaviour. Our evaluation of 25 state-of-the-art models reveals
systematic weaknesses across all architectures: models struggle with knowledge
utilisation even when they have retrieved sufficient evidence, and almost
never refuse appropriately when evidence is lacking. These patterns expose a
fundamental gap: today's systems excel at executing given reasoning paths but
fail when required to discover them. We develop an agentic workflow,
EvidenceLoop, that explicitly targets the challenges our benchmark identifies,
incorporating verification loops and systematic evidence tracking that improve
both search and synthesis capabilities. This baseline demonstrates that
WebDetective's diagnostic framework can guide concrete architectural
improvements, establishing our benchmark as a critical tool for developing
genuinely autonomous reasoning systems rather than pattern-following agents.
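The abstract describes EvidenceLoop only at a high level. As a minimal Python sketch, assuming hypothetical `search`, `answer`, and `verify` components, a verification loop with explicit evidence tracking might look like the following; none of these names are the paper's actual interface.

```python
# Hypothetical sketch of a verification-loop agent in the spirit of
# EvidenceLoop as described in the abstract. All names (search, answer,
# verify, EvidenceTracker) are illustrative assumptions, not the paper's API.
from dataclasses import dataclass, field

@dataclass
class Evidence:
    source: str   # e.g. a page title in the Wikipedia sandbox
    passage: str  # the retrieved text span

@dataclass
class EvidenceTracker:
    items: list[Evidence] = field(default_factory=list)

    def add(self, ev: Evidence) -> None:
        self.items.append(ev)

def evidence_loop(question: str, search, answer, verify, max_rounds: int = 5):
    """Iteratively search, accumulate evidence, and verify before answering.

    `search`, `answer`, and `verify` are placeholder callables standing in
    for the retrieval, synthesis, and self-check components.
    """
    tracker = EvidenceTracker()
    for _ in range(max_rounds):
        # Expand the evidence pool based on what has been gathered so far.
        for ev in search(question, tracker.items):
            tracker.add(ev)
        candidate = answer(question, tracker.items)
        # Only commit to an answer the verifier accepts against the evidence.
        if verify(question, candidate, tracker.items):
            return candidate
    # Appropriate refusal: evidence never became sufficient.
    return None
```

The loop makes the benchmark's two failure modes explicit: synthesis is gated on a verification pass over tracked evidence, and exhausting the budget yields a refusal rather than an unsupported guess.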
Key Contributions
Introduces WebDetective, a benchmark for evaluating LLMs and web agents on hint-free multi-hop deep search tasks. It pairs a controlled Wikipedia sandbox with a holistic evaluation framework that separates search sufficiency, knowledge utilisation, and refusal behaviour, addressing the limitations of current benchmarks and single-score evaluation metrics (see the sketch below).
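To make the three-axis decomposition concrete, here is a hedged sketch of how per-episode outcomes could be separated into the three signals instead of one pass rate. The field names and scoring rules are assumptions for exposition, not WebDetective's actual metrics.

```python
# Illustrative decomposition of a single evaluation episode into the three
# diagnostic axes named above. Field names and rules are assumptions.
from dataclasses import dataclass

@dataclass
class EpisodeResult:
    retrieved_all_gold_pages: bool  # did search surface every needed fact?
    answered_correctly: bool        # did the final answer match the reference?
    refused: bool                   # did the model decline to answer?
    answerable: bool                # is the question answerable in the sandbox?

def diagnose(r: EpisodeResult) -> dict[str, bool | None]:
    return {
        # Search sufficiency: was the required evidence ever retrieved?
        "search_sufficient": r.retrieved_all_gold_pages,
        # Knowledge utilisation: given sufficient evidence, was it used well?
        # Undefined (None) when the evidence was never retrieved.
        "knowledge_utilised": (
            r.answered_correctly if r.retrieved_all_gold_pages else None
        ),
        # Refusal behaviour: refuse if and only if the question is unanswerable.
        "refusal_appropriate": r.refused == (not r.answerable),
    }
```

Under this kind of scoring, a wrong answer after complete retrieval is attributed to knowledge use, not search, and declining to answer an unanswerable question is credited rather than penalised.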
Business Value
Enables more accurate and reliable evaluation of AI systems for complex information retrieval and reasoning tasks, informing the development of better search engines, AI assistants, and autonomous agents.