📄 Abstract
Retrieval-Augmented Generation (RAG) systems and web agents are increasingly
evaluated on multi-hop deep search tasks, yet current practice suffers from two
major limitations. First, most benchmarks leak the reasoning path in the
question text, allowing models to follow surface cues rather than discover
reasoning chains autonomously. Second, evaluation is typically reduced to a
single pass rate, which collapses diverse behaviours into one score and
obscures whether failures stem from inadequate search, poor knowledge use, or
inappropriate refusal. To address these issues, we present WebDetective, a
benchmark of hint-free multi-hop questions paired with a controlled Wikipedia
sandbox that ensures full traceability of model actions, and a holistic
evaluation framework that separates search sufficiency, knowledge utilisation,
and refusal behaviour. Our evaluation of 25 state-of-the-art models reveals
systematic weaknesses across all architectures: models struggle with knowledge
utilisation even when they have retrieved sufficient evidence, and almost
never refuse appropriately when evidence is lacking. These patterns expose a
fundamental gap: today's systems excel at executing given reasoning paths but
fail when required to discover them. We develop an agentic workflow,
EvidenceLoop, that explicitly targets the challenges our benchmark identifies,
incorporating verification loops and systematic evidence tracking that improve
both search and synthesis capabilities. This baseline demonstrates that
WebDetective's diagnostic framework can guide concrete architectural
improvements, establishing our benchmark as a critical tool for developing
genuinely autonomous reasoning systems rather than pattern-following agents.
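The abstract describes EvidenceLoop only at a high level. As a minimal Python sketch, assuming hypothetical `search`, `answer`, and `verify` components, a verification loop with explicit evidence tracking might look like the following; none of these names are the paper's actual interface.

```python
# Hypothetical sketch of a verification-loop agent in the spirit of
# EvidenceLoop as described in the abstract. All names (search, answer,
# verify, EvidenceTracker) are illustrative assumptions, not the paper's API.
from dataclasses import dataclass, field

@dataclass
class Evidence:
    source: str   # e.g. a page title in the Wikipedia sandbox
    passage: str  # the retrieved text span

@dataclass
class EvidenceTracker:
    items: list[Evidence] = field(default_factory=list)

    def add(self, ev: Evidence) -> None:
        self.items.append(ev)

def evidence_loop(question: str, search, answer, verify, max_rounds: int = 5):
    """Iteratively search, accumulate evidence, and verify before answering.

    `search`, `answer`, and `verify` are placeholder callables standing in
    for the retrieval, synthesis, and self-check components.
    """
    tracker = EvidenceTracker()
    for _ in range(max_rounds):
        # Expand the evidence pool based on what has been gathered so far.
        for ev in search(question, tracker.items):
            tracker.add(ev)
        candidate = answer(question, tracker.items)
        # Only commit to an answer the verifier accepts against the evidence.
        if verify(question, candidate, tracker.items):
            return candidate
    # Appropriate refusal: evidence never became sufficient.
    return None
```

The loop makes the benchmark's two failure modes explicit: synthesis is gated on a verification pass over tracked evidence, and exhausting the budget yields a refusal rather than an unsupported guess.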
Key Contributions
Introduces WebDetective, a benchmark for evaluating LLMs and web agents on hint-free multi-hop deep search tasks. It pairs a controlled Wikipedia sandbox with a holistic evaluation framework that separates search sufficiency, knowledge utilisation, and refusal behaviour, addressing the limitations of current benchmarks and single-score evaluation metrics (see the sketch below).
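To make the three-axis decomposition concrete, here is a hedged sketch of how per-episode outcomes could be separated into the three signals instead of one pass rate. The field names and scoring rules are assumptions for exposition, not WebDetective's actual metrics.

```python
# Illustrative decomposition of a single evaluation episode into the three
# diagnostic axes named above. Field names and rules are assumptions.
from dataclasses import dataclass

@dataclass
class EpisodeResult:
    retrieved_all_gold_pages: bool  # did search surface every needed fact?
    answered_correctly: bool        # did the final answer match the reference?
    refused: bool                   # did the model decline to answer?
    answerable: bool                # is the question answerable in the sandbox?

def diagnose(r: EpisodeResult) -> dict[str, bool | None]:
    return {
        # Search sufficiency: was the required evidence ever retrieved?
        "search_sufficient": r.retrieved_all_gold_pages,
        # Knowledge utilisation: given sufficient evidence, was it used well?
        # Undefined (None) when the evidence was never retrieved.
        "knowledge_utilised": (
            r.answered_correctly if r.retrieved_all_gold_pages else None
        ),
        # Refusal behaviour: refuse if and only if the question is unanswerable.
        "refusal_appropriate": r.refused == (not r.answerable),
    }
```

Under this kind of scoring, a wrong answer after complete retrieval is attributed to knowledge use, not search, and declining to answer an unanswerable question is credited rather than penalised.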
Business Value
Enables more accurate and reliable evaluation of AI systems for complex information retrieval and reasoning tasks, informing the development of better search engines, AI assistants, and autonomous agents.