📄 Abstract
Building training-ready multi-hop question answering (QA) datasets that truly
stress a model's retrieval and reasoning abilities remains highly challenging.
While a few recent evaluation datasets capture
the characteristics of hard-to-search but easy-to-verify problems -- requiring
the integration of ambiguous, indirect, and cross-domain cues -- these data
resources remain scarce and are mostly designed for evaluation, making them
unsuitable for supervised fine-tuning (SFT) or reinforcement learning (RL).
Meanwhile, manually curating non-trivially retrievable questions -- where
answers cannot be found through a single direct query but instead require
multi-hop reasoning over oblique and loosely connected evidence -- incurs
prohibitive human costs and fails to scale, creating a critical data bottleneck
for training high-capability retrieval-and-reasoning agents.
To address this, we present an automated framework for generating
high-difficulty, training-ready multi-hop questions from semi-structured
knowledge sources. The system (i) grows diverse, logically labeled evidence
clusters through Natural Language Inference (NLI)-based relation typing and
diversity-aware expansion; (ii) applies reverse question construction to
compose oblique cues so that isolated signals are underinformative but their
combination uniquely identifies the target entity; and (iii) enforces quality
with a two-step evaluation pipeline that combines multi-model consensus
filtering with structured constraint decomposition and evidence-based matching.
The result is a scalable process that yields complex, retrieval-resistant yet
verifiable questions suitable for SFT/RL training as well as challenging
evaluation, substantially reducing human curation effort while preserving the
difficulty profile of strong evaluation benchmarks.
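The two-step quality pipeline described above can be sketched in plain Python. This is a minimal, hypothetical illustration of the control flow only: the function names, vote thresholds, and the substring-based evidence check are assumptions standing in for the paper's actual judge models and evidence-matching components.

```python
def consensus_filter(candidates, judge_verdicts, min_agree=2):
    """Step 1: multi-model consensus filtering.

    candidates: list of question dicts, each with an 'id' key.
    judge_verdicts: dict mapping question id -> list of bool votes,
                    one per judge model.
    Keeps a question only if at least `min_agree` judges accept it.
    """
    kept = []
    for q in candidates:
        votes = judge_verdicts.get(q["id"], [])
        if sum(votes) >= min_agree:
            kept.append(q)
    return kept


def matches_all_constraints(evidence_texts, constraints):
    """Step 2: structured constraint decomposition + evidence matching.

    A question passes only if every decomposed constraint is supported
    by at least one evidence snippet. A case-insensitive substring test
    stands in here for a real evidence-matching model.
    """
    return all(
        any(c.lower() in e.lower() for e in evidence_texts)
        for c in constraints
    )
```

In practice the boolean votes would come from prompting several LLM judges, and the constraint matcher would be a learned entailment or retrieval model rather than string containment; the sketch only shows how the two stages compose into a filter.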
Authors (9)
Bingsen Qiu
Zijian Liu
Xiao Liu
Haoshen Yang
Zeren Gao
Bingjie Wang
+3 more
Submitted
October 28, 2025
Key Contributions
Presents BMGQ, an automated framework for generating complex multi-hop reasoning questions from semi-structured data. This addresses the critical data bottleneck for training high-capability retrieval-and-reasoning agents by creating difficult, training-ready datasets that are otherwise prohibitively expensive to curate manually.
Business Value
Accelerates the development of more sophisticated AI systems capable of complex reasoning, which can be applied to areas like advanced search engines, expert systems, and automated research assistants.