Abstract: In real-world information-seeking scenarios, users have dynamic and diverse
needs, requiring RAG systems to demonstrate adaptive resilience. To
comprehensively evaluate the resilience of current RAG methods, we introduce
HawkBench, a human-labeled, multi-domain benchmark designed to rigorously
assess RAG performance across categorized task types. By stratifying tasks
based on information-seeking behaviors, HawkBench provides a systematic
evaluation of how well RAG systems adapt to diverse user needs.
Unlike existing benchmarks, which focus primarily on specific task types
(mostly factoid queries) and rely on varying knowledge bases, HawkBench offers:
(1) systematic task stratification to cover a broad range of query types,
including both factoid and rationale queries, (2) integration of multi-domain
corpora across all task types to mitigate corpus bias, and (3) rigorous
annotation for high-quality evaluation.
HawkBench includes 1,600 high-quality test samples, evenly distributed across
domains and task types. Using this benchmark, we evaluate representative RAG
methods, analyzing their performance in terms of answer quality and response
latency. Our findings highlight the need for dynamic task strategies that
integrate decision-making, query interpretation, and global knowledge
understanding to improve RAG generalizability. We believe HawkBench serves as a
pivotal benchmark for advancing the resilience of RAG methods and their ability
to achieve general-purpose information seeking.