Abstract
Step-by-step reasoning has become a standard approach for large language
models (LLMs) to tackle complex tasks. While this paradigm has proven
effective, it raises a fundamental question: How can we verify that an LLM's
reasoning is accurately grounded in knowledge? To address this question, we
introduce a novel evaluation suite that systematically assesses the knowledge
grounding of intermediate reasoning. Our framework comprises three key
components. (1) Principal Knowledge Collection, a large-scale repository of
atomic knowledge essential for reasoning. Based on the collection, we propose
(2) knowledge-grounded evaluation metrics designed to measure how well models
recall and apply prerequisite knowledge in reasoning. These metrics are
computed by our (3) evaluator LLM, a lightweight model optimized for
cost-effective and reliable metric computation. Our evaluation suite
demonstrates remarkable effectiveness in identifying missing or misapplied
knowledge elements, providing crucial insights for uncovering fundamental
reasoning deficiencies in LLMs. Beyond evaluation, we demonstrate how these
metrics can be integrated into preference optimization, showcasing further
applications of knowledge-grounded evaluation.
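Below is a minimal sketch of how knowledge-grounded metrics of this kind might be aggregated, assuming the evaluator LLM emits a per-element judgment (recalled / applied correctly) for each principal knowledge item relevant to a reasoning trace. The dataclass fields, metric names, and aggregation rules are illustrative assumptions, not the paper's exact definitions.

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class KnowledgeJudgment:
    knowledge_id: str        # identifier of one atomic "principal knowledge" element
    recalled: bool           # did the reasoning trace surface this element?
    applied_correctly: bool  # if recalled, was it used correctly in the derivation?

def knowledge_grounded_scores(judgments: List[KnowledgeJudgment]) -> Dict[str, float]:
    """Aggregate per-element judgments into recall / application scores (assumed forms)."""
    if not judgments:
        return {"knowledge_recall": 0.0, "knowledge_application": 0.0}
    # Recall: fraction of prerequisite knowledge elements the trace actually states.
    recall = sum(j.recalled for j in judgments) / len(judgments)
    # Application: among recalled elements, fraction that are used correctly.
    recalled = [j for j in judgments if j.recalled]
    application = (
        sum(j.applied_correctly for j in recalled) / len(recalled) if recalled else 0.0
    )
    return {"knowledge_recall": recall, "knowledge_application": application}
```

Scores like these could also rank alternative reasoning traces for the same question, yielding chosen/rejected pairs for preference optimization along the lines the abstract mentions; that pairing step is likewise an assumption about one plausible integration, not the paper's stated procedure.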
Authors (8)
Hyeon Hwang
Yewon Cho
Chanwoong Yoon
Yein Park
Minju Song
Kyungjae Lee
+2 more
Submitted
November 2, 2025
Key Contributions
Introduces a novel evaluation suite to systematically assess the knowledge grounding of intermediate reasoning steps in LLMs. It comprises a Principal Knowledge Collection, knowledge-grounded evaluation metrics, and an evaluator LLM, enabling cost-effective and reliable verification of whether an LLM's reasoning is grounded in the prerequisite knowledge.
Business Value
Enables more reliable and trustworthy deployment of LLMs for complex tasks by providing a robust method to evaluate their reasoning and verify that it is grounded in the required factual knowledge.