Abstract
Evaluating the reasoning ability of language models (LMs) is complicated by
their extensive parametric world knowledge, where benchmark performance often
reflects factual recall rather than genuine reasoning. Existing datasets and
approaches (e.g., temporal filtering, paraphrasing, adversarial substitution)
cannot cleanly separate the two. We present SynthWorlds, a framework that
disentangles task reasoning complexity from factual knowledge. In SynthWorlds,
we construct parallel corpora representing two worlds with identical
interconnected structure: a real-mapped world, where models may exploit
parametric knowledge, and a synthetic-mapped world, where such knowledge is
meaningless. On top of these corpora, we design two mirrored tasks as case
studies: multi-hop question answering and page navigation, which maintain equal
reasoning difficulty across worlds. Experiments in parametric-only (e.g.,
closed-book QA) and knowledge-augmented (e.g., retrieval-augmented) LM settings
reveal a persistent knowledge advantage gap, defined as the performance boost
models gain from memorized parametric world knowledge. Knowledge acquisition
and integration mechanisms reduce but do not eliminate this gap, highlighting
opportunities for system improvements. Fully automatic and scalable,
SynthWorlds provides a controlled environment for evaluating LMs in ways that
were previously challenging, enabling precise and testable comparisons of
reasoning and memorization.
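
The abstract defines the knowledge advantage gap only in words. The sketch below shows one way it could be computed, assuming the gap is the difference between a model's score on the real-mapped world and its score on the mirrored synthetic-mapped world. The exact-match metric, the function names, and the toy answers are illustrative assumptions, not the paper's implementation or results.

```python
def exact_match_accuracy(predictions, gold):
    """Fraction of predictions equal to the gold answer (case-insensitive)."""
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold must have the same length")
    if not gold:
        raise ValueError("need at least one example")
    hits = sum(p.strip().lower() == g.strip().lower()
               for p, g in zip(predictions, gold))
    return hits / len(gold)


def knowledge_advantage_gap(real_score, synthetic_score):
    """Assumed definition: score on the real-mapped world minus score on the
    synthetic-mapped world for the same mirrored task."""
    return real_score - synthetic_score


# Toy placeholder data (hypothetical, not taken from the paper).
# Both lists answer the same four mirrored questions.
real_gold = ["Paris", "Seine", "France", "Europe"]
real_pred = ["Paris", "Seine", "France", "Asia"]        # 3 of 4 correct
synth_gold = ["Velora", "Druni", "Kasmet", "Oplia"]
synth_pred = ["Velora", "Druni", "Tarn", "Brix"]        # 2 of 4 correct

real_acc = exact_match_accuracy(real_pred, real_gold)
synth_acc = exact_match_accuracy(synth_pred, synth_gold)
gap = knowledge_advantage_gap(real_acc, synth_acc)
print(f"real={real_acc:.2f} synthetic={synth_acc:.2f} gap={gap:.2f}")
```

Under this assumed definition, a positive gap means the model did better when its memorized knowledge was usable, and a gap near zero means performance was driven by what the task itself provides.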
Authors (7)
Ken Gu
Advait Bhat
Mike A Merrill
Robert West
Xin Liu
Daniel McDuff
+1 more
Submitted
October 28, 2025
Key Contributions
Introduces SynthWorlds, a framework for disentangling task reasoning complexity from factual knowledge in language models. It uses parallel corpora representing two worlds (real-mapped and synthetic-mapped) with identical structures but different knowledge bases, allowing for a cleaner evaluation of reasoning capabilities.
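
As a rough illustration of "identical structures but different knowledge bases", the sketch below renames every entity in a small set of (head, relation, tail) triples with a made-up name through a one-to-one mapping, so the graph structure is preserved while the surface names no longer match anything a model could have memorized. The triple representation, the syllable-based name generator, and all function names are assumptions for illustration; the paper's actual construction pipeline is not described in this summary.

```python
import random

# Hypothetical syllable inventory used to invent entity names.
SYLLABLES = ["ka", "lo", "mi", "ta", "ve", "ru", "shi", "dor"]


def build_synthetic_mapping(entities, seed=0):
    """Assign each real entity a unique invented name (a one-to-one mapping)."""
    rng = random.Random(seed)
    mapping, used = {}, set()
    for entity in sorted(entities):  # sorted for reproducibility
        while True:
            name = "".join(rng.choice(SYLLABLES) for _ in range(3)).capitalize()
            if name not in used:
                break
        used.add(name)
        mapping[entity] = name
    return mapping


def map_triples(triples, mapping):
    """Rename heads and tails; relations are left unchanged."""
    return [(mapping[h], r, mapping[t]) for h, r, t in triples]


real_triples = [
    ("Paris", "capital_of", "France"),
    ("Seine", "flows_through", "Paris"),
    ("France", "located_in", "Europe"),
]
entities = {e for h, _, t in real_triples for e in (h, t)}
mapping = build_synthetic_mapping(entities)
synthetic_triples = map_triples(real_triples, mapping)

for real, synth in zip(real_triples, synthetic_triples):
    print(real, "->", synth)
```

Because the mapping is one-to-one and relations are untouched, a multi-hop question that chains two triples in the real world chains the corresponding two triples in the synthetic world, which is the property the mirrored tasks rely on.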
Business Value
By separating reasoning from factual recall, SynthWorlds gives a more accurate and reliable assessment of an AI model's reasoning capabilities, which can guide the development of more trustworthy and less brittle AI systems. This matters most in high-stakes applications.