SynthWorlds: Controlled Parallel Worlds for Disentangling Reasoning and Knowledge in Language Models

📄 Abstract

Evaluating the reasoning ability of language models (LMs) is complicated by their extensive parametric world knowledge, where benchmark performance often reflects factual recall rather than genuine reasoning. Existing datasets and approaches (e.g., temporal filtering, paraphrasing, adversarial substitution) cannot cleanly separate the two. We present SynthWorlds, a framework that disentangles task reasoning complexity from factual knowledge. In SynthWorlds, we construct parallel corpora representing two worlds with identical interconnected structure: a real-mapped world, where models may exploit parametric knowledge, and a synthetic-mapped world, where such knowledge is meaningless. On top of these corpora, we design two mirrored tasks as case studies: multi-hop question answering and page navigation, which maintain equal reasoning difficulty across worlds. Experiments in parametric-only (e.g., closed-book QA) and knowledge-augmented (e.g., retrieval-augmented) LM settings reveal a persistent knowledge advantage gap, defined as the performance boost models gain from memorized parametric world knowledge. Knowledge acquisition and integration mechanisms reduce but do not eliminate this gap, highlighting opportunities for system improvements. Fully automatic and scalable, SynthWorlds provides a controlled environment for evaluating LMs in ways that were previously challenging, enabling precise and testable comparisons of reasoning and memorization.
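
The knowledge advantage gap can be pictured with a minimal sketch. It assumes the gap is computed as the score on the real-mapped task minus the score on the mirrored synthetic-mapped task; the abstract defines the gap only in words, so this formula, the `WorldScores` type, and the numbers below are illustrative assumptions, not the paper's metric or results.

```python
from dataclasses import dataclass


@dataclass
class WorldScores:
    """Accuracy (0..1) of one model on the two mirrored task versions."""
    real_mapped: float       # world where parametric knowledge can help
    synthetic_mapped: float  # world where parametric knowledge is meaningless


def knowledge_advantage_gap(scores: WorldScores) -> float:
    """Assumed definition: real-mapped score minus synthetic-mapped score."""
    return scores.real_mapped - scores.synthetic_mapped


# Made-up numbers for illustration only; they are not results from the paper.
closed_book = WorldScores(real_mapped=0.60, synthetic_mapped=0.25)
retrieval_augmented = WorldScores(real_mapped=0.75, synthetic_mapped=0.60)

for name, s in [("closed-book", closed_book),
                ("retrieval-augmented", retrieval_augmented)]:
    print(f"{name}: gap = {knowledge_advantage_gap(s):.2f}")
```

Under this reading, a gap that shrinks but stays above zero when retrieval is added would match the abstract's claim that knowledge integration reduces but does not eliminate the advantage.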
Authors (7)
Ken Gu
Advait Bhat
Mike A Merrill
Robert West
Xin Liu
Daniel McDuff
+1 more
Submitted
October 28, 2025
arXiv Category
cs.CL

Key Contributions

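Introduces SynthWorlds, a framework for disentangling task reasoning complexity from factual knowledge in language models. It builds parallel corpora for two worlds with identical interconnected structure: a real-mapped world, where models may exploit parametric knowledge, and a synthetic-mapped world, where such knowledge is meaningless. Two mirrored tasks, multi-hop question answering and page navigation, keep reasoning difficulty equal across the worlds, so a performance difference between them points to knowledge rather than reasoning.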
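
One way to picture two worlds with identical structure is a single set of relations rendered with two different sets of names. The sketch below is a toy illustration, not the authors' construction pipeline: the triple format, the entity ids, the synthetic names, and the helper functions are all hypothetical.

```python
# Shared abstract structure: (head_id, relation, tail_id) triples.
STRUCTURE = [
    ("e1", "capital_of", "e2"),
    ("e2", "located_in", "e3"),
]

# Two hypothetical surface-name mappings over the same entity ids.
REAL_NAMES = {"e1": "Paris", "e2": "France", "e3": "Europe"}
SYNTHETIC_NAMES = {"e1": "Velmora", "e2": "Tashkar", "e3": "Ondrel"}


def render_world(structure, names):
    """Replace entity ids with surface names; relations are left unchanged."""
    return [(names[h], r, names[t]) for h, r, t in structure]


def shape(world):
    """Return the world's structure with names replaced by first-seen indices."""
    ids = {}

    def canon(name):
        return ids.setdefault(name, len(ids))

    return [(canon(h), r, canon(t)) for h, r, t in world]


real_world = render_world(STRUCTURE, REAL_NAMES)
synthetic_world = render_world(STRUCTURE, SYNTHETIC_NAMES)

# The two worlds differ in names only, so a two-hop question
# follows the same path of relations in both.
assert shape(real_world) == shape(synthetic_world)
```

In this toy setup a model can answer the real-mapped question from memory, while the synthetic-mapped version can only be answered by following the relations given in the corpus.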

Business Value

Gives teams a controlled way to check whether a model's benchmark score reflects reasoning or recall of memorized facts. Separating the two supports more reliable model assessment, which matters most in high-stakes applications where brittle behavior is costly.