
SynthWorlds: Controlled Parallel Worlds for Disentangling Reasoning and Knowledge in Language Models


📄 Abstract

Evaluating the reasoning ability of language models (LMs) is complicated by their extensive parametric world knowledge, where benchmark performance often reflects factual recall rather than genuine reasoning. Existing datasets and approaches (e.g., temporal filtering, paraphrasing, adversarial substitution) cannot cleanly separate the two. We present SynthWorlds, a framework that disentangles task reasoning complexity from factual knowledge. In SynthWorlds, we construct parallel corpora representing two worlds with identical interconnected structure: a real-mapped world, where models may exploit parametric knowledge, and a synthetic-mapped world, where such knowledge is meaningless. On top of these corpora, we design two mirrored tasks as case studies: multi-hop question answering and page navigation, which maintain equal reasoning difficulty across worlds. Experiments in parametric-only (e.g., closed-book QA) and knowledge-augmented (e.g., retrieval-augmented) LM settings reveal a persistent knowledge advantage gap, defined as the performance boost models gain from memorized parametric world knowledge. Knowledge acquisition and integration mechanisms reduce but do not eliminate this gap, highlighting opportunities for system improvements. Fully automatic and scalable, SynthWorlds provides a controlled environment for evaluating LMs in ways that were previously challenging, enabling precise and testable comparisons of reasoning and memorization.
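The knowledge advantage gap described in the abstract can be illustrated with a small sketch. This assumes the gap is read as a plain difference in accuracy between the two worlds on mirrored questions; the paper may define or normalize it differently. The function names and the sample data below are hypothetical, not taken from the paper.

```python
def accuracy(results):
    """Fraction of correct answers in a non-empty list of booleans."""
    if not results:
        raise ValueError("results must be non-empty")
    return sum(results) / len(results)


def knowledge_advantage_gap(real_results, synth_results):
    """Accuracy on the real-mapped world minus accuracy on the synthetic-mapped world.

    Assumption: both lists cover the same mirrored questions, so any
    difference is attributed to parametric knowledge rather than task difficulty.
    """
    if len(real_results) != len(synth_results):
        raise ValueError("mirrored tasks should have the same number of questions")
    return accuracy(real_results) - accuracy(synth_results)


# Hypothetical per-question correctness for one model on four mirrored questions.
real = [True, True, False, True]    # real-mapped world
synth = [True, False, False, True]  # synthetic-mapped world
gap = knowledge_advantage_gap(real, synth)
print(f"knowledge advantage gap: {gap:.2f}")
```

A positive value under this reading means the model does better when its memorized knowledge applies; a value near zero would suggest performance is driven by reasoning over the provided material.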

Authors (7)

Ken Gu

Advait Bhat

Mike A Merrill

Robert West

Xin Liu

Daniel McDuff

+1 more

Submitted

October 28, 2025

arXiv Category

cs.CL


Key Contributions

Introduces SynthWorlds, a framework for disentangling task reasoning complexity from factual knowledge in language models. It uses parallel corpora representing two worlds (real-mapped and synthetic-mapped) with identical structures but different knowledge bases, allowing for a cleaner evaluation of reasoning capabilities.
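The idea of two worlds with identical structure but different surface knowledge can be sketched as a consistent renaming of entities across a linked corpus. This is only an illustration of the concept, not the authors' pipeline, which this summary does not describe; `build_synthetic_world`, the sample documents, and the invented names are all hypothetical.

```python
import re


def build_synthetic_world(documents, entity_map):
    """Rename every mapped entity consistently across all documents.

    documents: dict of page title -> page text.
    entity_map: dict of real entity name -> synthetic name (hypothetical).
    Relationships between pages are preserved because each name is
    replaced the same way everywhere it appears.
    """
    if not entity_map:
        return dict(documents)
    # Longest names first so a longer name is not partially replaced by a shorter one.
    names = sorted(entity_map, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(name) for name in names))
    return {
        entity_map.get(title, title): pattern.sub(lambda m: entity_map[m.group(0)], text)
        for title, text in documents.items()
    }


# Toy real-world corpus with two interlinked pages.
real_world = {
    "Paris": "Paris is the capital of France.",
    "France": "France borders Spain; its capital is Paris.",
}
# Invented replacement names, chosen so memorized facts no longer apply.
entity_map = {"Paris": "Velmora", "France": "Ostrand", "Spain": "Quillet"}

synthetic_world = build_synthetic_world(real_world, entity_map)
for title, text in synthetic_world.items():
    print(title, "->", text)
```

In this toy version a two-hop question such as "Which country borders the country whose capital is Velmora?" has the same reasoning structure as its real-world counterpart, but cannot be answered from memorized facts.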

Business Value

Provides a more accurate and reliable way to assess the reasoning capabilities of AI models separately from their memorized knowledge, which can inform the development of more trustworthy and less brittle AI systems. This matters most for high-stakes applications.

Paper Metadata

Innovation Type

Framework/Methodology

Deployment Feasibility

High, as it's an evaluation framework and methodology, not a deployable model itself.

Limitations Addressed

The confounding effect of parametric world knowledge on evaluating the reasoning abilities of language models.

Technical Tags

reasoning evaluation, knowledge disentanglement, language models, parallel corpora, synthetic worlds, factual knowledge, multi-hop QA, retrieval-augmented LM

Research Topics

Disentangling Reasoning from Knowledge in LLMs
Evaluating LLM Reasoning Capabilities
Synthetic Data for AI Evaluation
Controlling World Knowledge in LMs

Methods & Architectures

SynthWorlds framework
Construction of parallel corpora (real-mapped vs. synthetic-mapped worlds)
Mirrored tasks (multi-hop QA, page navigation)
Evaluation in parametric-only and knowledge-augmented LM settings

Applications & Tasks

AI Evaluation
Natural Language Processing
Machine Learning Research

Difficulty in separating LLM reasoning from parametric world knowledge
Benchmark performance often reflects factual recall, not genuine reasoning
Existing datasets cannot cleanly separate reasoning and knowledge

Evaluating reasoning ability of LMs
Disentangling knowledge and reasoning

Related Fields

Machine Learning, Artificial Intelligence, Natural Language Understanding, AI Ethics

Keywords

Reasoning, Knowledge, Language Models, Evaluation, Disentanglement, Synthetic Data, Benchmark, World Knowledge, Multi-hop QA, Parametric Knowledge, SynthWorlds

Academic Context

#Disentangling Reasoning from Knowledge in LLMs
#Evaluating LLM Reasoning Capabilities
#Synthetic Data for AI Evaluation
#Controlling World Knowledge in LMs

Commercial Potential

Target Industries

AI Research, Technology Development

Use Case Examples

Assessing the reasoning skills of advanced AI assistants
Developing AI systems that can learn and reason independently of factual recall

Competitive Edge

Offers a novel approach to disentangling reasoning and knowledge, overcoming limitations of existing methods that cannot cleanly separate these two aspects.

Resource Requirements

Compute Needs

Moderate (for running evaluations)

Data Requirements

Requires generation of parallel corpora based on the SynthWorlds framework.

Scalability

The framework is designed to be scalable for generating diverse synthetic worlds and tasks.

Production Readiness

Maturity Level

Research
