Abstract
When language models correctly parse "The cat that the dog chased meowed,"
are they analyzing syntax or simply familiar with dogs chasing cats? Despite
extensive benchmarking, we lack methods to distinguish structural understanding
from semantic pattern matching. We introduce CenterBench, a dataset of 9,720
comprehension questions on center-embedded sentences (like "The cat [that the
dog chased] meowed") where relative clauses nest recursively, creating
processing demands from simple to deeply nested structures. Each sentence has a
syntactically identical but semantically implausible counterpart (e.g., mailmen
prescribe medicine, doctors deliver mail) and six comprehension questions
testing surface understanding, syntactic dependencies, and causal reasoning.
Testing six models reveals that performance gaps between plausible and
implausible sentences widen systematically with complexity, with models showing
median gaps up to 26.8 percentage points, quantifying when they abandon
structural analysis for semantic associations. Notably, semantic plausibility
harms performance on questions about resulting actions, where following causal
relationships matters more than semantic coherence. Reasoning models improve
accuracy but their traces show semantic shortcuts, overthinking, and answer
refusal. Unlike models whose plausibility advantage systematically widens with
complexity, humans show variable semantic effects. CenterBench provides the
first framework to identify when models shift from structural analysis to
pattern matching.
Authors (3)
Sangmitra Madhusudan
Kaige Chen
Ali Emami
Submitted
October 23, 2025
Key Contributions
Introduces CenterBench, a dataset of 9,720 comprehension questions on center-embedded sentences designed to distinguish structural understanding from semantic pattern matching. It quantifies when language models abandon structure for shortcuts by comparing performance on plausible vs. implausible sentences across varying complexity.
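As a rough illustration of the gap metric described above, here is a minimal Python sketch, assuming per-depth accuracy records: it computes the plausible-minus-implausible accuracy gap (in percentage points) at each embedding depth and a median gap across depths. The field names, numbers, and aggregation choice are illustrative assumptions, not the paper's released evaluation code.

    from statistics import median

    # Hypothetical per-depth accuracies; values are illustrative, not from CenterBench.
    results = [
        {"depth": 1, "plausible": 0.94, "implausible": 0.91},
        {"depth": 2, "plausible": 0.88, "implausible": 0.74},
        {"depth": 3, "plausible": 0.81, "implausible": 0.58},
    ]

    def plausibility_gap(record):
        # Gap in percentage points: plausible accuracy minus implausible accuracy.
        return round(100 * (record["plausible"] - record["implausible"]), 1)

    gaps = {r["depth"]: plausibility_gap(r) for r in results}
    print(gaps)                   # {1: 3.0, 2: 14.0, 3: 23.0} -- gap widens with depth
    print(median(gaps.values()))  # 14.0 -- a median-gap summary like the one reported

A widening gap under deeper embedding would indicate, under these assumptions, that a model leans more on semantic plausibility as structural demands grow.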
Business Value
Provides a critical tool for researchers and developers to better understand the true linguistic capabilities of LLMs, leading to more robust and reliable AI systems.