The Dog the Cat Chased Stumped the Model: Measuring When Language Models Abandon Structure for Shortcuts

large-language-models › evaluation
📄 Abstract

When language models correctly parse "The cat that the dog chased meowed," are they analyzing syntax or simply familiar with dogs chasing cats? Despite extensive benchmarking, we lack methods to distinguish structural understanding from semantic pattern matching. We introduce CenterBench, a dataset of 9,720 comprehension questions on center-embedded sentences (like "The cat [that the dog chased] meowed") where relative clauses nest recursively, creating processing demands from simple to deeply nested structures. Each sentence has a syntactically identical but semantically implausible counterpart (e.g., mailmen prescribe medicine, doctors deliver mail) and six comprehension questions testing surface understanding, syntactic dependencies, and causal reasoning. Testing six models reveals that performance gaps between plausible and implausible sentences widen systematically with complexity, with models showing median gaps up to 26.8 percentage points, quantifying when they abandon structural analysis for semantic associations. Notably, semantic plausibility harms performance on questions about resulting actions, where following causal relationships matters more than semantic coherence. Reasoning models improve accuracy but their traces show semantic shortcuts, overthinking, and answer refusal. Unlike models, whose plausibility advantage systematically widens with complexity, humans show variable semantic effects. CenterBench provides the first framework to identify when models shift from structural analysis to pattern matching.
Authors: Sangmitra Madhusudan, Kaige Chen, Ali Emami
Submitted: October 23, 2025
arXiv Category: cs.CL

Key Contributions

Introduces CenterBench, a dataset of 9,720 comprehension questions on center-embedded sentences designed to distinguish structural understanding from semantic pattern matching. It quantifies when language models abandon structure for shortcuts by comparing performance on plausible vs. implausible sentences across varying complexity.
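The central measurement is a per-complexity accuracy gap: how much better a model does on plausible sentences than on their syntactically identical but implausible counterparts at each embedding depth. The sketch below shows one way such a gap could be computed; the record fields (`depth`, `plausible`, `correct`) and the function name are illustrative assumptions, not the paper's actual data schema.

```python
from collections import defaultdict

def plausibility_gaps(results):
    """Accuracy gap (plausible minus implausible), in percentage points,
    at each embedding depth.

    `results` is an iterable of dicts with illustrative fields:
      'depth'     -- int, number of nested relative clauses
      'plausible' -- bool, True for the semantically plausible variant
      'correct'   -- bool, whether the model answered the question correctly
    """
    # tallies[(depth, plausible)] = [num_correct, num_total]
    tallies = defaultdict(lambda: [0, 0])
    for r in results:
        key = (r["depth"], r["plausible"])
        tallies[key][0] += int(r["correct"])
        tallies[key][1] += 1

    gaps = {}
    for depth in sorted({d for d, _ in tallies}):
        acc = {}
        for plausible in (True, False):
            hits, total = tallies[(depth, plausible)]
            acc[plausible] = 100.0 * hits / total if total else float("nan")
        gaps[depth] = acc[True] - acc[False]
    return gaps


# Toy usage: a gap that widens as embedding depth increases.
if __name__ == "__main__":
    toy = [
        {"depth": 1, "plausible": True,  "correct": True},
        {"depth": 1, "plausible": False, "correct": True},
        {"depth": 2, "plausible": True,  "correct": True},
        {"depth": 2, "plausible": False, "correct": False},
    ]
    print(plausibility_gaps(toy))  # {1: 0.0, 2: 100.0}
```

A gap near zero at all depths suggests structural analysis; a gap that grows with depth is the paper's signature of a shift toward semantic pattern matching.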

Business Value

Gives researchers and developers a concrete way to assess whether an LLM's apparent comprehension reflects genuine syntactic analysis or semantic shortcuts, supporting the development of more robust and reliable AI systems.