Abstract
When language models correctly parse "The cat that the dog chased meowed,"
are they analyzing syntax or simply familiar with dogs chasing cats? Despite
extensive benchmarking, we lack methods to distinguish structural understanding
from semantic pattern matching. We introduce CenterBench, a dataset of 9,720
comprehension questions on center-embedded sentences (like "The cat [that the
dog chased] meowed") where relative clauses nest recursively, creating
processing demands from simple to deeply nested structures. Each sentence has a
syntactically identical but semantically implausible counterpart (e.g., mailmen
prescribe medicine, doctors deliver mail) and six comprehension questions
testing surface understanding, syntactic dependencies, and causal reasoning.
Testing six models reveals that performance gaps between plausible and
implausible sentences widen systematically with complexity, with models showing
median gaps up to 26.8 percentage points, quantifying when they abandon
structural analysis for semantic associations. Notably, semantic plausibility
harms performance on questions about resulting actions, where following causal
relationships matters more than semantic coherence. Reasoning models improve
accuracy but their traces show semantic shortcuts, overthinking, and answer
refusal. Unlike models whose plausibility advantage systematically widens with
complexity, humans show variable semantic effects. CenterBench provides the
first framework to identify when models shift from structural analysis to
pattern matching.
Authors (3)
Sangmitra Madhusudan
Kaige Chen
Ali Emami
Submitted
October 23, 2025
Key Contributions
Introduces CenterBench, a dataset of 9,720 comprehension questions on center-embedded sentences designed to distinguish structural understanding from semantic pattern matching. It quantifies when language models abandon structure for shortcuts by comparing performance on plausible vs. implausible sentences across varying complexity.
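As a rough illustration of the gap metric described above, here is a minimal Python sketch, assuming per-depth accuracy records: it computes the plausible-minus-implausible accuracy gap (in percentage points) at each embedding depth and a median gap across depths. The field names, numbers, and aggregation choice are illustrative assumptions, not the paper's released evaluation code.

    from statistics import median

    # Hypothetical per-depth accuracies; values are illustrative, not from CenterBench.
    results = [
        {"depth": 1, "plausible": 0.94, "implausible": 0.91},
        {"depth": 2, "plausible": 0.88, "implausible": 0.74},
        {"depth": 3, "plausible": 0.81, "implausible": 0.58},
    ]

    def plausibility_gap(record):
        # Gap in percentage points: plausible accuracy minus implausible accuracy.
        return round(100 * (record["plausible"] - record["implausible"]), 1)

    gaps = {r["depth"]: plausibility_gap(r) for r in results}
    print(gaps)                   # {1: 3.0, 2: 14.0, 3: 23.0} -- gap widens with depth
    print(median(gaps.values()))  # 14.0 -- a median-gap summary like the one reported

A widening gap under deeper embedding would indicate, under these assumptions, that a model leans more on semantic plausibility as structural demands grow.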
Business Value
Provides a critical tool for researchers and developers to better understand the true linguistic capabilities of LLMs, leading to more robust and reliable AI systems.