Abstract
In recent years, many generalization benchmarks have shown language models'
lack of robustness in natural language inference (NLI). However, manually
creating new benchmarks is costly, while automatically generating high-quality
ones, even by modifying existing benchmarks, is extremely difficult. In this
paper, we propose a methodology for automatically generating high-quality
variants of original NLI problems by replacing open-class words, while
crucially preserving their underlying reasoning. We dub our generalization test
MERGE (Minimal Expression-Replacements GEneralization); it evaluates the
correctness of models' predictions across reasoning-preserving variants of the
original problem. Our results show that NLI models perform 4-20% worse on
variants, suggesting low generalizability even on such minimally altered
problems. We also analyse how the word class of the replacements, word
probability, and plausibility influence NLI models' performance.
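To make the evaluation concrete, the following is a minimal sketch (not the authors' implementation) of the kind of pipeline the abstract describes: open-class words in a premise or hypothesis are swapped for WordNet synonyms to produce variants, and a model is scored on whether its prediction stays correct across them. The `predict_label` callable, the variant budget, and the use of plain synonym substitution are illustrative assumptions; the paper's method additionally constrains replacements to preserve the underlying reasoning.

```python
# Sketch of variant generation + consistency scoring for an NLI problem.
# Requires: nltk with 'punkt', 'averaged_perceptron_tagger', and 'wordnet' data.
import nltk
from nltk.corpus import wordnet as wn

# Open-class (content-word) POS tags: nouns, verbs, adjectives, adverbs.
OPEN_CLASS = {"NN", "NNS", "VB", "VBD", "VBG", "VBN", "VBP", "VBZ", "JJ", "RB"}

def synonym_variants(sentence, max_variants=5):
    """Return sentences with one open-class word replaced by a WordNet synonym."""
    tokens = nltk.word_tokenize(sentence)
    tagged = nltk.pos_tag(tokens)
    variants = []
    for i, (word, tag) in enumerate(tagged):
        if tag not in OPEN_CLASS:
            continue
        for syn in wn.synsets(word):
            for lemma in syn.lemma_names():
                candidate = lemma.replace("_", " ")
                if candidate.lower() != word.lower():
                    variants.append(" ".join(tokens[:i] + [candidate] + tokens[i + 1:]))
                    break  # one replacement per synset is enough for a sketch
            if len(variants) >= max_variants:
                return variants
    return variants

def variant_accuracy(premise, hypothesis, gold_label, predict_label):
    """Fraction of variant problems on which the model still predicts the gold label."""
    pairs = [(p, hypothesis) for p in synonym_variants(premise)]
    pairs += [(premise, h) for h in synonym_variants(hypothesis)]
    if not pairs:
        return None
    correct = sum(predict_label(p, h) == gold_label for p, h in pairs)
    return correct / len(pairs)
```

Comparing `variant_accuracy` against accuracy on the original problems would surface the kind of 4-20% drop the abstract reports, since a model that relies on surface cues rather than the underlying reasoning will flip its prediction under such minimal lexical changes.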
Authors (3)
Mădălina Zgreabăn
Tejaswini Deoskar
Lasha Abzianidze
Submitted
October 28, 2025
Key Contributions
Proposes MERGE, a methodology for automatically generating high-quality variants of NLI problems by replacing open-class words while preserving reasoning. This method addresses the cost and difficulty of manual benchmark creation and reveals that NLI models perform significantly worse on these minimally altered problems, indicating low generalizability.
Business Value
Improves the reliability and trustworthiness of language models by providing a more rigorous evaluation of their generalization capabilities. This can lead to more robust AI systems in applications requiring nuanced language understanding.