Abstract
In recent years, many generalization benchmarks have shown language models'
lack of robustness in natural language inference (NLI). However, manually
creating new benchmarks is costly, while automatically generating high-quality
ones, even by modifying existing benchmarks, is extremely difficult. In this
paper, we propose a methodology for automatically generating high-quality
variants of original NLI problems by replacing open-class words, while
crucially preserving their underlying reasoning. We dub our generalization test
MERGE (Minimal Expression-Replacements GEneralization); it evaluates the
correctness of models' predictions across reasoning-preserving variants of the
original problem. Our results show that NLI models perform 4-20% worse on
variants, suggesting low generalizability even on such minimally altered
problems. We also analyse how the word class of the replacements, word
probability, and plausibility influence NLI models' performance.
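To make the evaluation concrete, the following is a minimal sketch (not the authors' implementation) of the kind of pipeline the abstract describes: open-class words in a premise or hypothesis are swapped for WordNet synonyms to produce variants, and a model is scored on whether its prediction stays correct across them. The `predict_label` callable, the variant budget, and the use of plain synonym substitution are illustrative assumptions; the paper's method additionally constrains replacements to preserve the underlying reasoning.

```python
# Sketch of variant generation + consistency scoring for an NLI problem.
# Requires: nltk with 'punkt', 'averaged_perceptron_tagger', and 'wordnet' data.
import nltk
from nltk.corpus import wordnet as wn

# Open-class (content-word) POS tags: nouns, verbs, adjectives, adverbs.
OPEN_CLASS = {"NN", "NNS", "VB", "VBD", "VBG", "VBN", "VBP", "VBZ", "JJ", "RB"}

def synonym_variants(sentence, max_variants=5):
    """Return sentences with one open-class word replaced by a WordNet synonym."""
    tokens = nltk.word_tokenize(sentence)
    tagged = nltk.pos_tag(tokens)
    variants = []
    for i, (word, tag) in enumerate(tagged):
        if tag not in OPEN_CLASS:
            continue
        for syn in wn.synsets(word):
            for lemma in syn.lemma_names():
                candidate = lemma.replace("_", " ")
                if candidate.lower() != word.lower():
                    variants.append(" ".join(tokens[:i] + [candidate] + tokens[i + 1:]))
                    break  # one replacement per synset is enough for a sketch
            if len(variants) >= max_variants:
                return variants
    return variants

def variant_accuracy(premise, hypothesis, gold_label, predict_label):
    """Fraction of variant problems on which the model still predicts the gold label."""
    pairs = [(p, hypothesis) for p in synonym_variants(premise)]
    pairs += [(premise, h) for h in synonym_variants(hypothesis)]
    if not pairs:
        return None
    correct = sum(predict_label(p, h) == gold_label for p, h in pairs)
    return correct / len(pairs)
```

Comparing `variant_accuracy` against accuracy on the original problems would surface the kind of 4-20% drop the abstract reports, since a model that relies on surface cues rather than the underlying reasoning will flip its prediction under such minimal lexical changes.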
Authors (3)
Mădălina Zgreabăn
Tejaswini Deoskar
Lasha Abzianidze
Submitted
October 28, 2025
Key Contributions
Proposes MERGE, a methodology for automatically generating high-quality variants of NLI problems by replacing open-class words while preserving reasoning. This method addresses the cost and difficulty of manual benchmark creation and reveals that NLI models perform significantly worse on these minimally altered problems, indicating low generalizability.
Business Value
Improves the reliability and trustworthiness of language models by providing a more rigorous evaluation of their generalization capabilities. This can lead to more robust AI systems in applications requiring nuanced language understanding.