📄 Abstract
We present FLUKE (Framework for LingUistically-driven and tasK-agnostic
robustness Evaluation), a framework for assessing model robustness through
systematic minimal variations of test data. FLUKE introduces controlled
variations across linguistic levels -- from orthography to dialect and style --
and leverages large language models (LLMs) with human validation to generate
modifications. We demonstrate FLUKE's utility by evaluating both fine-tuned
models and LLMs across six diverse NLP tasks (four classification and two
generation tasks), and reveal that (1) the impact of linguistic variations is
highly task-dependent, with some tests being critical for certain tasks but
irrelevant for others; (2) LLMs still exhibit significant brittleness to
certain linguistic variations, with reasoning LLMs surprisingly showing less
robustness on some tasks compared to base models; (3) models are overall more
brittle to natural, fluent modifications such as syntax or style changes (and
especially to negation), compared to corruption-style tests such as letter
flipping; (4) the ability of a model to use a linguistic feature in generation
does not correlate with its robustness to this feature on downstream tasks. These
findings highlight the importance of systematic robustness testing for
understanding model behaviors.
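
As a rough illustration of what a corruption-style test such as letter flipping could look like, the sketch below swaps two adjacent letters in one word of each input and measures how often a classifier's prediction changes. This is not FLUKE's implementation: the function names `flip_adjacent_letters` and `prediction_flip_rate`, the toy keyword classifier, and the sample sentences are all hypothetical, and the sketch assumes the variation does not change the correct label.

```python
import random


def flip_adjacent_letters(text, rng):
    """Swap two adjacent inner letters of one randomly chosen word.

    Hypothetical stand-in for a letter-flipping test. Words shorter than
    four letters or containing non-letters are skipped. If the two chosen
    letters are identical, the text comes back unchanged.
    """
    words = text.split()
    candidates = [i for i, w in enumerate(words) if len(w) >= 4 and w.isalpha()]
    if not candidates:
        return text
    i = rng.choice(candidates)
    w = words[i]
    j = rng.randrange(1, len(w) - 2)  # keep the first and last letters in place
    words[i] = w[:j] + w[j + 1] + w[j] + w[j + 2:]
    return " ".join(words)


def prediction_flip_rate(predict, texts, perturb):
    """Fraction of inputs whose prediction changes after the variation.

    Assumes the variation is label-preserving, so any change in the
    prediction counts as a robustness failure.
    """
    if not texts:
        return 0.0
    changed = sum(predict(t) != predict(perturb(t)) for t in texts)
    return changed / len(texts)


if __name__ == "__main__":
    rng = random.Random(0)

    def toy_predict(text):
        # Toy keyword classifier, used only to make the sketch runnable.
        return "positive" if "great" in text.lower() else "negative"

    samples = ["a great film overall", "a dull and slow film"]
    rate = prediction_flip_rate(
        toy_predict, samples, lambda t: flip_adjacent_letters(t, rng)
    )
    print(f"prediction flip rate: {rate:.2f}")
```

A variation that is meant to change the correct label, such as negation, would need the modified examples to carry their own gold labels, so the comparison there would be accuracy on the original data versus accuracy on the modified data rather than a flip rate.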
Authors (7)
Yulia Otmakhova
Hung Thinh Truong
Rahmad Mahendra
Zenan Zhai
Rongxin Zhu
Daniel Beck
+1 more
Key Contributions
FLUKE (Framework for LingUistically-driven and tasK-agnostic robustness Evaluation) is a framework for assessing NLP model robustness through systematic, minimal variations of test data across linguistic levels, from orthography to dialect and style. Its evaluation shows that the impact of linguistic variations is highly task-dependent, that LLMs remain brittle to certain variations, and that models are more brittle to natural, fluent modifications (such as syntax or style changes, and especially negation) than to corruption-style tests such as letter flipping.
Business Value
Improves the reliability and trustworthiness of NLP systems in real-world applications where language use is diverse and variable. This is crucial for applications like customer service bots, content moderation, and translation.