Abstract
Vision-Language-Action (VLA) models are widely used in Embodied AI, enabling
robots to interpret and execute language instructions. However, their
robustness to natural language variability in real-world scenarios has not been
thoroughly investigated. In this work, we present a novel systematic study of
the robustness of state-of-the-art VLA models under linguistic perturbations.
Specifically, we evaluate model performance under two types of instruction
noise: (1) human-generated paraphrasing and (2) the addition of irrelevant
context. We further categorize irrelevant contexts into two groups according to
their length and their semantic and lexical proximity to robot commands. In
this study, we observe consistent performance degradation as the irrelevant context grows longer.
We also show that models remain relatively robust to random context, with a
performance drop within 10%, whereas semantically and lexically similar context
of the same length triggers a quality decline of around 50%. Human paraphrases
of the instructions lead to a drop of nearly 20%. To
mitigate this, we propose an LLM-based filtering framework that extracts core
commands from noisy inputs. Incorporating our filtering step allows models to
recover up to 98.5% of their original performance under noisy conditions.
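The abstract does not include the filtering framework's implementation here, so the snippet below is only a rough sketch of the idea: a general-purpose LLM is prompted to strip irrelevant context from a noisy instruction before the instruction is handed to the VLA policy. The OpenAI-style client, the model name, the prompt wording, and the extract_core_command helper are illustrative assumptions, not the authors' code.

```python
# Minimal sketch of an LLM-based instruction filter, assuming an OpenAI-style
# chat API. The prompt, model name, and helper name are illustrative only.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

FILTER_PROMPT = (
    "You will receive an instruction for a robot that may contain irrelevant "
    "context or verbose phrasing. Return only the core command the robot "
    "should execute, with no additional words."
)

def extract_core_command(noisy_instruction: str, model: str = "gpt-4o-mini") -> str:
    """Ask the LLM to recover the core command from a noisy instruction."""
    response = client.chat.completions.create(
        model=model,
        temperature=0.0,  # deterministic extraction
        messages=[
            {"role": "system", "content": FILTER_PROMPT},
            {"role": "user", "content": noisy_instruction},
        ],
    )
    return response.choices[0].message.content.strip()

if __name__ == "__main__":
    noisy = (
        "By the way, the kitchen timer is quite loud today. Could you pick up "
        "the red mug from the table and place it on the shelf, if you don't mind?"
    )
    core = extract_core_command(noisy)
    print(core)  # e.g. "Pick up the red mug from the table and place it on the shelf."
    # The filtered command, rather than the raw input, would then be passed to
    # the VLA policy, e.g. vla_policy.predict(image=frame, instruction=core)  (hypothetical call)
```

Under the setup described in the abstract, the filtered command replaces the raw noisy instruction as the VLA model's language input, which is how the reported recovery of up to 98.5% of original performance is achieved.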
Key Contributions
Presents a systematic study of the robustness of state-of-the-art VLA models in Embodied AI to linguistic perturbations, specifically human-generated paraphrasing and the addition of irrelevant context. Quantifies performance degradation by context length and type, revealing pronounced sensitivity to semantically and lexically similar irrelevant context, and proposes an LLM-based filtering step that recovers up to 98.5% of original performance under noisy inputs.
Business Value
Crucial for developing reliable embodied AI systems that can operate effectively in diverse and unpredictable real-world environments, enhancing user experience and task success.