📄 Abstract
Recent advances in Vision-and-Language Navigation (VLN) are promising, but their idealized assumptions about robot movement and control fail to reflect the challenges of physically embodied deployment. To bridge this gap, we introduce VLN-PE, a physically realistic VLN platform supporting humanoid, quadruped, and wheeled robots. For the first time, we systematically evaluate several ego-centric VLN methods in physical robotic settings across different technical pipelines, including classification models for single-step discrete action prediction, a diffusion model for dense waypoint prediction, and a training-free, map-based large language model (LLM) integrated with path planning. Our results reveal significant performance degradation caused by the robots' limited observation space, environmental lighting variations, and physical challenges such as collisions and falls; they also expose the locomotion constraints of legged robots in complex environments. VLN-PE is highly extensible, allowing seamless integration of new scenes beyond MP3D and thereby enabling more comprehensive VLN evaluation. Despite the weak generalization of current models under physical deployment, VLN-PE offers a new pathway for improving the cross-embodiment adaptability of VLN models. We hope our findings and tools inspire the community to rethink the limitations of VLN and to advance robust, practical VLN models. The code is available at https://crystalsixone.github.io/vln_pe.github.io/.