Abstract: Robot navigation in dynamic, human-centered environments requires
socially-compliant decisions grounded in robust scene understanding. Recent
Vision-Language Models (VLMs) exhibit promising capabilities such as object
recognition, common-sense reasoning, and contextual understanding, capabilities
that align with the nuanced requirements of social robot navigation. However,
it remains unclear whether VLMs can accurately understand complex social
navigation scenes (e.g., inferring the spatiotemporal relations among agents
and human intentions), which is essential for safe and socially compliant robot
navigation. While some recent works have explored the use of VLMs in social
robot navigation, no existing work systematically evaluates their ability to
meet these necessary conditions. In this paper, we introduce the Social
Navigation Scene Understanding Benchmark (SocialNav-SUB), a Visual Question
Answering (VQA) dataset and benchmark designed to evaluate VLMs for scene
understanding in real-world social robot navigation scenarios. SocialNav-SUB
provides a unified framework for evaluating VLMs against human and rule-based
baselines across VQA tasks requiring spatial, spatiotemporal, and social
reasoning in social robot navigation. Through experiments with state-of-the-art
VLMs, we find that while the best-performing VLM achieves an encouraging
probability of agreeing with human answers, it still underperforms both a
simpler rule-based approach and human consensus baselines, indicating critical
gaps in the social scene understanding of current VLMs. Our benchmark sets the stage for
further research on foundation models for social robot navigation, offering a
framework to explore how VLMs can be tailored to meet real-world social robot
navigation needs. An overview of this paper along with the code and data can be
found at https://larg.github.io/socialnav-sub .
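As a rough illustration of the agreement-based evaluation mentioned above, the sketch below shows one way a "probability of agreeing with human answers" could be computed for multiple-choice VQA items with several human annotators per question. The function name and data layout are assumptions for illustration only, not the benchmark's actual implementation.

```python
# Hypothetical sketch: estimate a model's probability of agreeing with human
# answers on multiple-choice VQA items, given several human annotations per item.
# The names `model_answers` and `human_answers` are illustrative, not part of
# the SocialNav-SUB API.

def agreement_probability(model_answers, human_answers):
    """For each question, compute the fraction of human annotators whose answer
    matches the model's answer, then average that fraction over all questions."""
    per_question = []
    for qid, model_ans in model_answers.items():
        humans = human_answers[qid]  # list of options chosen by human annotators
        matches = sum(1 for h in humans if h == model_ans)
        per_question.append(matches / len(humans))
    return sum(per_question) / len(per_question)

# Example usage with made-up answers:
model = {"q1": "B", "q2": "A"}
humans = {"q1": ["B", "B", "C"], "q2": ["A", "A", "A"]}
print(agreement_probability(model, humans))  # ~0.833
```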