
FlySearch: Exploring how vision-language models explore

πŸ“„ Abstract

The real world is messy and unstructured. Uncovering critical information often requires active, goal-driven exploration. It remains to be seen whether Vision-Language Models (VLMs), which have recently emerged as a popular zero-shot tool for many difficult tasks, can operate effectively in such conditions. In this paper, we answer this question by introducing FlySearch, a 3D, outdoor, photorealistic environment for searching and navigating to objects in complex scenes. We define three sets of scenarios of varying difficulty and observe that state-of-the-art VLMs cannot reliably solve even the simplest exploration tasks, with the gap to human performance widening as the tasks get harder. We identify a set of central causes, ranging from vision hallucination through context misunderstanding to task-planning failures, and we show that some of them can be addressed by finetuning. We publicly release the benchmark, scenarios, and the underlying codebase.
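The evaluation setting the abstract describes is a closed observe-prompt-act loop: the agent receives a camera frame, asks a VLM for the next move, and steps the environment until the target is found or a step budget runs out. The sketch below illustrates that generic loop only; every name in it (DummyEnvironment, query_vlm, the "dx dy dz" action format, the step budget) is a hypothetical stand-in, not the released FlySearch API.

```python
# Illustrative sketch of a VLM exploration loop (NOT the FlySearch API).
# A toy environment stands in for the 3D simulator so the loop runs end to end.

import random
from dataclasses import dataclass


@dataclass
class Observation:
    image: bytes                             # rendered camera frame at the current pose
    position: tuple[float, float, float]     # (x, y, z) of the agent


class DummyEnvironment:
    """Toy stand-in for a 3D search environment."""

    def __init__(self, target=(40.0, -20.0, 0.0)):
        self.target = target
        self.pos = (0.0, 0.0, 50.0)

    def reset(self) -> Observation:
        self.pos = (0.0, 0.0, 50.0)
        return Observation(image=b"", position=self.pos)

    def step(self, dx: float, dy: float, dz: float) -> Observation:
        x, y, z = self.pos
        self.pos = (x + dx, y + dy, z + dz)
        return Observation(image=b"", position=self.pos)

    def target_found(self) -> bool:
        return all(abs(a - b) < 5.0 for a, b in zip(self.pos, self.target))


def query_vlm(obs: Observation, goal: str) -> str:
    """Stand-in for the model call. A real agent would send the frame and the
    goal description to a VLM and parse its textual reply; here we emit a
    random move so the sketch executes."""
    return f"{random.uniform(-10, 10):.1f} {random.uniform(-10, 10):.1f} -5"


def run_episode(env: DummyEnvironment, goal: str, budget: int = 30) -> bool:
    obs = env.reset()
    for _ in range(budget):                  # hard cap on exploration steps
        reply = query_vlm(obs, goal)
        try:
            dx, dy, dz = map(float, reply.split())
        except ValueError:
            continue                         # unparseable reply: forfeit the turn
        obs = env.step(dx, dy, dz)
        if env.target_found():
            return True
    return False                             # budget exhausted without success


if __name__ == "__main__":
    print("success:", run_episode(DummyEnvironment(), "find the red car"))
```

The failure modes the paper identifies map onto this loop: vision hallucination corrupts what the model reads from the frame, context misunderstanding breaks the reply parsing, and planning failures waste the step budget.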
Authors (6)
Adam Pardyl
Dominik Matuszek
Mateusz Przebieracz
Marek Cygan
Bartosz ZieliΕ„ski
Maciej WoΕ‚czyk
Submitted
June 3, 2025
arXiv Category
cs.CV

Key Contributions

Introduces FlySearch, a 3D photorealistic benchmark for evaluating Vision-Language Models (VLMs) on active, goal-driven exploration tasks. The benchmark shows that current state-of-the-art VLMs cannot reliably solve even simple exploration tasks, identifies key failure modes such as vision hallucination, context misunderstanding, and planning failures, and demonstrates that some of them can be addressed by finetuning.

Business Value

Drives the development of more capable embodied AI agents for applications like autonomous delivery, search and rescue, and interactive virtual environments.