Abstract
The real world is messy and unstructured. Uncovering critical information
often requires active, goal-driven exploration. It remains to be seen whether
Vision-Language Models (VLMs), which recently emerged as a popular zero-shot
tool in many difficult tasks, can operate effectively in such conditions. In
this paper, we answer this question by introducing FlySearch, a 3D, outdoor,
photorealistic environment for searching and navigating to objects in complex
scenes. We define three sets of scenarios with varying difficulty and observe
that state-of-the-art VLMs cannot reliably solve even the simplest exploration
tasks, with the gap to human performance increasing as the tasks get harder. We
identify a set of central causes, ranging from vision hallucination, through
context misunderstanding, to task planning failures, and we show that some of
them can be addressed by finetuning. We publicly release the benchmark,
scenarios, and the underlying codebase.
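To make the evaluation setup concrete, the abstract describes a closed-loop protocol of the kind sketched below: the VLM repeatedly receives a rendered view of the scene, proposes a movement, and the episode ends when the model claims to have found the object or exhausts its step budget. This Python sketch is illustrative only; the environment interface, the method names (reset, step, check_success), and the prompt format are assumptions for exposition, not the actual FlySearch API.

# Minimal sketch of one goal-driven search episode with a VLM as the policy.
# All names here (env, query_vlm, parse_action) are hypothetical placeholders.

MAX_STEPS = 20  # assumed step budget per episode

def run_episode(env, query_vlm, parse_action):
    """Run one closed-loop exploration episode; return True on success."""
    observation, goal = env.reset()  # initial camera image + target description
    history = []
    for step in range(MAX_STEPS):
        # Ask the VLM for the next move given the goal, current view, and history.
        prompt = (
            f"You are searching for: {goal}. "
            f"Step {step}: decide where to fly next (dx, dy, dz) or report FOUND."
        )
        reply = query_vlm(prompt, image=observation, history=history)
        action = parse_action(reply)  # e.g. a relative (dx, dy, dz) displacement
        if action == "FOUND":
            return env.check_success()  # success only if the target is truly in view
        observation = env.step(action)  # move the camera and render a new image
        history.append((action, reply))
    return False  # step budget exhausted without locating the object

A loop like this makes the benchmark's difficulty concrete: the model must ground what it sees, remember where it has already looked, and plan movements toward the target, which is exactly where the paper reports failures.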
Authors (6)
Adam Pardyl
Dominik Matuszek
Mateusz Przebieracz
Marek Cygan
Bartosz Zieliński
Maciej Wołczyk
Key Contributions
Introduces FlySearch, a 3D photorealistic benchmark for evaluating Vision-Language Models (VLMs) on active, goal-driven exploration tasks. The benchmark reveals that current state-of-the-art VLMs struggle even with simple exploration, identifies key failure modes such as vision hallucination and task planning failures, and shows that some of them can be addressed through fine-tuning.
Business Value
Drives the development of more capable embodied AI agents for applications like autonomous delivery, search and rescue, and interactive virtual environments.