Abstract: Effectively understanding urban scenes requires fine-grained spatial
reasoning about objects, layouts, and depth cues. However, how well current
vision-language models (VLMs), pretrained on general scenes, transfer these
abilities to the urban domain remains underexplored. To address this gap, we
conduct a comparative study of three off-the-shelf VLMs (BLIP-2, InstructBLIP,
and LLaVA-1.5), evaluating both zero-shot performance and the effects of
fine-tuning with a synthetic VQA dataset specific to urban scenes. We construct
this dataset from segmentation, depth, and object detection predictions of
street-view images, pairing each question with LLM-generated Chain-of-Thought
(CoT) answers for step-by-step reasoning supervision. Results show that while
VLMs perform reasonably well in zero-shot settings, fine-tuning with our
synthetic CoT-supervised dataset substantially boosts performance, especially
for challenging question types such as negation and counterfactuals. This study
introduces urban spatial reasoning as a new challenge for VLMs and demonstrates
synthetic dataset construction as a practical path for adapting general-purpose
models to specialized domains.
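
To make the dataset construction concrete, the sketch below shows one plausible way a spatial question and its Chain-of-Thought answer could be derived from perception outputs on a street-view image. All names and the data schema here (Detection, build_spatial_qa, mean_depth) are illustrative assumptions, not the authors' actual pipeline or format.

```python
# Illustrative sketch: turning detector + depth-map outputs into a
# CoT-supervised VQA example. Schema and field names are assumptions.
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Detection:
    label: str                        # object class from the detector, e.g. "car"
    bbox: Tuple[int, int, int, int]   # (x_min, y_min, x_max, y_max) in pixels
    mean_depth: float                 # average depth (meters) inside the box, from the depth map


def build_spatial_qa(dets: List[Detection]) -> Dict[str, str]:
    """Pair the two nearest detected objects into a 'which is closer?'
    question with a step-by-step (CoT-style) answer string."""
    a, b = sorted(dets, key=lambda d: d.mean_depth)[:2]
    question = f"Which is closer to the camera, the {a.label} or the {b.label}?"
    cot = (
        f"The {a.label} has an estimated depth of about {a.mean_depth:.1f} m, "
        f"while the {b.label} is at about {b.mean_depth:.1f} m. "
        f"Since {a.mean_depth:.1f} < {b.mean_depth:.1f}, the {a.label} is closer."
    )
    return {"question": question, "cot_answer": cot, "answer": a.label}


if __name__ == "__main__":
    # Hypothetical predictions for a single street-view image.
    dets = [
        Detection("car", (120, 300, 260, 420), 8.4),
        Detection("traffic light", (500, 80, 540, 160), 22.7),
    ]
    print(build_spatial_qa(dets))
```

In this template-based framing, an LLM would then be prompted to rephrase or expand the question and the reasoning chain, giving the fine-tuning supervision described in the abstract.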