Abstract
Urban monitoring of public infrastructure (such as waste bins, road signs,
vegetation, sidewalks, and construction sites) poses significant challenges due
to the diversity of objects, environments, and contextual conditions involved.
Current state-of-the-art approaches typically rely on a combination of IoT
sensors and manual inspections, which are costly, difficult to scale, and often
misaligned with citizens' perception formed through direct visual observation.
This raises a critical question: Can machines now "see" like citizens and infer
informed opinions about the condition of urban infrastructure? Vision-Language
Models (VLMs), which integrate visual understanding with natural language
reasoning, have recently demonstrated impressive capabilities in processing
complex visual information, making them a promising technology to address
this challenge. This systematic review investigates the role of VLMs in urban
monitoring, with particular emphasis on zero-shot applications. Following the
PRISMA methodology, we analyzed 32 peer-reviewed studies published between 2021
and 2025 to address four core research questions: (1) What urban monitoring
tasks have been effectively addressed using VLMs? (2) Which VLM architectures
and frameworks are most commonly used and demonstrate superior performance? (3)
What datasets and resources support this emerging field? (4) How are VLM-based
applications evaluated, and what performance levels have been reported?
Key Contributions
This paper provides a comprehensive review of Vision-Language Models (VLMs) for general urban monitoring, highlighting their potential to bridge the gap between machine perception and citizen observation. It evaluates current VLM capabilities in this domain, particularly for zero-shot applications, and outlines a research agenda to advance the field.
Business Value
VLMs offer a scalable and potentially more cost-effective way to monitor urban infrastructure, enabling proactive maintenance, better resource allocation, and improved citizen engagement in city management.