Abstract: Real-time urban traffic surveillance is vital for Intelligent Transportation
Systems (ITS) to ensure road safety, optimize traffic flow, track vehicle
trajectories, and prevent collisions in smart cities. Deploying edge cameras
across urban environments is a standard practice for monitoring road
conditions. However, integrating these with intelligent models requires a
robust understanding of dynamic traffic scenarios and a responsive interface
for user interaction. Although multimodal Large Language Models (LLMs) can
interpret traffic images and generate informative responses, their deployment
on edge devices is infeasible due to their high computational demands. LLM
inference must therefore occur in the cloud, which requires transmitting visual
data from the edge to the cloud; limited bandwidth makes this transmission slow,
introducing delays that compromise real-time performance. To address this challenge, we
propose a semantic communication framework that significantly reduces
transmission overhead. Our method involves detecting Regions of Interest (RoIs)
using YOLOv11, cropping relevant image segments, and converting them into
compact embedding vectors using a Vision Transformer (ViT). These embeddings
are then transmitted to the cloud, where an image decoder reconstructs the
cropped images. The reconstructed images are processed by a multimodal LLM to
generate traffic condition descriptions. This approach achieves a 99.9%
reduction in data transmission size while maintaining an LLM response accuracy
of 89% for reconstructed cropped images, compared to 93% accuracy with original
cropped images. Our results demonstrate the efficiency and practicality of ViT
and LLM-assisted edge-cloud semantic communication for real-time traffic
surveillance.
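
The edge-side stage of the described pipeline (RoI detection, cropping, and ViT encoding) can be illustrated with a minimal sketch. The checkpoint names, detection thresholds, and the choice of the [CLS] token as the transmitted embedding are assumptions for illustration, not the paper's exact configuration.

```python
# Illustrative edge-side sketch: detect RoIs with YOLOv11, crop them, and
# encode each crop into a compact ViT embedding before transmission.
# Model checkpoints below are assumptions, not the paper's configuration.
import torch
from PIL import Image
from ultralytics import YOLO
from transformers import ViTImageProcessor, ViTModel

detector = YOLO("yolo11n.pt")  # assumed YOLOv11 checkpoint
processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224-in21k")
encoder = ViTModel.from_pretrained("google/vit-base-patch16-224-in21k")

def encode_rois(image_path: str) -> list[torch.Tensor]:
    """Detect RoIs in a traffic frame and return one ViT embedding per crop."""
    image = Image.open(image_path).convert("RGB")
    result = detector(image)[0]                 # single-frame inference
    embeddings = []
    for box in result.boxes.xyxy.tolist():      # [x1, y1, x2, y2] per RoI
        crop = image.crop(tuple(box))
        inputs = processor(images=crop, return_tensors="pt")
        with torch.no_grad():
            outputs = encoder(**inputs)
        # Use the [CLS] token as the compact embedding sent to the cloud,
        # where a decoder reconstructs the crop for the multimodal LLM.
        embeddings.append(outputs.last_hidden_state[:, 0, :].squeeze(0))
    return embeddings
```

On the cloud side, the reconstructed crops would then be passed to the multimodal LLM to generate the traffic condition description; the decoder architecture is not specified here, so it is omitted from the sketch.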