Abstract
The remarkable zero-shot reasoning capabilities of large-scale Visual
Language Models (VLMs) on static images have yet to be fully translated to the
video domain. Conventional video understanding models often rely on extensive,
task-specific training on annotated datasets, a process that is both costly and
limited in scalability. This paper introduces a novel, training-free framework
for video understanding that circumvents end-to-end training by synergistically
combining the rich semantic priors of pre-trained VLMs with classic machine
learning algorithms for pattern discovery. Our core idea is to reframe video
understanding as a self-supervised spatio-temporal clustering problem within a
high-dimensional semantic feature space. The proposed pipeline first transforms
a video stream into a semantic feature trajectory using the frozen visual
encoder of a pre-trained VLM. Subsequently, we employ Kernel Temporal
Segmentation (KTS), a robust machine learning technique, to partition the
continuous feature stream into discrete, semantically coherent event segments.
These segments are then subjected to unsupervised density-based clustering to
identify recurring macroscopic scenes and themes throughout the video. By
selecting representative keyframes from each discovered cluster and leveraging
the VLM's generative capabilities for textual description, our framework
automatically produces a structured, multi-modal summary of the video content.
This approach provides an effective, interpretable, and model-agnostic pathway
for zero-shot, automated structural analysis of video content.
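The following is a minimal sketch of the pipeline outlined above, under stated assumptions: CLIP ViT-B/32 (via Hugging Face transformers) stands in for the frozen VLM visual encoder, ruptures' KernelCPD approximates Kernel Temporal Segmentation, and scikit-learn's DBSCAN provides the density-based clustering. The model choices, frame-sampling stride, segment count, and clustering thresholds are illustrative assumptions, not details taken from the paper; the final captioning step would require a generative VLM and is only noted in a comment.

```python
"""Sketch: training-free video structuring via VLM features, kernel
temporal segmentation, and density-based clustering (illustrative only)."""
import cv2
import numpy as np
import ruptures as rpt
import torch
from sklearn.cluster import DBSCAN
from transformers import CLIPModel, CLIPProcessor


def sample_frames(video_path: str, stride: int = 30) -> list[np.ndarray]:
    """Read every `stride`-th frame of the video as an RGB array."""
    cap, frames, idx = cv2.VideoCapture(video_path), [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % stride == 0:
            frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        idx += 1
    cap.release()
    return frames


@torch.no_grad()
def encode_frames(frames, model, processor) -> np.ndarray:
    """Map frames into the frozen encoder's semantic feature space."""
    inputs = processor(images=frames, return_tensors="pt")
    feats = model.get_image_features(**inputs)
    return torch.nn.functional.normalize(feats, dim=-1).cpu().numpy()


def segment_and_cluster(feats: np.ndarray, n_events: int = 8):
    """Kernel change-point segmentation, then density-based scene clustering."""
    # Kernel change-point detection over the feature trajectory
    # (a stand-in for the paper's Kernel Temporal Segmentation step).
    bkps = rpt.KernelCPD(kernel="rbf", min_size=2).fit(feats).predict(n_bkps=n_events - 1)
    segments = list(zip([0] + bkps[:-1], bkps))
    # One mean-pooled descriptor per event segment.
    seg_feats = np.stack([feats[s:e].mean(axis=0) for s, e in segments])
    # Density-based clustering groups recurring scenes/themes across segments.
    labels = DBSCAN(eps=0.3, min_samples=1, metric="cosine").fit_predict(seg_feats)
    # Keyframe = sampled frame closest to its segment's mean descriptor.
    keyframes = [s + int(np.argmax(feats[s:e] @ seg_feats[i]))
                 for i, (s, e) in enumerate(segments)]
    return segments, labels, keyframes


if __name__ == "__main__":
    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
    frames = sample_frames("input.mp4")          # hypothetical input path
    feats = encode_frames(frames, model, processor)
    segments, labels, keyframes = segment_and_cluster(feats)
    for (s, e), lab, kf in zip(segments, labels, keyframes):
        print(f"segment [{s}, {e})  scene-cluster {lab}  keyframe index {kf}")
    # A generative VLM (e.g., BLIP-2 or LLaVA) could then caption each keyframe
    # to assemble the structured, multi-modal summary; omitted here.
```

Because every component is frozen or unsupervised, swapping in a different visual encoder, change-point detector, or clustering algorithm requires no retraining, which is the model-agnostic property the abstract emphasizes.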
Submitted
October 19, 2025
Key Contributions
Introduces a novel, training-free framework for video understanding that leverages pre-trained VLMs and self-supervised spatio-temporal clustering. It reframes video understanding as clustering semantic feature trajectories, enabling zero-shot reasoning without task-specific training.
Business Value
Enables rapid deployment of video analysis capabilities for various applications without the need for large, labeled video datasets or extensive model retraining, reducing development time and cost.