
Xiaoice: Training-Free Video Understanding via Self-Supervised Spatio-Temporal Clustering of Semantic Features

📄 Abstract

The remarkable zero-shot reasoning capabilities of large-scale Vision-Language Models (VLMs) on static images have yet to be fully translated to the video domain. Conventional video understanding models often rely on extensive, task-specific training on annotated datasets, a process that is both costly and limited in scalability. This paper introduces a training-free framework for video understanding that circumvents end-to-end training by combining the rich semantic priors of pre-trained VLMs with classic machine learning algorithms for pattern discovery. Our core idea is to reframe video understanding as a self-supervised spatio-temporal clustering problem in a high-dimensional semantic feature space. The proposed pipeline first transforms a video stream into a semantic feature trajectory using the frozen visual encoder of a pre-trained VLM. We then employ Kernel Temporal Segmentation (KTS), a kernel-based change-point detection technique, to partition the continuous feature stream into discrete, semantically coherent event segments. These segments are subjected to unsupervised density-based clustering to identify recurring macroscopic scenes and themes throughout the video. By selecting representative keyframes from each discovered cluster and leveraging the VLM's generative capabilities for textual description, the framework automatically produces a structured, multi-modal summary of the video content. This approach provides an effective, interpretable, and model-agnostic pathway for zero-shot, automated structural analysis of video.
Authors (2)
Shihao Ji
Zihui Song
Submitted
October 19, 2025
arXiv Category
cs.CV

Key Contributions

Introduces a novel, training-free framework for video understanding that leverages pre-trained VLMs and self-supervised spatio-temporal clustering. It reframes video understanding as clustering semantic feature trajectories, enabling zero-shot reasoning without task-specific training.
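The pipeline can be sketched end to end on synthetic features. This is a minimal illustration, not the paper's implementation: the cosine-similarity boundary detector is a simplified stand-in for KTS, DBSCAN is one possible choice for the unspecified density-based clusterer, and the embeddings are random vectors in place of real frozen-VLM encoder outputs (the captioning step is omitted).

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)

# Synthetic "semantic feature trajectory": three scenes, with scene 0
# recurring later in the video. Real features would come from a frozen
# VLM visual encoder applied frame by frame.
centers = rng.normal(size=(3, 32))
scene_ids = [0, 1, 0, 2]
frames_per_scene = 25
feats = np.vstack([
    centers[s] + 0.05 * rng.normal(size=(frames_per_scene, 32))
    for s in scene_ids
])
feats /= np.linalg.norm(feats, axis=1, keepdims=True)  # unit-normalize

# Step 1 (stand-in for KTS): place an event boundary wherever the
# cosine similarity between consecutive frame embeddings drops.
sims = np.einsum("ij,ij->i", feats[:-1], feats[1:])
boundaries = np.flatnonzero(sims < 0.9) + 1
segments = np.split(np.arange(len(feats)), boundaries)

# Step 2: density-based clustering of segment-mean embeddings to find
# recurring scenes; DBSCAN needs no preset cluster count.
seg_means = np.stack([feats[idx].mean(axis=0) for idx in segments])
labels = DBSCAN(eps=0.5, min_samples=1, metric="cosine").fit_predict(seg_means)

# Step 3: pick one representative keyframe per discovered cluster
# (the frame closest to the cluster centroid); these frames would then
# be passed to the VLM for textual description.
keyframes = {}
for lab in set(labels):
    member_frames = np.concatenate(
        [segments[i] for i in np.flatnonzero(labels == lab)]
    )
    centroid = feats[member_frames].mean(axis=0)
    keyframes[lab] = int(member_frames[np.argmax(feats[member_frames] @ centroid)])

print(len(segments), sorted(set(labels)), keyframes)
```

On this toy input the detector finds four segments, and clustering merges the two occurrences of the repeated scene, so fewer clusters than segments remain, mirroring how the framework surfaces recurring themes.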

Business Value

Enables rapid deployment of video analysis capabilities for various applications without the need for large, labeled video datasets or extensive model retraining, reducing development time and cost.