Enhancing Spatio-Temporal Zero-shot Action Recognition with Language-driven Description Attributes

📄 Abstract

Vision-Language Models (VLMs) have demonstrated impressive capabilities in zero-shot action recognition by learning to associate video embeddings with class embeddings. However, a significant challenge arises when relying solely on action classes to provide semantic context, particularly due to the presence of multi-semantic words, which can introduce ambiguity in understanding the intended concepts of actions. To address this issue, we propose an innovative approach that harnesses web-crawled descriptions, leveraging a large language model to extract relevant keywords. This method reduces the need for human annotators and eliminates the laborious manual process of attribute data creation. Additionally, we introduce a spatio-temporal interaction module designed to focus on objects and action units, facilitating alignment between description attributes and video content. In our zero-shot experiments, our model achieves impressive results, attaining accuracies of 81.0%, 53.1%, and 68.9% on UCF-101, HMDB-51, and Kinetics-600, respectively, underscoring the model's adaptability and effectiveness across various downstream tasks.
Authors: Yehna Kim, Young-Eun Kim, Seong-Whan Lee
Submitted: October 31, 2025
arXiv Category: cs.CV

Key Contributions

This paper improves spatio-temporal zero-shot action recognition by using an LLM to extract keyword attributes from web-crawled descriptions of each action class, which helps disambiguate class names that contain multi-semantic words. The approach reduces reliance on manual attribute annotation. It also introduces a spatio-temporal interaction module that focuses on objects and action units to align the description attributes with video content, improving recognition accuracy.
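The general idea of matching a video embedding against class embeddings enriched with description attributes can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the three-dimensional vectors are hand-made toy values standing in for real encoder outputs, the class labels are invented, and the helper names (`class_embedding`, `zero_shot_predict`) and the simple averaging used to fuse a class name with its attributes are hypothetical. The paper's spatio-temporal interaction module is not modeled here.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def class_embedding(name_vec, attribute_vecs):
    """Hypothetical fusion: average the class-name embedding with its
    attribute embeddings (an assumption, not the paper's method)."""
    vectors = [name_vec] + attribute_vecs
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def zero_shot_predict(video_vec, class_vecs):
    """Return the label whose embedding is most similar to the video."""
    scores = {label: cosine(video_vec, vec) for label, vec in class_vecs.items()}
    return max(scores, key=scores.get), scores


# Toy data: both class names share one ambiguous embedding, mimicking a
# multi-semantic word; only the attributes tell the two classes apart.
ambiguous_name = [0.7, 0.7, 0.1]
classes = {
    "baseball_pitch": class_embedding(ambiguous_name, [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0]]),
    "tent_pitch": class_embedding(ambiguous_name, [[0.0, 1.0, 0.0], [0.1, 0.9, 0.0]]),
}
video = [0.9, 0.2, 0.1]  # toy stand-in for a video encoder output

# Using the class names alone, the two classes are indistinguishable.
_, name_only_scores = zero_shot_predict(
    video, {"baseball_pitch": ambiguous_name, "tent_pitch": ambiguous_name}
)

# With attributes folded in, the scores separate.
prediction, scores = zero_shot_predict(video, classes)
print(prediction, scores)
```

In this toy setup the name-only scores tie, while the attribute-enriched class embeddings give the video a clearly higher similarity to one class. In a real system the vectors would come from the text and video encoders of a VLM, and the attributes from LLM-extracted keywords.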

Business Value

Enables more accurate and flexible video analysis systems that recognize novel actions without training on labeled examples of those actions, which is valuable for surveillance, content moderation, and robotics.