Abstract
Vision-Language Models (VLMs) have demonstrated impressive capabilities in
zero-shot action recognition by learning to associate video embeddings with
class embeddings. However, relying solely on action class names for semantic
context poses a significant challenge: multi-semantic words can make the
intended concept of an action ambiguous. To address this issue, we propose an innovative
approach that harnesses web-crawled descriptions, leveraging a large-language
model to extract relevant keywords. This method reduces the need for human
annotators and eliminates the laborious manual process of attribute data
creation. Additionally, we introduce a spatio-temporal interaction module
designed to focus on objects and action units, facilitating alignment between
description attributes and video content. In our zero-shot experiments, our
model achieves impressive results, attaining accuracies of 81.0%, 53.1%, and
68.9% on UCF-101, HMDB-51, and Kinetics-600, respectively, underscoring the
model's adaptability and effectiveness across various downstream tasks.
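To make the zero-shot recipe referenced in the abstract concrete, the following is a minimal sketch of how a VLM matches a video embedding against text embeddings of candidate classes (or of LLM-derived description attributes) by cosine similarity. The tensors, dimensions, and the 512-dimensional embedding size are placeholders, not the paper's actual encoders or configuration.

```python
# Minimal sketch of VLM-style zero-shot action recognition:
# a video embedding is compared with text embeddings of each class
# via cosine similarity; no task-specific training is involved.
# Random tensors stand in for the outputs of pretrained video/text encoders.
import torch
import torch.nn.functional as F

num_classes, dim = 101, 512                  # e.g. UCF-101 label set; dim is assumed
video_emb = torch.randn(dim)                 # placeholder for an encoded video clip
class_embs = torch.randn(num_classes, dim)   # placeholder for encoded class/attribute text

# L2-normalise so the dot product equals cosine similarity.
video_emb = F.normalize(video_emb, dim=-1)
class_embs = F.normalize(class_embs, dim=-1)

logits = class_embs @ video_emb              # (num_classes,) similarity scores
pred = logits.argmax().item()                # predicted class index
print(f"predicted class index: {pred}")
```

In practice, replacing the single class-name embedding with embeddings of richer description attributes is what the paper argues reduces the ambiguity of multi-semantic class names.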
Authors (3)
Yehna Kim
Young-Eun Kim
Seong-Whan Lee
Submitted
October 31, 2025
Key Contributions
This paper proposes an innovative approach to enhance spatio-temporal zero-shot action recognition by leveraging language-driven description attributes extracted from web-crawled data using LLMs. This method reduces reliance on manual annotation and introduces a spatio-temporal interaction module to align attributes with video content, improving recognition accuracy.
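As an illustration only (not the paper's exact module), a spatio-temporal interaction between description attributes and video content could be realized with cross-attention, where attribute tokens attend over frame/patch features so that relevant objects and action units are emphasised. All shapes, the module name, and the pooling choice below are assumptions made for the sketch.

```python
# Illustrative sketch of aligning LLM-extracted attribute embeddings with
# spatio-temporal video features via cross-attention. This is a generic
# design under assumed shapes, not the paper's published architecture.
import torch
import torch.nn as nn

class AttributeVideoInteraction(nn.Module):
    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, attr_tokens: torch.Tensor, video_tokens: torch.Tensor) -> torch.Tensor:
        # attr_tokens:  (B, A, D) text embeddings of description attributes
        # video_tokens: (B, T*P, D) flattened per-frame patch features
        attended, _ = self.cross_attn(query=attr_tokens, key=video_tokens, value=video_tokens)
        fused = self.norm(attr_tokens + attended)   # residual fusion of attended video context
        return fused.mean(dim=1)                    # pooled, video-grounded attribute embedding

# Toy usage with random placeholder features.
module = AttributeVideoInteraction()
attrs = torch.randn(2, 6, 512)       # 6 attribute phrases per class (assumed)
video = torch.randn(2, 8 * 49, 512)  # 8 frames x 49 patches (assumed)
print(module(attrs, video).shape)    # torch.Size([2, 512])
```

The pooled output could then be scored against video embeddings in the same way as the class embeddings in the earlier sketch.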
Business Value
Enables more accurate and flexible video analysis systems that can recognize novel actions without explicit training, valuable for surveillance, content moderation, and robotics.