📄 Abstract
Zero-shot action recognition, which addresses the challenges of scalability and
generalization in action recognition and allows models to adapt dynamically to
new and unseen actions, is an important research topic in the computer vision
community. The key to zero-shot action recognition lies in aligning visual
features with semantic vectors representing action categories. Most existing
methods either directly project visual features onto the semantic space of text
categories or learn a shared embedding space between the two modalities. However,
a direct projection cannot accurately align the two modalities, and learning a
robust and discriminative embedding space between visual and text
representations is often difficult. To address these issues, we introduce Dual
Visual-Text Alignment (DVTA) for skeleton-based zero-shot action recognition.
DVTA consists of two alignment modules, Direct Alignment (DA) and Augmented
Alignment (AA), along with a Semantic Description Enhancement (SDE) module.
The DA module maps skeleton features into the semantic space through a
specially designed visual projector, after which the SDE applies
cross-attention to strengthen the connection between skeleton and text,
narrowing the gap between modalities. The AA module further improves the
embedding space by using deep metric learning to model the similarity between
skeleton and text. Our approach achieves state-of-the-art performance on
several popular zero-shot skeleton-based action recognition benchmarks. The
code is available at https://github.com/jidongkuang/DVTA.
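As a rough illustration of the two alignment ideas summarized above (not the authors' implementation, which is available at the linked repository), the following NumPy sketch shows a direct projection of skeleton features into a text semantic space scored by cosine similarity, and a single-head cross-attention step of the kind the SDE module is described as using. All function names, weights, and dimensions here are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def direct_alignment(skel_feat, W_proj, text_emb):
    """Project a skeleton feature into the text semantic space (DA-style),
    then score each action category by cosine similarity.

    skel_feat: (d_s,) skeleton feature vector (hypothetical encoder output)
    W_proj:    (d_s, d_t) learned visual projector (here: random stand-in)
    text_emb:  (C, d_t) one semantic vector per action category
    """
    z = skel_feat @ W_proj
    z = z / np.linalg.norm(z)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    return t @ z  # (C,) similarity score per category

def cross_attention(query, keys, values):
    """Single-head cross-attention: queries from one modality attend to
    tokens of the other (a simplified stand-in for the SDE module)."""
    scores = query @ keys.T / np.sqrt(keys.shape[1])
    return softmax(scores) @ values
```

In a zero-shot setting, the category with the highest similarity score is taken as the prediction, including for categories never seen during training, since their text embeddings can be computed from their names or descriptions alone.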