Abstract
Despite the significant progress of fully-supervised video captioning,
zero-shot methods remain much less explored. In this paper, we propose a novel
zero-shot video captioning framework named Retrieval-Enhanced Test-Time
Adaptation (RETTA), which takes advantage of existing pretrained large-scale
vision and language models to directly generate captions with test-time
adaptation. Specifically, we bridge video and text using four key models: a
general video-text retrieval model XCLIP, a general image-text matching model
CLIP, a text alignment model AnglE, and a text generation model GPT-2, due to
their source-code availability. The main challenge is how to enable the text
generation model to be sufficiently aware of the content in a given video so as
to generate corresponding captions. To address this problem, we propose using
learnable tokens as a communication medium among these four frozen models:
GPT-2, XCLIP, CLIP, and AnglE. Unlike the conventional approach of training
these tokens with training data, we propose to learn these tokens with soft
targets of the inference data under several carefully crafted loss functions,
which enable the tokens to absorb video information catered for GPT-2. This
procedure can be efficiently done in just a few iterations (we use 16
iterations in the experiments) and does not require ground truth data.
Extensive experimental results on three widely used datasets, MSR-VTT, MSVD,
and VATEX, show absolute improvements of 5.1%-32.4% in the main metric
CIDEr compared to several state-of-the-art zero-shot video captioning methods.
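To make the test-time adaptation idea concrete, the sketch below shows the general pattern of optimizing a small set of learnable tokens against frozen models for a single test video. It is not the authors' implementation: the encoders are hypothetical stand-ins for the frozen XCLIP/CLIP/AnglE/GPT-2 components, and the single cosine-alignment loss only illustrates the role of the paper's carefully crafted loss functions.

```python
# Minimal sketch of test-time adaptation with learnable tokens.
# NOT the RETTA implementation: FrozenEncoder is a hypothetical stand-in
# for the frozen pretrained models, and the loss is illustrative only.
import torch
import torch.nn as nn
import torch.nn.functional as F

EMB_DIM = 512       # assumed shared embedding size
NUM_TOKENS = 8      # assumed number of learnable tokens
NUM_STEPS = 16      # the paper reports 16 test-time iterations

class FrozenEncoder(nn.Module):
    """Hypothetical stand-in for a frozen pretrained encoder."""
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(EMB_DIM, EMB_DIM)
        for p in self.parameters():
            p.requires_grad_(False)   # all pretrained weights stay frozen

    def forward(self, x):
        return F.normalize(self.proj(x), dim=-1)

def test_time_adapt(video_feat):
    """Optimize learnable tokens so they align with the video embedding."""
    video_encoder = FrozenEncoder()   # stand-in for a video-text retrieval tower
    text_encoder = FrozenEncoder()    # stand-in for a text/matching tower

    # The learnable tokens are the only trainable parameters at test time.
    tokens = nn.Parameter(torch.randn(NUM_TOKENS, EMB_DIM) * 0.02)
    optim = torch.optim.AdamW([tokens], lr=1e-2)

    with torch.no_grad():
        v = video_encoder(video_feat)             # frozen video embeddings

    for _ in range(NUM_STEPS):
        t = text_encoder(tokens).mean(dim=0)      # pooled token embedding
        loss = 1.0 - F.cosine_similarity(t, v, dim=-1).mean()
        optim.zero_grad()
        loss.backward()                           # gradients flow only to tokens
        optim.step()

    # In RETTA, the adapted tokens would then condition the frozen text
    # generator (GPT-2) to produce the caption; that step is omitted here.
    return tokens.detach()

if __name__ == "__main__":
    dummy_video = torch.randn(4, EMB_DIM)   # e.g. 4 sampled frame features
    adapted = test_time_adapt(dummy_video)
    print(adapted.shape)                    # torch.Size([8, 512])
```

Because no ground-truth caption is required and only the small token matrix is updated, the loop above finishes in a handful of iterations per video, which mirrors the efficiency argument made in the abstract.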
Authors (7)
Yunchuan Ma
Laiyun Qing
Guorong Li
Yuankai Qi
Amin Beheshti
Quan Z. Sheng
+1 more
Key Contributions
Proposes RETTA, a novel framework for zero-shot video captioning using test-time adaptation. It effectively bridges video and text by integrating four frozen pre-trained models (XCLIP, CLIP, AnglE, GPT-2) through learnable tokens, enabling caption generation without explicit fine-tuning on target video data.
Business Value
Allows for flexible and efficient video captioning without the need for task-specific training data, making it valuable for platforms dealing with large volumes of user-generated video content.