RETTA: Retrieval-Enhanced Test-Time Adaptation for Zero-Shot Video Captioning

Abstract

Despite significant progress in fully supervised video captioning, zero-shot methods remain much less explored. In this paper, we propose a novel zero-shot video captioning framework named Retrieval-Enhanced Test-Time Adaptation (RETTA), which leverages existing pretrained large-scale vision and language models to directly generate captions via test-time adaptation. Specifically, we bridge video and text using four key models, chosen for their publicly available source code: a general video-text retrieval model (XCLIP), a general image-text matching model (CLIP), a text alignment model (AnglE), and a text generation model (GPT-2). The main challenge is making the text generation model sufficiently aware of the content of a given video so that it generates corresponding captions. To address this, we propose learnable tokens as a communication medium among the four frozen models. Unlike the conventional approach of training such tokens on a training set, we learn them from soft targets derived from the inference data under several carefully crafted loss functions, which let the tokens absorb video information tailored to GPT-2. This procedure runs efficiently in just a few iterations (16 in our experiments) and requires no ground-truth data. Extensive experiments on three widely used datasets, MSR-VTT, MSVD, and VATEX, show absolute improvements of 5.1%-32.4% on the main metric, CIDEr, over several state-of-the-art zero-shot video captioning methods.
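
The test-time adaptation loop can be illustrated with a minimal sketch. Everything below is a simplification for intuition only: `TinyLM` is a hypothetical stand-in for the frozen GPT-2, `video_feat` stands in for an XCLIP video embedding, and a single cosine loss replaces the paper's several carefully crafted losses; only the iteration count (16) comes from the paper.

```python
# Minimal, self-contained sketch of RETTA-style test-time token adaptation.
# TinyLM, video_feat, and the cosine loss are hypothetical stand-ins for the
# paper's frozen GPT-2 / XCLIP / CLIP / AnglE pipeline and its actual losses.
import torch
import torch.nn as nn
import torch.nn.functional as F

EMBED_DIM, NUM_TOKENS, ITERS = 64, 4, 16  # paper reports 16 adaptation steps

class TinyLM(nn.Module):
    """Stand-in for a frozen text model: maps prefix tokens to a pooled state."""
    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, prefix):                # prefix: (num_tokens, dim)
        return self.proj(prefix).mean(dim=0)  # pooled representation, (dim,)

lm = TinyLM(EMBED_DIM)
for p in lm.parameters():                     # all model weights stay frozen
    p.requires_grad_(False)

# Stand-in for the frozen video embedding of the single test video.
video_feat = F.normalize(torch.randn(EMBED_DIM), dim=0)

# The learnable tokens are the only trainable parameters.
tokens = (0.02 * torch.randn(NUM_TOKENS, EMBED_DIM)).requires_grad_()
opt = torch.optim.AdamW([tokens], lr=1e-1)

for step in range(ITERS):
    pooled = lm(tokens)                       # gradients flow only into tokens
    loss = 1.0 - F.cosine_similarity(pooled, video_feat, dim=0)
    opt.zero_grad()
    loss.backward()
    opt.step()

print(f"final alignment loss: {loss.item():.4f}")
# `tokens` would now be prepended to the frozen LM's input for caption decoding.
```

The design choice this mirrors is the paper's core idea: no model weights are updated and no ground-truth captions are used; gradients touch only a small set of prompt tokens, which is why adaptation converges in a handful of steps per video.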
Authors (7)
Yunchuan Ma
Laiyun Qing
Guorong Li
Yuankai Qi
Amin Beheshti
Quan Z. Sheng
+1 more
Submitted
May 11, 2024
arXiv Category
cs.CV

Key Contributions

Proposes RETTA, a novel framework for zero-shot video captioning based on test-time adaptation. It bridges video and text by connecting four frozen pretrained models (XCLIP, CLIP, AnglE, GPT-2) through learnable tokens, enabling caption generation without ground-truth captions or task-specific training; only the tokens are briefly optimized at inference time.

Business Value

Enables flexible, efficient video captioning without task-specific training data, which is valuable for platforms handling large volumes of user-generated video content.