Abstract
Integrating audio and visual data for training multimodal foundation models
remains a challenge. The Audio-Video Vector Alignment (AVVA) framework
addresses this by considering AV scene alignment beyond mere temporal
synchronization, and leveraging Large Language Models (LLMs) for data curation.
AVVA implements a scoring mechanism for selecting aligned training data
segments. It integrates Whisper, a speech-based foundation model, for audio and
DINOv2 for video analysis in a dual-encoder structure with contrastive learning
on AV pairs. Evaluations on AudioCaps, VALOR, and VGGSound demonstrate the
effectiveness of the proposed model architecture and data curation approach.
AVVA achieves significant improvements in top-k accuracies for video-to-audio
retrieval on all three datasets compared to DenseAV, while using only 192 hours
of curated training data. Furthermore, an ablation study indicates that the
data curation process effectively trades data quantity for data quality,
yielding increases in top-k retrieval accuracies on AudioCaps, VALOR, and
VGGSound compared to training on the full, uncurated dataset.
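The abstract gives no implementation details; the following is a minimal, hypothetical sketch of a dual-encoder contrastive setup of the kind described, with frozen backbone stubs standing in for Whisper (audio) and DINOv2 (video), trainable projection heads, and a symmetric InfoNCE loss. All class names, feature dimensions, and the exact loss form are assumptions, not taken from the paper.

```python
# Minimal sketch of a dual-encoder contrastive AV model (illustrative only).
# FrozenBackboneStub stands in for a frozen foundation-model encoder such as
# Whisper (audio) or DINOv2 (video); dimensions and the symmetric InfoNCE
# loss are assumptions, not the AVVA paper's actual implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class FrozenBackboneStub(nn.Module):
    """Placeholder for a frozen pretrained encoder."""

    def __init__(self, in_dim: int, feat_dim: int):
        super().__init__()
        self.net = nn.Linear(in_dim, feat_dim)
        for p in self.parameters():
            p.requires_grad = False  # backbone stays frozen during training

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)


class DualEncoderAV(nn.Module):
    """Projects audio and video features into a shared embedding space."""

    def __init__(self, audio_dim=128, video_dim=256, feat_dim=512, embed_dim=256):
        super().__init__()
        self.audio_backbone = FrozenBackboneStub(audio_dim, feat_dim)
        self.video_backbone = FrozenBackboneStub(video_dim, feat_dim)
        self.audio_proj = nn.Linear(feat_dim, embed_dim)  # trainable head
        self.video_proj = nn.Linear(feat_dim, embed_dim)  # trainable head
        self.logit_scale = nn.Parameter(torch.tensor(2.0))

    def forward(self, audio: torch.Tensor, video: torch.Tensor):
        a = F.normalize(self.audio_proj(self.audio_backbone(audio)), dim=-1)
        v = F.normalize(self.video_proj(self.video_backbone(video)), dim=-1)
        return a, v

    def contrastive_loss(self, a: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
        # Symmetric InfoNCE: matched audio-video pairs lie on the diagonal.
        logits = self.logit_scale.exp() * a @ v.t()
        targets = torch.arange(a.size(0), device=a.device)
        return 0.5 * (F.cross_entropy(logits, targets) +
                      F.cross_entropy(logits.t(), targets))


if __name__ == "__main__":
    model = DualEncoderAV()
    audio_feats = torch.randn(8, 128)   # stand-in for Whisper features
    video_feats = torch.randn(8, 256)   # stand-in for DINOv2 features
    a, v = model(audio_feats, video_feats)
    print("loss:", model.contrastive_loss(a, v).item())
```

At retrieval time, a query video embedding would be ranked against the pool of audio embeddings by cosine similarity, which is the usual way top-k video-to-audio retrieval accuracy of the kind reported above is computed.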
Authors (3)
Ali Vosoughi
Dimitra Emmanouilidou
Hannes Gamper
Key Contributions
This paper introduces the AVVA framework, which uses LLMs for data curation to train a data-efficient audio-video foundation model. It achieves AV scene alignment beyond temporal synchronization and demonstrates significant improvements in video-to-audio retrieval using only 192 hours of curated data, showing quality can trump quantity.
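As a rough illustration of the kind of threshold-based curation such LLM scoring enables, the sketch below keeps only audio-video segments whose alignment score clears a cutoff. The `score_alignment` callable, the 0-10 scale, and the threshold are hypothetical stand-ins, not the paper's actual LLM prompt or scoring rule.

```python
# Illustrative curation loop: retain only AV segments whose alignment score
# passes a threshold. `score_alignment` is a hypothetical stand-in for an
# LLM-based scorer; the 0-10 scale and threshold are assumptions.
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class AVSegment:
    video_id: str
    start_s: float
    end_s: float
    audio_caption: str   # description of the audio content
    visual_caption: str  # description of the visual content


def curate(segments: List[AVSegment],
           score_alignment: Callable[[AVSegment], float],
           threshold: float = 7.0) -> List[AVSegment]:
    """Return only segments the scorer judges audio-visually aligned."""
    kept = []
    for seg in segments:
        score = score_alignment(seg)  # e.g. an LLM rating on a 0-10 scale
        if score >= threshold:
            kept.append(seg)
    return kept


if __name__ == "__main__":
    segs = [
        AVSegment("vid1", 0.0, 10.0, "a dog barking", "a dog running in a yard"),
        AVSegment("vid2", 5.0, 15.0, "studio voice-over", "silent landscape footage"),
    ]

    # Toy scorer: caption word overlap as a crude proxy for AV alignment.
    def toy_scorer(seg: AVSegment) -> float:
        a, v = set(seg.audio_caption.split()), set(seg.visual_caption.split())
        return 10.0 * len(a & v) / max(len(a | v), 1)

    print([s.video_id for s in curate(segs, toy_scorer, threshold=1.0)])
```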
Business Value
Enables the creation of more powerful multimodal AI systems that can understand and process video and audio content more effectively, leading to applications in content analysis, recommendation systems, and surveillance.