Abstract
Integrating audio and visual data for training multimodal foundation models
remains a challenge. The Audio-Video Vector Alignment (AVVA) framework
addresses this by considering AV scene alignment beyond mere temporal
synchronization, and leveraging Large Language Models (LLMs) for data curation.
AVVA implements a scoring mechanism for selecting aligned training data
segments. It integrates Whisper, a speech-based foundation model, for audio and
DINOv2 for video analysis in a dual-encoder structure with contrastive learning
on AV pairs. Evaluations on AudioCaps, VALOR, and VGGSound demonstrate the
effectiveness of the proposed model architecture and data curation approach.
AVVA achieves significant improvements in top-k accuracies for video-to-audio
retrieval on all three datasets compared to DenseAV, while using only 192 hours
of curated training data. Furthermore, an ablation study indicates that the
data curation process effectively trades data quantity for data quality,
yielding increases in top-k retrieval accuracies on AudioCaps, VALOR, and
VGGSound compared to training on the full, uncurated dataset.
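The abstract gives no implementation details; the following is a minimal, hypothetical sketch of a dual-encoder contrastive setup of the kind described, with frozen backbone stubs standing in for Whisper (audio) and DINOv2 (video), trainable projection heads, and a symmetric InfoNCE loss. All class names, feature dimensions, and the exact loss form are assumptions, not taken from the paper.

```python
# Minimal sketch of a dual-encoder contrastive AV model (illustrative only).
# FrozenBackboneStub stands in for a frozen foundation-model encoder such as
# Whisper (audio) or DINOv2 (video); dimensions and the symmetric InfoNCE
# loss are assumptions, not the AVVA paper's actual implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class FrozenBackboneStub(nn.Module):
    """Placeholder for a frozen pretrained encoder."""

    def __init__(self, in_dim: int, feat_dim: int):
        super().__init__()
        self.net = nn.Linear(in_dim, feat_dim)
        for p in self.parameters():
            p.requires_grad = False  # backbone stays frozen during training

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)


class DualEncoderAV(nn.Module):
    """Projects audio and video features into a shared embedding space."""

    def __init__(self, audio_dim=128, video_dim=256, feat_dim=512, embed_dim=256):
        super().__init__()
        self.audio_backbone = FrozenBackboneStub(audio_dim, feat_dim)
        self.video_backbone = FrozenBackboneStub(video_dim, feat_dim)
        self.audio_proj = nn.Linear(feat_dim, embed_dim)  # trainable head
        self.video_proj = nn.Linear(feat_dim, embed_dim)  # trainable head
        self.logit_scale = nn.Parameter(torch.tensor(2.0))

    def forward(self, audio: torch.Tensor, video: torch.Tensor):
        a = F.normalize(self.audio_proj(self.audio_backbone(audio)), dim=-1)
        v = F.normalize(self.video_proj(self.video_backbone(video)), dim=-1)
        return a, v

    def contrastive_loss(self, a: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
        # Symmetric InfoNCE: matched audio-video pairs lie on the diagonal.
        logits = self.logit_scale.exp() * a @ v.t()
        targets = torch.arange(a.size(0), device=a.device)
        return 0.5 * (F.cross_entropy(logits, targets) +
                      F.cross_entropy(logits.t(), targets))


if __name__ == "__main__":
    model = DualEncoderAV()
    audio_feats = torch.randn(8, 128)   # stand-in for Whisper features
    video_feats = torch.randn(8, 256)   # stand-in for DINOv2 features
    a, v = model(audio_feats, video_feats)
    print("loss:", model.contrastive_loss(a, v).item())
```

At retrieval time, a query video embedding would be ranked against the pool of audio embeddings by cosine similarity, which is the usual way top-k video-to-audio retrieval accuracy of the kind reported above is computed.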
Authors (3)
Ali Vosoughi
Dimitra Emmanouilidou
Hannes Gamper
Key Contributions
This paper introduces the AVVA framework, which uses LLMs for data curation to train a data-efficient audio-video foundation model. It achieves AV scene alignment beyond temporal synchronization and demonstrates significant improvements in video-to-audio retrieval using only 192 hours of curated data, showing quality can trump quantity.
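As a rough illustration of the kind of threshold-based curation such LLM scoring enables, the sketch below keeps only audio-video segments whose alignment score clears a cutoff. The `score_alignment` callable, the 0-10 scale, and the threshold are hypothetical stand-ins, not the paper's actual LLM prompt or scoring rule.

```python
# Illustrative curation loop: retain only AV segments whose alignment score
# passes a threshold. `score_alignment` is a hypothetical stand-in for an
# LLM-based scorer; the 0-10 scale and threshold are assumptions.
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class AVSegment:
    video_id: str
    start_s: float
    end_s: float
    audio_caption: str   # description of the audio content
    visual_caption: str  # description of the visual content


def curate(segments: List[AVSegment],
           score_alignment: Callable[[AVSegment], float],
           threshold: float = 7.0) -> List[AVSegment]:
    """Return only segments the scorer judges audio-visually aligned."""
    kept = []
    for seg in segments:
        score = score_alignment(seg)  # e.g. an LLM rating on a 0-10 scale
        if score >= threshold:
            kept.append(seg)
    return kept


if __name__ == "__main__":
    segs = [
        AVSegment("vid1", 0.0, 10.0, "a dog barking", "a dog running in a yard"),
        AVSegment("vid2", 5.0, 15.0, "studio voice-over", "silent landscape footage"),
    ]

    # Toy scorer: caption word overlap as a crude proxy for AV alignment.
    def toy_scorer(seg: AVSegment) -> float:
        a, v = set(seg.audio_caption.split()), set(seg.visual_caption.split())
        return 10.0 * len(a & v) / max(len(a | v), 1)

    print([s.video_id for s in curate(segs, toy_scorer, threshold=1.0)])
```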
Business Value
Enables the creation of more powerful multimodal AI systems that can understand and process video and audio content more effectively, leading to applications in content analysis, recommendation systems, and surveillance.