
Foley Control: Aligning a Frozen Latent Text-to-Audio Model to Video

Abstract

Foley Control is a lightweight approach to video-guided Foley that keeps pretrained single-modality models frozen and learns only a small cross-attention bridge between them. We connect V-JEPA2 video embeddings to a frozen Stable Audio Open DiT text-to-audio (T2A) model by inserting compact video cross-attention after the model's existing text cross-attention, so prompts set global semantics while video refines timing and local dynamics. The frozen backbones retain strong marginals (video; audio given text) and the bridge learns the audio-video dependency needed for synchronization -- without retraining the audio prior. To cut memory and stabilize training, we pool video tokens before conditioning. On curated video-audio benchmarks, Foley Control delivers competitive temporal and semantic alignment with far fewer trainable parameters than recent multi-modal systems, while preserving prompt-driven controllability and production-friendly modularity (swap/upgrade encoders or the T2A backbone without end-to-end retraining). Although we focus on Video-to-Foley, the same bridge design can potentially extend to other audio modalities (e.g., speech).
Authors (6)
Ciara Rowles
Varun Jampani
Simon Donné
Shimon Vainer
Julian Parker
Zach Evans
Submitted: October 24, 2025
arXiv Category: cs.CV

Key Contributions

Foley Control introduces a lightweight approach to video-guided Foley generation by freezing pretrained single-modality models and learning only a small cross-attention bridge. This method effectively aligns text-to-audio models with video content, preserving prompt-driven control and modularity while achieving competitive temporal and semantic alignment with significantly fewer trainable parameters.
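The bridge design described above can be sketched as a small residual cross-attention layer inserted after a frozen block's text cross-attention, with the video tokens mean-pooled first to reduce memory. The following is a minimal single-head numpy sketch under assumed shapes, not the authors' implementation; all function names, the weight layout, and the pooling factor are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(q_tokens, kv_tokens, Wq, Wk, Wv):
    # Single-head cross-attention: query tokens (e.g. audio latents)
    # attend over conditioning tokens (text or pooled video).
    Q = q_tokens @ Wq
    K = kv_tokens @ Wk
    V = kv_tokens @ Wv
    scores = (Q @ K.T) / np.sqrt(Q.shape[-1])
    return softmax(scores, axis=-1) @ V

def dit_block_with_bridge(x, text_tokens, video_tokens,
                          frozen_weights, bridge_weights, pool=4):
    # Frozen text cross-attention from the pretrained T2A model
    # (weights are never updated during bridge training).
    x = x + cross_attention(x, text_tokens, *frozen_weights)
    # Pool video tokens along the sequence axis before conditioning,
    # cutting memory for the new attention layer.
    n = video_tokens.shape[0] // pool * pool
    pooled = video_tokens[:n].reshape(-1, pool, video_tokens.shape[-1]).mean(axis=1)
    # Trainable video cross-attention bridge, inserted right after
    # the existing text cross-attention; only these weights learn.
    x = x + cross_attention(x, pooled, *bridge_weights)
    return x
```

Because the bridge is purely additive and residual, zero-initializing its output projection would leave the frozen text-to-audio behavior intact at the start of training, and swapping the video encoder only requires retraining this small layer.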

Business Value

Enables more efficient and controllable generation of sound effects for videos, potentially reducing production costs and time for content creators and filmmakers.