
Foley Control: Aligning a Frozen Latent Text-to-Audio Model to Video

Abstract

Foley Control is a lightweight approach to video-guided Foley that keeps pretrained single-modality models frozen and learns only a small cross-attention bridge between them. We connect V-JEPA2 video embeddings to a frozen Stable Audio Open DiT text-to-audio (T2A) model by inserting compact video cross-attention after the model's existing text cross-attention, so prompts set global semantics while video refines timing and local dynamics. The frozen backbones retain strong marginals (video; audio given text) and the bridge learns the audio-video dependency needed for synchronization -- without retraining the audio prior. To cut memory and stabilize training, we pool video tokens before conditioning. On curated video-audio benchmarks, Foley Control delivers competitive temporal and semantic alignment with far fewer trainable parameters than recent multi-modal systems, while preserving prompt-driven controllability and production-friendly modularity (swap/upgrade encoders or the T2A backbone without end-to-end retraining). Although we focus on Video-to-Foley, the same bridge design can potentially extend to other audio modalities (e.g., speech).
Authors (6)
Ciara Rowles
Varun Jampani
Simon Donné
Shimon Vainer
Julian Parker
Zach Evans
Submitted: October 24, 2025
arXiv Category: cs.CV

Key Contributions

Foley Control introduces a lightweight approach to video-guided Foley generation by freezing pretrained single-modality models and learning only a small cross-attention bridge. This method effectively aligns text-to-audio models with video content, preserving prompt-driven control and modularity while achieving competitive temporal and semantic alignment with significantly fewer trainable parameters.
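The bridge design described above can be sketched as a small residual cross-attention layer inserted after a frozen block's text cross-attention, with the video tokens mean-pooled first to reduce memory. The following is a minimal single-head numpy sketch under assumed shapes, not the authors' implementation; all function names, the weight layout, and the pooling factor are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(q_tokens, kv_tokens, Wq, Wk, Wv):
    # Single-head cross-attention: query tokens (e.g. audio latents)
    # attend over conditioning tokens (text or pooled video).
    Q = q_tokens @ Wq
    K = kv_tokens @ Wk
    V = kv_tokens @ Wv
    scores = (Q @ K.T) / np.sqrt(Q.shape[-1])
    return softmax(scores, axis=-1) @ V

def dit_block_with_bridge(x, text_tokens, video_tokens,
                          frozen_weights, bridge_weights, pool=4):
    # Frozen text cross-attention from the pretrained T2A model
    # (weights are never updated during bridge training).
    x = x + cross_attention(x, text_tokens, *frozen_weights)
    # Pool video tokens along the sequence axis before conditioning,
    # cutting memory for the new attention layer.
    n = video_tokens.shape[0] // pool * pool
    pooled = video_tokens[:n].reshape(-1, pool, video_tokens.shape[-1]).mean(axis=1)
    # Trainable video cross-attention bridge, inserted right after
    # the existing text cross-attention; only these weights learn.
    x = x + cross_attention(x, pooled, *bridge_weights)
    return x
```

Because the bridge is purely additive and residual, zero-initializing its output projection would leave the frozen text-to-audio behavior intact at the start of training, and swapping the video encoder only requires retraining this small layer.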

Business Value

Enables more efficient and controllable generation of sound effects for videos, potentially reducing production costs and time for content creators and filmmakers.