Foley Control introduces a lightweight approach to video-guided Foley generation by freezing pretrained single-modality models and learning only a small cross-attention bridge. This method effectively aligns text-to-audio models with video content, preserving prompt-driven control and modularity while achieving competitive temporal and semantic alignment with significantly fewer trainable parameters.
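The frozen-backbone design described above can be sketched in PyTorch. The module names, dimensions, and placeholder encoders below are illustrative assumptions, not the paper's actual architecture: audio latents from a frozen text-to-audio model act as queries, frozen video features act as keys and values, and only the small cross-attention bridge is trainable.

```python
import torch
import torch.nn as nn

class CrossAttentionBridge(nn.Module):
    """Small trainable bridge: audio latents attend to frozen video features.
    Layer sizes here are illustrative, not taken from the paper."""
    def __init__(self, audio_dim=256, video_dim=512, num_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(
            embed_dim=audio_dim, kdim=video_dim, vdim=video_dim,
            num_heads=num_heads, batch_first=True)
        self.norm = nn.LayerNorm(audio_dim)

    def forward(self, audio_latents, video_features):
        # Queries come from the audio model; keys/values from video frames.
        attended, _ = self.attn(audio_latents, video_features, video_features)
        return self.norm(audio_latents + attended)  # residual connection

# Stand-ins for the pretrained single-modality backbones, which stay frozen.
video_encoder = nn.Linear(1024, 512)   # placeholder for a video encoder
audio_model = nn.Linear(64, 256)       # placeholder for a text-to-audio stage
for p in video_encoder.parameters():
    p.requires_grad = False
for p in audio_model.parameters():
    p.requires_grad = False

bridge = CrossAttentionBridge()  # the only parameters that receive gradients

frames = torch.randn(2, 16, 1024)      # 2 clips, 16 video frames each
audio_in = torch.randn(2, 32, 64)      # 2 clips, 32 audio latent tokens each
out = bridge(audio_model(audio_in), video_encoder(frames))
print(out.shape)  # torch.Size([2, 32, 256])
```

Because the backbones are frozen, the trainable parameter count is limited to the bridge, which is what keeps the approach lightweight and lets either backbone be swapped without retraining the other.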
This enables more efficient and controllable generation of sound effects for video, potentially reducing production costs and turnaround time for content creators and filmmakers.