FreeGraftor: Training-Free Cross-Image Feature Grafting for Subject-Driven Text-to-Image Generation

Abstract

Subject-driven image generation aims to synthesize novel scenes that faithfully preserve subject identity from reference images while adhering to textual guidance. However, existing methods struggle with a critical trade-off between fidelity and efficiency. Tuning-based approaches rely on time-consuming and resource-intensive, subject-specific optimization, while zero-shot methods often fail to maintain adequate subject consistency. In this work, we propose FreeGraftor, a training-free framework that addresses these limitations through cross-image feature grafting. Specifically, FreeGraftor leverages semantic matching and position-constrained attention fusion to transfer visual details from reference subjects to the generated images. Additionally, our framework introduces a novel noise initialization strategy to preserve the geometry priors of reference subjects, facilitating robust feature matching. Extensive qualitative and quantitative experiments demonstrate that our method enables precise subject identity transfer while maintaining text-aligned scene synthesis. Without requiring model fine-tuning or additional training, FreeGraftor significantly outperforms existing zero-shot and training-free approaches in both subject fidelity and text alignment. Furthermore, our framework can seamlessly extend to multi-subject generation, making it practical for real-world deployment. Our code is available at https://github.com/Nihukat/FreeGraftor.
Authors (7)
Zebin Yao
Lei Ren
Huixing Jiang
Chen Wei
Xiaojie Wang
Ruifan Li
+1 more
Submitted
April 22, 2025
arXiv Category
cs.CV

Key Contributions

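FreeGraftor is a training-free framework for subject-driven text-to-image generation that targets the trade-off between subject fidelity and efficiency. It grafts features across images: semantic matching and position-constrained attention fusion transfer visual details from the reference subject to the generated image, and a novel noise initialization strategy preserves the reference subject's geometry priors so that feature matching is more robust. According to the abstract, this yields better subject fidelity and text alignment than existing zero-shot and training-free approaches, with no fine-tuning or additional training.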
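
To make the idea of matching features and then transferring them under a position constraint more concrete, here is a toy sketch in plain Python. It is not FreeGraftor's implementation: the summary does not give the algorithm's details, so the cosine-similarity nearest-neighbor matching, the `threshold` value, the boolean `subject_mask`, and the function names `match_features` and `graft` are all assumptions made for illustration. The sketch also copies features directly, whereas the paper describes fusing them inside attention.

```python
import math


def cosine(a, b):
    """Cosine similarity between two feature vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def match_features(gen_feats, ref_feats, threshold=0.8):
    """Hypothetical semantic matching: for each generated-image feature,
    return the index of the most similar reference feature, or None if
    the best similarity falls below the (assumed) threshold."""
    matches = []
    for g in gen_feats:
        sims = [cosine(g, r) for r in ref_feats]
        best = max(range(len(sims)), key=sims.__getitem__)
        matches.append(best if sims[best] >= threshold else None)
    return matches


def graft(gen_feats, ref_feats, matches, subject_mask):
    """Hypothetical position-constrained transfer: copy the matched reference
    feature only at positions inside the subject mask; keep the generated
    feature everywhere else."""
    out = []
    for g, m, inside in zip(gen_feats, matches, subject_mask):
        use_ref = inside and m is not None
        out.append(list(ref_feats[m]) if use_ref else list(g))
    return out


# Toy data: three generated-image features, two reference features.
gen = [[1.0, 0.1], [0.0, 1.0], [0.9, 0.2]]
ref = [[2.0, 0.0], [0.0, 3.0]]
mask = [True, True, False]  # the third position lies outside the subject region

matches = match_features(gen, ref)
grafted = graft(gen, ref, matches, mask)
print(matches)
print(grafted)
```

In this toy run the third position has a good match but sits outside the mask, so it keeps its generated feature. This is the role a position constraint plays in the sketch: matched details are transferred only where the subject is expected to appear.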

Business Value

Because it needs no subject-specific fine-tuning, FreeGraftor can speed up the creation of personalized images for marketing, design, and entertainment, reducing both manual editing and computationally expensive training.