Abstract: Semantic segmentation models trained on synthetic data often perform poorly
on real-world images due to domain gaps, particularly in adverse conditions
where labeled data is scarce. Yet, recent foundation models make it possible to
generate realistic images without any training. This paper proposes to leverage
such diffusion models to improve the performance of vision models trained on
synthetic data. We introduce two novel techniques for semantically consistent
style transfer using diffusion models: Class-wise Adaptive Instance
Normalization and Cross-Attention (CACTI) and its extension with selective
attention Filtering (CACTIF). CACTI applies statistical normalization
selectively based on semantic classes, while CACTIF further filters
cross-attention maps based on feature similarity, preventing artifacts in
regions with weak cross-attention correspondences. Our methods transfer style
characteristics while preserving semantic boundaries and structural coherence,
unlike approaches that apply global transformations or generate content without
constraints. Experiments using GTA5 as source and Cityscapes/ACDC as target
domains show that our approach produces higher quality images with lower FID
scores and better content preservation. Our work demonstrates that class-aware
diffusion-based style transfer effectively bridges the synthetic-to-real domain
gap even with minimal target domain data, advancing robust perception systems
for challenging real-world applications. The source code is available at:
https://github.com/echigot/cactif.
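
As a rough illustration of the class-wise normalization idea described in the abstract, the sketch below restricts ordinary AdaIN statistics to pixels that share a semantic label, so style statistics are matched per class rather than globally. The helper name `classwise_adain`, the tensor layout, and the mask handling are assumptions made for illustration only; the released code at the repository above is the authoritative implementation, which applies this idea inside the diffusion model's feature space together with cross-attention filtering.

```python
import torch

def classwise_adain(content_feat, style_feat, content_mask, style_mask, eps=1e-5):
    """Hypothetical sketch of class-wise AdaIN (not the authors' released code).

    content_feat, style_feat: (C, H, W) feature maps.
    content_mask, style_mask: (H, W) integer semantic label maps aligned with the features.
    Returns content features whose per-class channel statistics match the style features.
    """
    out = content_feat.clone()
    for cls in torch.unique(content_mask):
        c_idx = content_mask == cls            # pixels of this class in the content image
        s_idx = style_mask == cls              # pixels of this class in the style image
        if c_idx.sum() < 2 or s_idx.sum() < 2: # class too small or absent: leave untouched
            continue
        c_px = content_feat[:, c_idx]          # (C, Nc) content features for the class
        s_px = style_feat[:, s_idx]            # (C, Ns) style features for the class
        c_mu, c_std = c_px.mean(1, keepdim=True), c_px.std(1, keepdim=True) + eps
        s_mu, s_std = s_px.mean(1, keepdim=True), s_px.std(1, keepdim=True)
        # AdaIN restricted to the class mask: normalize with content stats, rescale with style stats
        out[:, c_idx] = (c_px - c_mu) / c_std * s_std + s_mu
    return out
```

In this per-class formulation, statistics from one semantic class (e.g., sky) cannot bleed into another (e.g., road), which is the property the abstract credits for preserving semantic boundaries during style transfer.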