📄 Abstract
Recent progress in text-to-image (T2I) generative models has led to
significant improvements in generating high-quality images aligned with text
prompts. However, these models still struggle with prompts involving multiple
objects, often resulting in object neglect or object mixing. Through extensive
studies, we identify four problematic scenarios (Similar Shapes, Similar
Textures, Dissimilar Background Biases, and Many Objects) in which inter-object
relationships frequently lead to such failures. Motivated by two key
observations about CLIP embeddings, we propose DOS (Directional Object
Separation), a method that modifies three types of CLIP text embeddings before
passing them into text-to-image models. Experimental results show that DOS
consistently improves the success rate of multi-object image generation and
reduces object mixing. In human evaluations, DOS significantly outperforms four
competing methods, receiving 26.24%-43.04% more votes across four benchmarks.
These results highlight DOS as a practical and effective solution for improving
multi-object image generation.
Authors (5)
Dongnam Byun
Jungwon Park
Jumgmin Ko
Changin Choi
Wonjong Rhee
Submitted
October 16, 2025
Key Contributions
DOS (Directional Object Separation) is proposed to improve multi-object image generation by modifying CLIP text embeddings before they are fed into text-to-image models. It addresses common failures like object neglect and mixing by better separating directional cues for individual objects, significantly improving generation success rates.
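To make the idea of "directional separation" concrete, here is a minimal toy sketch, not the paper's actual DOS algorithm: given two similar object embeddings, each is nudged away from the other along their normalized difference direction, so the conditioning cues for the two objects become more distinct before being passed to the generator. The function name, vectors, and the `alpha` strength parameter are all hypothetical illustrations.

```python
import numpy as np

def separate_embeddings(e_a, e_b, alpha=0.5):
    """Hypothetical sketch: push two object embeddings apart along the
    normalized direction between them, increasing their separation."""
    direction = e_a - e_b
    direction = direction / (np.linalg.norm(direction) + 1e-8)  # unit vector
    return e_a + alpha * direction, e_b - alpha * direction

# Two similar toy "object embeddings" (e.g., for objects prone to mixing).
e_cat = np.array([1.0, 0.0, 0.0])
e_dog = np.array([0.8, 0.6, 0.0])

sep_cat, sep_dog = separate_embeddings(e_cat, e_dog, alpha=0.5)

# The separated pair is farther apart than the original pair.
print(np.linalg.norm(sep_cat - sep_dog) > np.linalg.norm(e_cat - e_dog))  # True
```

In an actual T2I pipeline, such a modification would be applied to the CLIP text embeddings before they condition the diffusion model; the real method operates on three types of CLIP text embeddings, whose details are given in the paper.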
Business Value
Enables the creation of more accurate and controllable visual content from text descriptions, valuable for graphic design, advertising, and personalized content generation.