Diffusion Classifiers Understand Compositionality, but Conditions Apply

📄 Abstract

Understanding visual scenes is fundamental to human intelligence. While discriminative models have significantly advanced computer vision, they often struggle with compositional understanding. In contrast, recent generative text-to-image diffusion models excel at synthesizing complex scenes, suggesting inherent compositional capabilities. Building on this, zero-shot diffusion classifiers have been proposed to repurpose diffusion models for discriminative tasks. While prior work offered promising results in discriminative compositional scenarios, these results remain preliminary due to a small number of benchmarks and a relatively shallow analysis of conditions under which the models succeed. To address this, we present a comprehensive study of the discriminative capabilities of diffusion classifiers on a wide range of compositional tasks. Specifically, our study covers three diffusion models (SD 1.5, 2.0, and, for the first time, 3-m) spanning 10 datasets and over 30 tasks. Further, we shed light on the role that target dataset domains play in respective performance; to isolate the domain effects, we introduce a new diagnostic benchmark Self-Bench comprised of images created by diffusion models themselves. Finally, we explore the importance of timestep weighting and uncover a relationship between domain gap and timestep sensitivity, particularly for SD3-m. To sum up, diffusion classifiers understand compositionality, but conditions apply! Code and dataset are available at https://github.com/eugene6923/Diffusion-Classifiers-Compositionality.
Authors (4)
Yujin Jeong
Arnas Uselis
Seong Joon Oh
Anna Rohrbach
Submitted
May 23, 2025
arXiv Category
cs.CV

Key Contributions

This paper presents a comprehensive study of how well diffusion classifiers handle compositional understanding, covering three models (SD 1.5, SD 2.0, and SD3-m). It moves beyond earlier preliminary results by evaluating more than 30 tasks drawn from 10 datasets and by identifying the conditions under which these models succeed in discriminative compositional scenarios, including the effect of the target dataset domain and of timestep weighting.
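To make the idea of a zero-shot diffusion classifier with timestep weighting concrete, here is a minimal sketch in pure Python. It is not the authors' implementation: the prototype vectors, the noise schedule, and the function `toy_noise_predictor` are hypothetical stand-ins for a real text-conditioned diffusion model. The sketch only illustrates the general decision rule, which is to noise the image at several timesteps, ask the model to predict that noise under each candidate prompt, and pick the prompt with the lowest weighted prediction error.

```python
import math
import random

# Hypothetical toy "concepts": each prompt maps to a prototype vector.
# A real diffusion classifier would condition a trained denoiser on text instead.
PROTOTYPES = {
    "a red cube": [1.0, 0.0, 0.0],
    "a blue cube": [0.0, 0.0, 1.0],
    "a red sphere": [1.0, 1.0, 0.0],
}


def alpha_bar(t, num_steps):
    """Toy noise schedule (an assumption): close to 1 for small t, close to 0 for large t."""
    return math.cos(0.5 * math.pi * t / (num_steps + 1)) ** 2


def toy_noise_predictor(x_t, t, prompt, num_steps):
    """Hypothetical denoiser: assumes the clean image equals the prompt's prototype."""
    a = alpha_bar(t, num_steps)
    proto = PROTOTYPES[prompt]
    return [(x - math.sqrt(a) * p) / math.sqrt(1.0 - a) for x, p in zip(x_t, proto)]


def classify(x0, prompts, timesteps, weights, num_steps, rng):
    """Return the prompt with the lowest timestep-weighted noise prediction error."""
    scores = {p: 0.0 for p in prompts}
    total_weight = sum(weights)
    for t, w in zip(timesteps, weights):
        a = alpha_bar(t, num_steps)
        # Sample noise once per timestep and share it across all prompts.
        eps = [rng.gauss(0.0, 1.0) for _ in x0]
        x_t = [math.sqrt(a) * x + math.sqrt(1.0 - a) * e for x, e in zip(x0, eps)]
        for p in prompts:
            eps_hat = toy_noise_predictor(x_t, t, p, num_steps)
            err = sum((e - eh) ** 2 for e, eh in zip(eps, eps_hat))
            scores[p] += w * err / total_weight
    best = min(scores, key=scores.get)
    return best, scores


rng = random.Random(0)
num_steps = 10
timesteps = list(range(1, num_steps + 1))
uniform_weights = [1.0] * num_steps
image = [0.9, 0.1, 0.05]  # toy "image", closest to the "a red cube" prototype

label, scores = classify(image, list(PROTOTYPES), timesteps, uniform_weights, num_steps, rng)
print(label)
```

Replacing `uniform_weights` with a non-uniform list changes how much each timestep contributes to the score. In this toy setup the ranking of prompts does not depend on the weights, whereas the abstract reports that real models, SD3-m in particular, can be sensitive to this choice.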

Business Value

The study clarifies how reliable generative models are, and where they fall short, when they are repurposed for discriminative tasks. This matters for applications that require robust visual understanding and reasoning.