Abstract
Understanding visual scenes is fundamental to human intelligence. While
discriminative models have significantly advanced computer vision, they often
struggle with compositional understanding. In contrast, recent generative
text-to-image diffusion models excel at synthesizing complex scenes, suggesting
inherent compositional capabilities. Building on this, zero-shot diffusion
classifiers have been proposed to repurpose diffusion models for discriminative
tasks. While prior work offered promising results in discriminative
compositional scenarios, these results remain preliminary due to a small number
of benchmarks and a relatively shallow analysis of conditions under which the
models succeed. To address this, we present a comprehensive study of the
discriminative capabilities of diffusion classifiers on a wide range of
compositional tasks. Specifically, our study covers three diffusion models (SD
1.5, 2.0, and, for the first time, 3-m) spanning 10 datasets and over 30 tasks.
Further, we shed light on the role that the target dataset domain plays in
performance; to isolate domain effects, we introduce a new
diagnostic benchmark, Self-Bench, comprising images created by
diffusion models themselves. Finally, we explore the importance of timestep
weighting and uncover a relationship between domain gap and timestep
sensitivity, particularly for SD3-m. To sum up, diffusion classifiers
understand compositionality, but conditions apply! Code and dataset are
available at
https://github.com/eugene6923/Diffusion-Classifiers-Compositionality.
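For intuition, here is a minimal sketch of the zero-shot diffusion-classifier scoring rule the abstract refers to: each candidate caption is scored by the diffusion model's noise-prediction error on the image, and the lowest-error caption wins. It assumes the Hugging Face diffusers StableDiffusionPipeline API; the model id, image size, and number of noise samples are illustrative choices, not the authors' exact implementation.

```python
# A minimal sketch of zero-shot diffusion classification (hedged: assumes the
# Hugging Face diffusers StableDiffusionPipeline API; model id, image size, and
# sample counts are illustrative, not the paper's setup).
import torch
from diffusers import StableDiffusionPipeline

device = "cuda" if torch.cuda.is_available() else "cpu"
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5"
).to(device)

@torch.no_grad()
def encode_image(image_tensor):
    """Map a [1, 3, 512, 512] image scaled to [-1, 1] into the VAE latent space."""
    latent = pipe.vae.encode(image_tensor.to(device)).latent_dist.sample()
    return latent * pipe.vae.config.scaling_factor

@torch.no_grad()
def diffusion_classifier(latent, prompts, n_samples=32):
    """Score each candidate prompt by its average noise-prediction error over
    random timesteps; the prompt with the lowest error is the prediction."""
    scores = []
    for prompt in prompts:
        tokens = pipe.tokenizer(
            prompt,
            padding="max_length",
            max_length=pipe.tokenizer.model_max_length,
            truncation=True,
            return_tensors="pt",
        ).to(device)
        text_emb = pipe.text_encoder(tokens.input_ids)[0]
        errors = []
        for _ in range(n_samples):
            # Uniform timestep sampling; the paper studies re-weighting errors
            # across timesteps (timestep weighting) instead of this plain mean.
            t = torch.randint(
                0, pipe.scheduler.config.num_train_timesteps, (1,), device=device
            )
            noise = torch.randn_like(latent)
            noisy = pipe.scheduler.add_noise(latent, noise, t)
            pred = pipe.unet(noisy, t, encoder_hidden_states=text_emb).sample
            errors.append(torch.mean((pred - noise) ** 2).item())
        scores.append(sum(errors) / len(errors))
    return int(torch.tensor(scores).argmin())
```

In use, an image would be preprocessed to a [1, 3, 512, 512] tensor in [-1, 1], encoded with encode_image, and scored against the candidate compositional captions (e.g. the correct caption and its hard negatives); the paper's timestep-weighting analysis replaces the uniform average over t with a weighted one.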
Authors
Yujin Jeong
Arnas Uselis
Seong Joon Oh
Anna Rohrbach
Key Contributions
This paper presents a comprehensive study on the compositional understanding capabilities of diffusion classifiers across multiple models (SD 1.5, 2.0, 3-m) and datasets. It moves beyond preliminary results by analyzing performance across over 30 tasks and 10 datasets, identifying the conditions under which these models succeed in discriminative compositional scenarios.
Business Value
Helps in understanding the reliability and limitations of generative models when repurposed for discriminative tasks, crucial for applications requiring robust visual understanding and reasoning.