Abstract
We learn visual features by captioning images with an image-conditioned masked diffusion language model, a formulation we call masked diffusion captioning (MDC). During training, text tokens in each image-caption pair are masked at a randomly chosen ratio, and a decoder conditioned on visual features is trained to reconstruct the original text. After training, the learned visual features can be applied to downstream vision tasks. Unlike autoregressive captioning, the strength of the visual learning signal in MDC does not depend on each token's position in the sequence, reducing the need for auxiliary objectives. Linear probing experiments across a variety of academic-scale models and datasets show that the learned visual features are competitive with those produced by autoregressive and contrastive approaches.
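The training objective described above lends itself to a compact sketch. The following is a minimal, illustrative PyTorch implementation of an image-conditioned masked-diffusion captioning step, not the authors' code: `MASK_ID`, `VOCAB_SIZE`, the model dimensions, and the use of precomputed ViT-style patch features are all placeholder assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Placeholder constants -- the paper does not specify these values.
MASK_ID = 0          # hypothetical [MASK] token id
VOCAB_SIZE = 1000    # hypothetical vocabulary size


class MaskedDiffusionCaptioner(nn.Module):
    """Toy image-conditioned masked diffusion language model."""

    def __init__(self, d_model=256, n_heads=4, n_layers=2, vis_dim=768):
        super().__init__()
        self.tok_emb = nn.Embedding(VOCAB_SIZE, d_model)
        self.vis_proj = nn.Linear(vis_dim, d_model)  # stand-in for a visual encoder
        layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, n_layers)
        self.head = nn.Linear(d_model, VOCAB_SIZE)

    def forward(self, tokens, image_feats):
        x = self.tok_emb(tokens)              # (B, T, D) text embeddings
        mem = self.vis_proj(image_feats)      # (B, P, D) image conditioning
        # No causal mask: the decoder attends bidirectionally over the text
        # and cross-attends to the image features.
        h = self.decoder(x, mem)
        return self.head(h)                   # (B, T, V) logits


def mdc_training_step(model, tokens, image_feats):
    """Mask a random fraction of caption tokens, then predict the
    originals conditioned on the image (cross-entropy on masked slots)."""
    B, T = tokens.shape
    ratio = torch.rand(B, 1, device=tokens.device)          # per-example mask ratio
    mask = torch.rand(B, T, device=tokens.device) < ratio   # positions to corrupt
    corrupted = tokens.masked_fill(mask, MASK_ID)
    logits = model(corrupted, image_feats)
    # Loss is computed only on masked positions, so it does not favor
    # any particular position in the sequence.
    return F.cross_entropy(logits[mask], tokens[mask])


# Smoke test with random data.
model = MaskedDiffusionCaptioner()
tokens = torch.randint(1, VOCAB_SIZE, (8, 32))   # fake caption token ids
image_feats = torch.randn(8, 196, 768)           # fake ViT-style patch features
loss = mdc_training_step(model, tokens, image_feats)
loss.backward()
```

Because every token position is masked with the same probability, the gradient reaching the visual features is uniform over the caption; this is the property the abstract contrasts with autoregressive captioning, where early tokens see less visual context than later ones.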
Authors (3)
Chao Feng
Zihao Wei
Andrew Owens
Submitted
October 30, 2025
Key Contributions
This paper proposes Masked Diffusion Captioning (MDC), a novel approach that learns visual features by captioning images with an image-conditioned masked diffusion language model. During training, MDC masks a random fraction of the text tokens and trains a decoder, conditioned on visual features, to reconstruct them, decoupling the learning signal from token position. Linear probes on the resulting features perform competitively with autoregressive and contrastive approaches on downstream vision tasks.
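For context, the linear probing mentioned above means fitting a linear classifier on frozen visual features. The sketch below illustrates that standard evaluation protocol in generic terms; the `visual_encoder` interface, feature dimension, and optimizer settings are assumptions for illustration, not details from the paper.

```python
import torch
import torch.nn as nn


def linear_probe(visual_encoder, train_loader, feat_dim, num_classes, epochs=10):
    """Fit a linear classifier on frozen features (standard linear-probe
    evaluation; all hyperparameters here are illustrative)."""
    visual_encoder.eval()
    for p in visual_encoder.parameters():
        p.requires_grad_(False)                 # the learned encoder stays frozen
    probe = nn.Linear(feat_dim, num_classes)
    opt = torch.optim.AdamW(probe.parameters(), lr=1e-3)
    ce = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for images, labels in train_loader:
            with torch.no_grad():
                feats = visual_encoder(images)  # assumed to return (B, feat_dim)
            loss = ce(probe(feats), labels)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return probe
```

Because only the linear layer is trained, probe accuracy directly reflects how linearly separable the frozen MDC features are, which is why it is a common yardstick for comparing representation-learning objectives.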
Business Value
Enables the development of more powerful AI models for image understanding and generation, leading to better image search, content moderation, and automated image description services.