
Masked Diffusion Captioning for Visual Feature Learning

📄 Abstract

We learn visual features by captioning images with an image-conditioned masked diffusion language model, a formulation we call masked diffusion captioning (MDC). During training, text tokens in each image-caption pair are masked at a randomly chosen ratio, and a decoder conditioned on visual features is trained to reconstruct the original text. After training, the learned visual features can be applied to downstream vision tasks. Unlike autoregressive captioning, the strength of the visual learning signal in MDC does not depend on each token's position in the sequence, reducing the need for auxiliary objectives. Linear probing experiments across a variety of academic-scale models and datasets show that the learned visual features are competitive with those produced by autoregressive and contrastive approaches.
Authors (3)
Chao Feng
Zihao Wei
Andrew Owens
Submitted
October 30, 2025
arXiv Category
cs.CV
arXiv PDF

Key Contributions

This paper proposes Masked Diffusion Captioning (MDC), a novel approach to learning visual features with an image-conditioned masked diffusion language model. During training, MDC masks caption tokens at a randomly chosen ratio and trains a decoder, conditioned on visual features, to reconstruct them, which decouples the strength of the learning signal from token position. The resulting features achieve competitive performance on downstream vision tasks compared with autoregressive and contrastive approaches.
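As a toy illustration, the corruption step described above (masking caption tokens at a randomly drawn ratio, with loss computed only on the masked positions) might be sketched as follows. All names here (`MASK_ID`, `mask_caption`) are hypothetical, not from the paper, and the image-conditioned diffusion decoder itself is omitted:

```python
import random

MASK_ID = 0  # hypothetical id for the [MASK] token


def mask_caption(token_ids, ratio, rng):
    """Replace a random subset of caption tokens with [MASK].

    Returns the corrupted sequence and the masked positions,
    which are the only positions the decoder is trained
    (and thus the only ones that contribute to the loss).
    """
    n = len(token_ids)
    k = max(1, round(ratio * n))  # at least one token is masked
    masked_positions = sorted(rng.sample(range(n), k))
    corrupted = list(token_ids)
    for i in masked_positions:
        corrupted[i] = MASK_ID
    return corrupted, masked_positions


rng = random.Random(42)
caption = [17, 52, 8, 91, 33, 4, 76, 28]   # token ids of one caption
ratio = rng.uniform(0.1, 1.0)              # ratio drawn fresh each training step
corrupted, targets = mask_caption(caption, ratio, rng)
```

In a full training loop, `corrupted` would be fed to a decoder conditioned on the image's visual features, with a reconstruction loss applied at the `targets` positions; because the mask ratio is resampled every step, every token position receives a learning signal of comparable strength over training.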

Business Value

Enables the development of more powerful AI models for image understanding and generation, leading to better image search, content moderation, and automated image description services.