Abstract
We learn visual features by captioning images with an image-conditioned masked diffusion language model, a formulation we call masked diffusion captioning (MDC). During training, text tokens in each image-caption pair are masked at a randomly chosen ratio, and a decoder conditioned on visual features is trained to reconstruct the original text. After training, the learned visual features can be applied to downstream vision tasks. Unlike autoregressive captioning, the strength of the visual learning signal in MDC does not depend on each token's position in the sequence, reducing the need for auxiliary objectives. Linear probing experiments across a variety of academic-scale models and datasets show that the learned visual features are competitive with those produced by autoregressive and contrastive approaches.
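The training objective described above lends itself to a compact sketch. The following is a minimal, illustrative PyTorch implementation of an image-conditioned masked-diffusion captioning step, not the authors' code: `MASK_ID`, `VOCAB_SIZE`, the model dimensions, and the use of precomputed ViT-style patch features are all placeholder assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Placeholder constants -- the paper does not specify these values.
MASK_ID = 0          # hypothetical [MASK] token id
VOCAB_SIZE = 1000    # hypothetical vocabulary size


class MaskedDiffusionCaptioner(nn.Module):
    """Toy image-conditioned masked diffusion language model."""

    def __init__(self, d_model=256, n_heads=4, n_layers=2, vis_dim=768):
        super().__init__()
        self.tok_emb = nn.Embedding(VOCAB_SIZE, d_model)
        self.vis_proj = nn.Linear(vis_dim, d_model)  # stand-in for a visual encoder
        layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, n_layers)
        self.head = nn.Linear(d_model, VOCAB_SIZE)

    def forward(self, tokens, image_feats):
        x = self.tok_emb(tokens)              # (B, T, D) text embeddings
        mem = self.vis_proj(image_feats)      # (B, P, D) image conditioning
        # No causal mask: the decoder attends bidirectionally over the text
        # and cross-attends to the image features.
        h = self.decoder(x, mem)
        return self.head(h)                   # (B, T, V) logits


def mdc_training_step(model, tokens, image_feats):
    """Mask a random fraction of caption tokens, then predict the
    originals conditioned on the image (cross-entropy on masked slots)."""
    B, T = tokens.shape
    ratio = torch.rand(B, 1, device=tokens.device)          # per-example mask ratio
    mask = torch.rand(B, T, device=tokens.device) < ratio   # positions to corrupt
    corrupted = tokens.masked_fill(mask, MASK_ID)
    logits = model(corrupted, image_feats)
    # Loss is computed only on masked positions, so it does not favor
    # any particular position in the sequence.
    return F.cross_entropy(logits[mask], tokens[mask])


# Smoke test with random data.
model = MaskedDiffusionCaptioner()
tokens = torch.randint(1, VOCAB_SIZE, (8, 32))   # fake caption token ids
image_feats = torch.randn(8, 196, 768)           # fake ViT-style patch features
loss = mdc_training_step(model, tokens, image_feats)
loss.backward()
```

Because every token position is masked with the same probability, the gradient reaching the visual features is uniform over the caption; this is the property the abstract contrasts with autoregressive captioning, where early tokens see less visual context than later ones.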
Authors (3)
Chao Feng
Zihao Wei
Andrew Owens
Submitted
October 30, 2025
Key Contributions
This paper proposes Masked Diffusion Captioning (MDC), a novel approach that learns visual features by captioning images with an image-conditioned masked diffusion language model. During training, MDC masks a random fraction of the text tokens and trains a decoder, conditioned on visual features, to reconstruct them, decoupling the learning signal from token position. Linear probes on the resulting features perform competitively with autoregressive and contrastive approaches on downstream vision tasks.
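For context, the linear probing mentioned above means fitting a linear classifier on frozen visual features. The sketch below illustrates that standard evaluation protocol in generic terms; the `visual_encoder` interface, feature dimension, and optimizer settings are assumptions for illustration, not details from the paper.

```python
import torch
import torch.nn as nn


def linear_probe(visual_encoder, train_loader, feat_dim, num_classes, epochs=10):
    """Fit a linear classifier on frozen features (standard linear-probe
    evaluation; all hyperparameters here are illustrative)."""
    visual_encoder.eval()
    for p in visual_encoder.parameters():
        p.requires_grad_(False)                 # the learned encoder stays frozen
    probe = nn.Linear(feat_dim, num_classes)
    opt = torch.optim.AdamW(probe.parameters(), lr=1e-3)
    ce = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for images, labels in train_loader:
            with torch.no_grad():
                feats = visual_encoder(images)  # assumed to return (B, feat_dim)
            loss = ce(probe(feats), labels)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return probe
```

Because only the linear layer is trained, probe accuracy directly reflects how linearly separable the frozen MDC features are, which is why it is a common yardstick for comparing representation-learning objectives.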
Business Value
Enables the development of more powerful AI models for image understanding and generation, leading to better image search, content moderation, and automated image description services.