- The paper introduces Masked Diffusion Captioning (MDC), an image-conditioned masked diffusion language modeling objective that overcomes the positional asymmetry of autoregressive captioning by providing consistent visual supervision at every token position.
- The methodology combines a vision encoder with a Transformer decoder using a randomized masking ratio and unified noise schedule, achieving competitive performance on benchmarks like ImageNet and MSCOCO.
- Implications include simplified training and tuning, robust transferability to downstream tasks, and enhanced compositional reasoning in visual recognition and captioning.
Masked Diffusion Captioning for Visual Feature Learning
Introduction and Motivation
The paper introduces Masked Diffusion Captioning (MDC), a novel approach for visual feature learning via image-conditioned masked diffusion LLMs. The motivation stems from the limitations of autoregressive captioning objectives, which provide an asymmetric learning signal: as the sequence progresses, the model increasingly relies on previously generated tokens rather than the image, diminishing the utility of the visual input for later tokens. MDC addresses this by leveraging masked diffusion language modeling, where the masking ratio is randomly sampled per training instance, ensuring a position-independent and consistent visual supervision signal across all tokens.
Figure 1: Learning visual features by masked diffusion language modeling. Visual features are learned by captioning images using an image-conditioned masked diffusion LLM, and the encoder's features are transferable to downstream computer vision tasks.
Methodology
Masked Diffusion Language Modeling
MDC builds upon the framework of masked diffusion language models (MDLMs), which generalize BERT-style masked language modeling into a generative process. In MDLMs, a forward process gradually corrupts the input sequence by masking tokens according to a noise schedule parameterized by a noise level t. The reverse process reconstructs the original sequence, and the model is trained to maximize the likelihood of the clean sequence given the corrupted input.
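A minimal sketch of the forward (noising) step is shown below. It assumes a per-sequence masking ratio t and a placeholder `mask_id`, and is an illustration of the general MDLM corruption process rather than the paper's exact implementation.

```python
import torch

def forward_mask(tokens, t, mask_id):
    """Forward (noising) process: mask each caption token independently with
    probability t, where a separate t is drawn for each sequence.

    tokens:  (batch, seq_len) long tensor of caption token ids
    t:       (batch,) masking ratios in (0, 1]
    mask_id: id of the [MASK] token (placeholder value)
    """
    mask = torch.rand_like(tokens, dtype=torch.float) < t.unsqueeze(1)
    corrupted = tokens.masked_fill(mask, mask_id)
    return corrupted, mask

# Example usage with hypothetical ids: t is sampled uniformly per example.
tokens = torch.randint(5, 1000, (4, 32))   # batch of 4 captions, length 32
t = torch.rand(4).clamp_min(1e-3)          # avoid t == 0 (nothing masked)
corrupted, mask = forward_mask(tokens, t, mask_id=0)
```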
Image-Conditioned Captioning Architecture
The MDC architecture consists of a vision encoder f_ϕ and a Transformer-based text decoder g_ψ. The encoder extracts visual features from the input image, which are fused into the decoder via cross-attention. During training, a random masking ratio t is sampled, and the corresponding tokens in the caption are masked. The decoder is then tasked with reconstructing the original caption, conditioned on both the unmasked tokens and the visual features.
Figure 2: Learning visual features using masked diffusion captioning. (a) Training involves random masking of caption tokens and reconstruction via a visual feature-conditioned decoder. (b) Sampling starts from a fully masked sequence and iteratively denoises to produce a full caption.
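The following sketch shows how such an image-conditioned decoder could be wired up in PyTorch. The layer sizes, the learned positional embedding, and the use of `nn.TransformerDecoder` (whose layers already contain cross-attention) are assumptions for illustration, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class MaskedDiffusionCaptioner(nn.Module):
    """Sketch of f_ϕ (vision encoder) + g_ψ (text decoder), fused via
    the cross-attention built into nn.TransformerDecoder layers."""

    def __init__(self, vision_encoder, vocab_size, d_model=512,
                 num_layers=6, num_heads=8, max_len=64):
        super().__init__()
        self.vision_encoder = vision_encoder                      # f_ϕ
        self.token_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Parameter(torch.zeros(1, max_len, d_model))
        layer = nn.TransformerDecoderLayer(d_model, num_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers)   # g_ψ
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, corrupted_tokens, images):
        # Assumes the encoder returns patch features of width d_model.
        visual_feats = self.vision_encoder(images)   # (batch, n_patches, d_model)
        x = self.token_emb(corrupted_tokens)
        x = x + self.pos_emb[:, : x.shape[1]]
        # No causal mask: every position attends bidirectionally to the
        # partially masked caption and, via cross-attention, to the image.
        h = self.decoder(tgt=x, memory=visual_feats)
        return self.lm_head(h)                       # per-token logits
```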
The loss function incorporates a scaling factor dependent on t, which weights the cross-entropy loss for each masked token. This design ensures that the model learns to reconstruct captions from varying degrees of corruption, promoting robust feature learning.
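A sketch of the weighted objective is below. It assumes the common linear-schedule form in which each example's masked-token cross-entropy is scaled by 1/t; the exact scaling factor in the paper follows its chosen noise schedule.

```python
import torch
import torch.nn.functional as F

def mdc_loss(logits, clean_tokens, mask, t):
    """Cross-entropy over masked positions, scaled by a t-dependent factor.

    logits:       (batch, seq_len, vocab_size)
    clean_tokens: (batch, seq_len) ground-truth caption ids
    mask:         (batch, seq_len) bool, True where the token was masked
    t:            (batch,) sampled masking ratios
    """
    ce = F.cross_entropy(logits.transpose(1, 2), clean_tokens, reduction="none")
    ce = ce * mask.float()                    # only masked tokens contribute
    # Assumed linear-schedule weighting (1/t), which follows the MDLM ELBO and
    # roughly normalizes for the expected number of masked tokens (~ t * seq_len).
    return (ce.sum(dim=1) / t.clamp_min(1e-6)).mean()
```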
Sampling and Decoding
At inference, MDC employs a greedy denoising strategy: starting from a fully masked sequence, the model iteratively reveals the token with the highest confidence at each step, until all tokens are unmasked. This approach avoids the numerical instability associated with Gumbel-based categorical sampling and ensures efficient caption generation.
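The decoding loop can be sketched as follows; the `model(tokens, images)` interface and the one-token-per-step default mirror the description above but are otherwise illustrative.

```python
import torch

@torch.no_grad()
def greedy_denoise(model, images, seq_len, mask_id, tokens_per_step=1):
    """Start from a fully masked caption and iteratively commit the most
    confident prediction among the still-masked positions."""
    batch = images.shape[0]
    device = images.device
    tokens = torch.full((batch, seq_len), mask_id, dtype=torch.long, device=device)
    still_masked = torch.ones(batch, seq_len, dtype=torch.bool, device=device)
    rows = torch.arange(batch, device=device).unsqueeze(1)
    while still_masked.any():
        logits = model(tokens, images)                # (batch, seq_len, vocab)
        conf, pred = logits.softmax(dim=-1).max(dim=-1)
        conf = conf.masked_fill(~still_masked, -1.0)  # ignore revealed positions
        idx = conf.topk(tokens_per_step, dim=1).indices
        tokens[rows, idx] = pred[rows, idx]           # reveal most confident token(s)
        still_masked[rows, idx] = False
    return tokens
```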
Experimental Evaluation
Datasets and Pretraining
MDC is pretrained on several large-scale vision-language datasets, including CC3M, CC12M, and Recap-DataComp. Caption length distributions are visualized to highlight the diversity and richness of the textual descriptions.
Figure 3: Dataset caption length distribution for CC3M, CC12M, and Recap-DataComp, illustrating the variability in caption lengths across datasets.
Linear Probing for Visual Feature Quality
Visual features learned by MDC are evaluated via linear probing on standard recognition benchmarks (ImageNet-1k, Food101, CIFAR-10/100, Pets). MDC consistently achieves accuracy competitive with autoregressive captioning and contrastive methods (e.g., CLIP), especially when trained on datasets with rich textual descriptions. Notably, MDC's accuracy continues to improve as the number of image-text pairs grows, indicating favorable scaling properties.
Figure 4: Linear probing performance with varying numbers of image–text pairs, showing improved accuracy on IN-1K as dataset size increases.
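For reference, a standard linear-probing protocol on frozen encoder features might look like the sketch below; the mean-pooling of patch tokens and the optimizer settings are assumptions rather than the paper's exact evaluation setup.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def train_linear_probe(encoder, loader, num_classes, feat_dim,
                       epochs=10, lr=1e-3, device="cuda"):
    """Fit a linear classifier on frozen encoder features (standard
    linear-probing protocol; pooling and optimizer are assumptions)."""
    encoder.eval().to(device)
    probe = nn.Linear(feat_dim, num_classes).to(device)
    opt = torch.optim.AdamW(probe.parameters(), lr=lr)
    for _ in range(epochs):
        for images, labels in loader:
            images, labels = images.to(device), labels.to(device)
            with torch.no_grad():
                feats = encoder(images).mean(dim=1)  # mean-pool patch tokens
            loss = F.cross_entropy(probe(feats), labels)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return probe
```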
Comparison with Masked LLM Variants
MDC is compared against image-conditioned BERT models with fixed masking ratios. While BERT with high masking ratios can approach MDC's performance, MDC's unified time-based schedule obviates the need for dataset-specific tuning and consistently yields robust results.
Figure 5: Comparison to image-conditioned BERT with different masking ratios. MDC avoids the need for masking ratio tuning and achieves superior or comparable performance.
Vision-Language Compositionality
MDC's ability to match images to captions is evaluated on compositionality benchmarks (ARO, SugarCrepe). Using the model's likelihood estimates or a heuristic denoising-based score, MDC outperforms both CLIP and autoregressive captioning in compositionality tasks, indicating that the learned representations capture complex image-text relationships beyond simple bag-of-words matching.
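One way to realize such a denoising-based matching score is sketched below: the caption's masked tokens are reconstructed under several random masking draws, and their average log-probability serves as an image-caption compatibility score. The averaging scheme is an assumption, not necessarily the paper's exact heuristic.

```python
import torch

@torch.no_grad()
def caption_score(model, images, captions, mask_id, n_draws=8):
    """Heuristic image-caption compatibility: average log-probability of
    reconstructing masked caption tokens over several random masking draws.
    Higher scores indicate a better match."""
    score = torch.zeros(captions.shape[0], device=captions.device)
    for _ in range(n_draws):
        t = torch.rand(captions.shape[0], device=captions.device).clamp_min(1e-3)
        mask = torch.rand_like(captions, dtype=torch.float) < t.unsqueeze(1)
        corrupted = captions.masked_fill(mask, mask_id)
        logits = model(corrupted, images)
        logp = logits.log_softmax(dim=-1).gather(-1, captions.unsqueeze(-1)).squeeze(-1)
        score += (logp * mask.float()).sum(dim=1) / mask.float().sum(dim=1).clamp_min(1)
    return score / n_draws
```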
Image Captioning Quality
MDC is finetuned for image captioning on MSCOCO and Flickr30k. Despite the constraint of a fixed output sequence length, MDC generates coherent and descriptive captions, as confirmed by both automatic metrics and LLM-based preference evaluations.
Figure 6: Examples of captioning results from MSCOCO Karpathy-test split, illustrating MDC's ability to generate captions of varying lengths.
Analysis of Design Choices
Ablation studies confirm the necessity of the t-dependent loss scaling factor for effective feature learning. The choice of noise schedule, i.e., the interval from which the masking ratio is sampled, is also critical: intervals with a higher lower bound (e.g., t ∈ [0.5, 1.0]) yield more stable training and better performance, especially on datasets with short captions (see the sketch below).
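Restricting the sampled masking ratio to an interval is straightforward; the helper below is an illustrative snippet, with the [0.5, 1.0] default taken from the ablation discussion.

```python
import torch

def sample_masking_ratio(batch_size, t_min=0.5, t_max=1.0, device="cpu"):
    """Draw per-example masking ratios uniformly from [t_min, t_max].
    The [0.5, 1.0] default reflects the ablation finding that a higher lower
    bound stabilizes training, especially on short-caption datasets."""
    return t_min + (t_max - t_min) * torch.rand(batch_size, device=device)
```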
Implications and Future Directions
MDC demonstrates that masked diffusion language modeling is a viable and competitive alternative to autoregressive and contrastive approaches for visual representation learning. The position-independent supervision signal and unified noise schedule simplify training and tuning, while enabling robust feature extraction and compositionality. The method scales favorably with dataset size and is adaptable to various encoder architectures.
Practically, MDC can be integrated into multimodal systems for tasks requiring strong visual representations, such as zero-shot recognition, compositional reasoning, and image captioning. Theoretically, the work suggests that generative masked modeling objectives can match or surpass discriminative and autoregressive paradigms in vision-language learning, especially when equipped with flexible noise schedules and efficient sampling strategies.
Future research may explore scaling MDC to larger model and dataset sizes, extending the approach to joint vision-language modeling (e.g., early fusion), and investigating its applicability to other modalities and generative tasks. Further analysis of the trade-offs between caption length, masking schedule, and downstream performance will be valuable for optimizing MDC in practical deployments.
Conclusion
Masked Diffusion Captioning provides a principled and effective framework for learning visual features from image-caption pairs. By leveraging masked diffusion language modeling with a unified noise schedule, MDC achieves competitive performance in visual recognition, compositionality, and caption generation, while simplifying training and hyperparameter tuning. The approach offers a promising direction for future multimodal representation learning research and applications.