DiffVC: A Non-autoregressive Framework Based on Diffusion Model for Video Captioning

Published 9 Apr 2026 in cs.CV | (2604.08084v1)

Abstract: Current video captioning methods usually use an encoder-decoder structure to generate text autoregressively. However, autoregressive methods have inherent limitations such as slow generation speed and large cumulative error. Furthermore, the few non-autoregressive counterparts suffer from deficiencies in generation quality due to the lack of sufficient multimodal interaction modeling. Therefore, we propose a non-autoregressive framework based on Diffusion model for Video Captioning (DiffVC) to address these issues. Its parallel decoding can effectively solve the problems of generation speed and cumulative error. At the same time, our proposed discriminative conditional Diffusion Model can generate higher-quality textual descriptions. Specifically, we first encode the video into a visual representation. During training, Gaussian noise is added to the textual representation of the ground-truth caption. Then, a new textual representation is generated via the discriminative denoiser with the visual representation as a conditional constraint. Finally, we input the new textual representation into a non-autoregressive LLM to generate captions. During inference, we directly sample noise from the Gaussian distribution for generation. Experiments on MSVD, MSR-VTT, and VATEX show that our method can outperform previous non-autoregressive methods and achieve comparable performance to autoregressive methods, e.g., it achieved a maximum improvement of 9.9 on the CIDEr and improvement of 2.6 on the B@4, while having faster generation speed. The source code will be available soon.

Abstract PDF Upgrade to Chat

Authors (7)

Summary

The paper presents a discriminative diffusion paradigm that enhances non-autoregressive video captioning by mitigating slow inference and error propagation.
It employs a three-stage process combining video and text encoding, noise-based discriminative denoising with cross-modal attention, and transformer-driven parallel decoding.
Experimental results demonstrate improvements up to +9.9 CIDEr and +2.6 BLEU@4 on benchmarks like MSR-VTT, highlighting its practical efficiency and quality.

DiffVC: A Diffusion-Based Non-Autoregressive Framework for Video Captioning

Motivation and Background

Video captioning operations traditionally employ autoregressive paradigms, characterized by sequential word generation conditioned on previous tokens and visual representations. While autoregression yields semantically coherent captions, it is hampered by slow generation speeds—particularly for longer sequences—and accumulative error propagation. Existing non-autoregressive captioners offer faster inference by generating all words in parallel, but suffer from semantic incompleteness and diminished generation quality, largely due to insufficient modeling of inter-modal interactions between vision and language.

Diffusion models have demonstrated efficacy in generative tasks involving continuous data modalities (e.g., images, audio), outperforming GANs and VAEs in stability and output diversity. However, their application to discrete sequence data has remained limited. DiffVC addresses this gap by introducing a discriminative conditional diffusion paradigm for non-autoregressive video captioning, optimizing both inference speed and generation quality.

Methodological Overview

DiffVC consists of a three-stage pipeline:

Video and Text Representation Encoding: The visual encoder, pre-trained on RSFD, transforms the input video into a spatiotemporal embedding. Simultaneously, the text encoder maps the ground-truth caption to a continuous textual representation.
Diffusion and Discriminative Denoising: Gaussian noise is progressively added during training to textual embeddings, producing a temporal sequence of noisy representations. A discriminative denoiser, leveraging cross-attention between noisy textual queries and visual-key/value pairs, reconstructs noise-free textual representations while explicitly preserving conditional semantics.

(Figure 1)

Figure 1: DiffVC overall architecture including discriminative denoising and non-autoregressive decoding.

Non-Autoregressive Language Modeling: The denoised textual representation is decoded via a transformer-based non-autoregressive LLM, generating all tokens in parallel, constrained by maximum sequence length.

Inference deviates from the training path: random noise sampled from a Gaussian distribution (in absence of ground-truth captions) is fed to the discriminative denoiser, which, conditioned on the visual embedding, generates an initial textual representation for the language decoder.

Architectural Innovations

The discriminative denoiser is central to DiffVC’s performance. Unlike classic concatenation-based cross-attention (e.g., [CLS]+text tokens), DiffVC employs distinct pathways for queries (text) and keys/values (visual), thus avoiding dilution of visual context and enhancing cross-modal alignment. Ablation studies reveal that the removal of the discriminative denoiser leads to a marked decrease in generation metrics.

Additionally, the non-autoregressive LLM is crucial. An ablation where denoiser outputs are projected directly to the vocabulary layer yields significantly inferior captions, validating the necessity of deep transformer refinement for parallel decoding.

Experimental Analysis

Quantitative Results

DiffVC achieves state-of-the-art results among non-autoregressive methods across MSVD, MSR-VTT, and VATEX benchmarks, with performance comparable to the best autoregressive architectures:

On MSR-VTT, DiffVC surpasses several autoregressive methods with CIDEr up to 56.7 and BLEU@4 up to 44.5.
Maximum improvement over existing non-autoregressive methods reaches +9.9 CIDEr and +2.6 BLEU@4.
DiffVC substantially improves generation quality for long sequences while exhibiting notable speed advantages.

(Figure 2)

Figure 2: Generation speed comparison between non-autoregressive and autoregressive captioners.

Figure 3: Comparison on generation quality between non-autoregressive and autoregressive video captioning.

Qualitative Results

Case studies demonstrate DiffVC's ability to generate semantically complete and precise descriptions, identifying low-frequency visual entities and maintaining fidelity across differentiable semantic concepts (e.g., distinguishing between "girl" and "woman", as shown in the qualitative samples).

(Figure 4)

Figure 4: Case study on MSR-VTT, highlighting DiffVC’s semantic completeness and accuracy.

Ablation Findings

Discriminative Denoiser: Its removal degrades performance, confirming its necessity for robust inter-modal modeling.
Denoiser Depth / LM Blocks: Optimal performance is achieved with 12 denoiser blocks and 6 LM blocks; exceeding these values causes diminishing returns.
Inference Steps: Increasing timesteps improves output quality but at computational cost, with 20 steps set as optimal balance.

Practical and Theoretical Implications

Practically, DiffVC enables fast, high-quality caption generation suitable for real-time applications in video retrieval, recommendation, and automated content annotation. Theoretically, it establishes a robust pathway for diffusion models in discrete sequence generation, showcasing the benefit of discriminative cross-modal denoising in overcoming semantic deficiencies commonly associated with non-autoregressive methods. The paradigm provides a template for scaling diffusion-based language generation tasks beyond captioning.

Speculation on Future Directions

Extension to Multilingual Captioning: Incorporating multilingual decoders may leverage DiffVC in cross-lingual settings.
Hierarchical or Multi-stage Denoising: More granular modeling of temporal dependencies (e.g., hierarchical visual conditioning) could further boost semantic correctness.
Integration with LLMs: Coupling DiffVC with large pre-trained transformers could enhance generalization, domain adaptation, and robustness.
Application to Other Discrete Generation Tasks: Expansion to dialogue generation, text-to-video synthesis, and story generation can exploit DiffVC's denoising architecture.

Conclusion

DiffVC introduces a discriminative diffusion-based non-autoregressive video captioning system, resolving the speed and error accumulation challenges of autoregressive paradigms while preserving generation quality through explicit cross-modal modeling. Its architecture and empirical results validate the diffusion paradigm for discrete sequence tasks constrained by multimodal data, opening new avenues for efficient and semantically rich video-language generation in practical deployments.

Markdown Report Issue