Unified multimodal Transformer pipelines for discriminative and generative tasks

Develop unified Transformer-based multimodal pretraining architectures that directly support both discriminative tasks (such as visual question answering and retrieval) and generative tasks (such as captioning), avoiding the need for separate decoders and mitigating pretrain–finetune discrepancies.

Background

BERT-style cross-modal pretraining models excel at downstream understanding tasks but typically cannot be applied directly to generative tasks. For example, VideoBERT and CBT must train a separate video-to-text decoder for captioning, which exposes a gap between encoder-only designs and generation workflows.

A unified pipeline that seamlessly handles both task types would reduce architectural fragmentation and improve consistency between pretraining and fine-tuning.
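One way to picture such a unified pipeline is a single Transformer whose attention mask is switched per task: fully bidirectional attention over image and text for discriminative heads (e.g., VQA as answer classification), and a prefix-LM mask for autoregressive captioning. The sketch below is a minimal illustration of this idea, not the survey's or any cited model's implementation; all module names, feature dimensions, and hyperparameters are illustrative assumptions.

```python
# Minimal sketch (assumptions, not a cited architecture): one Transformer serving both
# discriminative and generative tasks by switching the attention mask and output head.
import torch
import torch.nn as nn


class UnifiedMultimodalTransformer(nn.Module):
    def __init__(self, vocab_size=30522, num_answers=3129, d_model=768,
                 nhead=12, num_layers=6, max_text_len=32, num_patches=196):
        super().__init__()
        self.text_embed = nn.Embedding(vocab_size, d_model)
        self.visual_proj = nn.Linear(2048, d_model)      # project region/patch features
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + max_text_len, d_model))
        layer = nn.TransformerEncoderLayer(d_model, nhead,
                                           dim_feedforward=4 * d_model, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)
        self.cls_head = nn.Linear(d_model, num_answers)  # discriminative head (e.g., VQA)
        self.lm_head = nn.Linear(d_model, vocab_size)    # generative head (captioning)

    def _prefix_lm_mask(self, n_vis, n_txt, device):
        # Visual prefix attends bidirectionally to itself; text tokens attend to the
        # prefix and to earlier text tokens only (causal), enabling autoregressive decoding.
        n = n_vis + n_txt
        mask = torch.zeros(n, n, dtype=torch.bool, device=device)   # True = blocked
        causal = torch.triu(torch.ones(n_txt, n_txt, dtype=torch.bool, device=device), 1)
        mask[n_vis:, n_vis:] = causal    # text-to-text attention is causal
        mask[:n_vis, n_vis:] = True      # visual prefix does not peek at text
        return mask

    def forward(self, visual_feats, text_ids, task="discriminative"):
        v = self.visual_proj(visual_feats)                           # (B, n_vis, d)
        t = self.text_embed(text_ids)                                # (B, n_txt, d)
        x = torch.cat([v, t], dim=1) + self.pos_embed[:, : v.size(1) + t.size(1)]
        if task == "generative":
            mask = self._prefix_lm_mask(v.size(1), t.size(1), x.device)
            h = self.encoder(x, mask=mask)
            return self.lm_head(h[:, v.size(1):])                    # next-token logits
        h = self.encoder(x)                                          # bidirectional attention
        return self.cls_head(h[:, 0])                                # pool first visual token


model = UnifiedMultimodalTransformer()
feats = torch.randn(2, 196, 2048)                     # e.g., detector or patch features
ids = torch.randint(0, 30522, (2, 16))
vqa_logits = model(feats, ids, task="discriminative")  # (2, num_answers)
cap_logits = model(feats, ids, task="generative")      # (2, 16, vocab_size)
```

Because the same encoder weights and objectives can be reused across both paths, fine-tuning for captioning does not require bolting on a separately trained decoder, which is the pretrain-finetune consistency the open problem asks for.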

References

Therefore, how to design more unified pipelines that can work for both discriminative and generative down-stream tasks is also an open problem to be solved.

Multimodal Learning with Transformers: A Survey (2206.06488 - Xu et al., 2022) in Discussion under Subsubsection "Task-Agnostic Multimodal Pretraining" (Section 4.1.1)