Unified multimodal Transformer pipelines for discriminative and generative tasks
Develop unified Transformer-based multimodal pretraining architectures that directly support both discriminative tasks (such as visual question answering and retrieval) and generative tasks (such as captioning), avoiding the need for separate decoders and mitigating pretrain–finetune discrepancies.
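To make the architectural idea concrete, below is a minimal sketch (not taken from the survey) of one possible unified design: a single shared Transformer backbone over concatenated image-patch and text-token embeddings, with a classification head for a discriminative task (e.g., VQA answer selection) and a language-modeling head for a generative task (e.g., captioning). All module names, dimensions, and the bidirectional-over-vision / causal-over-text masking scheme are illustrative assumptions, not the method of any specific model in the survey.

```python
import torch
import torch.nn as nn


class UnifiedMultimodalTransformer(nn.Module):
    """Sketch of a shared backbone serving discriminative and generative heads."""

    def __init__(self, vocab_size=30522, d_model=256, n_heads=4,
                 n_layers=4, n_answers=3129, patch_dim=768):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, d_model)
        self.patch_proj = nn.Linear(patch_dim, d_model)       # visual features -> shared space
        self.modality_emb = nn.Embedding(2, d_model)          # 0 = vision, 1 = text
        layer = nn.TransformerEncoderLayer(d_model, n_heads,
                                           dim_feedforward=4 * d_model,
                                           batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, n_layers)  # shared for both task types
        self.cls_head = nn.Linear(d_model, n_answers)           # discriminative: answer logits
        self.lm_head = nn.Linear(d_model, vocab_size)           # generative: next-token logits

    def encode(self, patches, tokens, causal_text=False):
        v = self.patch_proj(patches) + self.modality_emb.weight[0]
        t = self.token_emb(tokens) + self.modality_emb.weight[1]
        x = torch.cat([v, t], dim=1)
        mask = None
        if causal_text:
            # Bidirectional attention over image patches, causal over text tokens,
            # so the same backbone can be decoded autoregressively for captioning.
            L_v, L_t = v.size(1), t.size(1)
            L = L_v + L_t
            mask = torch.zeros(L, L, dtype=torch.bool, device=patches.device)
            causal = torch.triu(torch.ones(L_t, L_t, dtype=torch.bool,
                                           device=patches.device), diagonal=1)
            mask[L_v:, L_v:] = causal   # text cannot attend to future text
            mask[:L_v, L_v:] = True     # image positions do not attend to text
        return self.backbone(x, mask=mask)

    def forward_vqa(self, patches, question_tokens):
        h = self.encode(patches, question_tokens, causal_text=False)
        return self.cls_head(h.mean(dim=1))                     # pooled answer logits

    def forward_caption(self, patches, caption_tokens):
        h = self.encode(patches, caption_tokens, causal_text=True)
        L_t = caption_tokens.size(1)
        return self.lm_head(h[:, -L_t:, :])                     # per-token vocabulary logits


if __name__ == "__main__":
    model = UnifiedMultimodalTransformer()
    patches = torch.randn(2, 49, 768)                           # e.g., a 7x7 grid of region features
    question = torch.randint(0, 30522, (2, 12))
    caption = torch.randint(0, 30522, (2, 16))
    print(model.forward_vqa(patches, question).shape)           # torch.Size([2, 3129])
    print(model.forward_caption(patches, caption).shape)        # torch.Size([2, 16, 30522])
```

The point of the sketch is that only the output heads and the attention mask differ between the two task families; the pretrained backbone is reused unchanged, which is one way the pretrain-finetune discrepancy the survey raises could be reduced.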
References
Therefore, how to design more unified pipelines that can work for both discriminative and generative down-stream tasks is also an open problem to be solved.
— Multimodal Learning with Transformers: A Survey
(arXiv:2206.06488, Xu et al., 2022), Discussion under Section 4.1.1, "Task-Agnostic Multimodal Pretraining"