Discovering intrinsically well-aligned cross-modal corpora for pretraining
Identify and construct large-scale datasets that provide intrinsically well-aligned cross-modal supervision (for example, instructional videos whose visual content is tightly aligned with the spoken narration) to enable effective Transformer-based multimodal pretraining without incurring prohibitive labeling and alignment costs.
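One common way to operationalize "intrinsic alignment" when mining such corpora is to embed each clip's visual frames and its ASR transcript with pretrained encoders and keep only clips where the two embeddings agree. The sketch below is a minimal, hypothetical illustration of that filtering step: the embeddings are synthetic stand-ins, and the function name, threshold, and inputs are assumptions, not anything prescribed by the survey.

```python
import numpy as np


def cosine_sim(a, b):
    # Cosine similarity between two vectors (small epsilon avoids
    # division by zero for degenerate embeddings).
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))


def filter_aligned_clips(visual_embs, text_embs, threshold=0.3):
    """Return indices of clips whose visual and transcript embeddings agree.

    visual_embs, text_embs: (n_clips, d) arrays, assumed to come from some
    pair of pretrained encoders (e.g. frame features and ASR-transcript
    features projected into a shared space -- hypothetical inputs here).
    """
    keep = []
    for i, (v, t) in enumerate(zip(visual_embs, text_embs)):
        if cosine_sim(v, t) >= threshold:
            keep.append(i)
    return keep


# Toy demo with synthetic embeddings: clip 0 is "aligned" (transcript
# embedding is a lightly perturbed copy of the visual one), clip 1 is
# "misaligned" (transcript embedding points the opposite way).
rng = np.random.default_rng(0)
v = rng.normal(size=(2, 8))
t = np.vstack([v[0] + 0.05 * rng.normal(size=8), -v[1]])
print(filter_aligned_clips(v, t))  # -> [0]
```

In practice the threshold would be tuned on held-out data, and the scoring model itself (e.g. a contrastively pretrained dual encoder) is part of what makes curation at scale hard, since a weak scorer simply reproduces its own biases in the "aligned" subset.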
References
How to look for more corpora that intrinsically have well-aligned cross-modal supervision, such as instructional videos, is still an open problem.
— Multimodal Learning with Transformers: A Survey
(2206.06488 - Xu et al., 2022) in Discussion under Subsubsection "Task-Agnostic Multimodal Pretraining" (Section 4.1.1)