Discovering intrinsically well-aligned cross-modal corpora for pretraining
Identify and construct large-scale datasets that provide intrinsically well-aligned cross-modal supervision (for example, instructional videos whose visual content is tightly aligned with the spoken narration) to enable effective Transformer-based multimodal pretraining without incurring prohibitive labeling and alignment costs.
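One common way to operationalize "intrinsic alignment" when mining such corpora is to embed each clip's visual frames and its ASR transcript with pretrained encoders and keep only clips where the two embeddings agree. The sketch below is a minimal, hypothetical illustration of that filtering step: the embeddings are synthetic stand-ins, and the function name, threshold, and inputs are assumptions, not anything prescribed by the survey.

```python
import numpy as np


def cosine_sim(a, b):
    # Cosine similarity between two vectors (small epsilon avoids
    # division by zero for degenerate embeddings).
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))


def filter_aligned_clips(visual_embs, text_embs, threshold=0.3):
    """Return indices of clips whose visual and transcript embeddings agree.

    visual_embs, text_embs: (n_clips, d) arrays, assumed to come from some
    pair of pretrained encoders (e.g. frame features and ASR-transcript
    features projected into a shared space -- hypothetical inputs here).
    """
    keep = []
    for i, (v, t) in enumerate(zip(visual_embs, text_embs)):
        if cosine_sim(v, t) >= threshold:
            keep.append(i)
    return keep


# Toy demo with synthetic embeddings: clip 0 is "aligned" (transcript
# embedding is a lightly perturbed copy of the visual one), clip 1 is
# "misaligned" (transcript embedding points the opposite way).
rng = np.random.default_rng(0)
v = rng.normal(size=(2, 8))
t = np.vstack([v[0] + 0.05 * rng.normal(size=8), -v[1]])
print(filter_aligned_clips(v, t))  # -> [0]
```

In practice the threshold would be tuned on held-out data, and the scoring model itself (e.g. a contrastively pretrained dual encoder) is part of what makes curation at scale hard, since a weak scorer simply reproduces its own biases in the "aligned" subset.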
References
How to look for more corpora that intrinsically have well-aligned cross-modal supervision, such as instructional videos, is still an open problem.
— Multimodal Learning with Transformers: A Survey
(2206.06488 - Xu et al., 2022) in Discussion under Subsubsection "Task-Agnostic Multimodal Pretraining" (Section 4.1.1)