Multimodal C4: An Open, Billion-scale Corpus of Images Interleaved with Text (2304.06939v3)
Abstract: In-context vision and language models like Flamingo support arbitrarily interleaved sequences of images and text as input. This format not only enables few-shot learning via interleaving independent supervised (image, text) examples, but also more complex prompts involving interaction between images, e.g., "What do image A and image B have in common?" To support this interface, pretraining occurs over web corpora that similarly contain interleaved images+text. To date, however, large-scale data of this form have not been publicly available. We release Multimodal C4, an augmentation of the popular text-only C4 corpus with images interleaved. We use a linear assignment algorithm to place images into longer bodies of text using CLIP features, a process that we show outperforms alternatives. Multimodal C4 spans everyday topics like cooking, travel, technology, etc. A manual inspection of a random sample of documents shows that a vast majority (88%) of images are topically relevant, and that linear assignment frequently selects individual sentences specifically well-aligned with each image (80%). After filtering NSFW images, ads, etc., the resulting corpus consists of 101.2M documents with 571M images interleaved in 43B English tokens.
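The abstract's placement step, assigning each image to a sentence within a document by maximizing CLIP similarity under a one-to-one matching constraint, can be sketched as a small bipartite assignment problem. The snippet below is a minimal illustration, not the authors' exact pipeline: the function name `assign_images_to_sentences` and the embedding shapes are assumptions, and it uses SciPy's `linear_sum_assignment` as the matching solver.

```python
# Minimal sketch of image-to-sentence placement via linear assignment over
# CLIP features (illustrative only; not the released MMC4 pipeline).
import numpy as np
from scipy.optimize import linear_sum_assignment


def assign_images_to_sentences(image_feats: np.ndarray,
                               sentence_feats: np.ndarray) -> list[tuple[int, int]]:
    """Return (image_index, sentence_index) pairs maximizing total similarity.

    image_feats:    (n_images, d)   L2-normalized CLIP image embeddings
    sentence_feats: (n_sentences, d) L2-normalized CLIP text embeddings
    """
    # Cosine similarity matrix; rows are images, columns are candidate sentences.
    sim = image_feats @ sentence_feats.T
    # linear_sum_assignment minimizes cost, so negate similarities to maximize.
    img_idx, sent_idx = linear_sum_assignment(-sim)
    return list(zip(img_idx.tolist(), sent_idx.tolist()))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    imgs = rng.normal(size=(3, 512))
    sents = rng.normal(size=(10, 512))
    imgs /= np.linalg.norm(imgs, axis=1, keepdims=True)
    sents /= np.linalg.norm(sents, axis=1, keepdims=True)
    print(assign_images_to_sentences(imgs, sents))
```

Because each document typically has more sentences than images, the rectangular assignment leaves most sentences unmatched and places each image next to its single best-matching sentence without reusing any sentence twice.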
- Wanrong Zhu (30 papers)
- Jack Hessel (50 papers)
- Anas Awadalla (12 papers)
- Samir Yitzhak Gadre (12 papers)
- Jesse Dodge (45 papers)
- Alex Fang (13 papers)
- Youngjae Yu (72 papers)
- Ludwig Schmidt (80 papers)
- William Yang Wang (254 papers)
- Yejin Choi (287 papers)