M6: A Chinese Multimodal Pretrainer
This paper presents the development and evaluation of M6, a large-scale Chinese multimodal pretraining model. The authors construct the largest Chinese multimodal pretraining dataset to date, comprising over 1.9TB of images and 292GB of text. This corpus, named M6-Corpus, spans multiple domains, including encyclopedia entries, forum discussions, and e-commerce data, and supports both single-modality and cross-modality pretraining.
M6 employs a pretraining approach designed to process large amounts of data across modalities within a unified model architecture. The model is scaled to 10 billion parameters (M6-10B) and 100 billion parameters (M6-100B). The M6-100B variant uses a Mixture-of-Experts (MoE) architecture with sparse activation: each input token is routed to only a small subset of the experts, so the parameter count grows far faster than the per-token computation, making training at this scale tractable.
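To make the sparse-activation idea concrete, here is a minimal PyTorch sketch of a Mixture-of-Experts feed-forward layer with top-k gating. The expert count, layer sizes, and top-1 default below are illustrative assumptions for exposition, not M6-100B's actual configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoEFeedForward(nn.Module):
    """Feed-forward block whose parameters are split across experts;
    a learned gate routes each token to its top-k experts, so only a
    small fraction of the weights is active per token."""

    def __init__(self, d_model, d_hidden, num_experts, top_k=1):
        super().__init__()
        self.top_k = top_k
        self.gate = nn.Linear(d_model, num_experts)  # router
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(),
                          nn.Linear(d_hidden, d_model))
            for _ in range(num_experts))

    def forward(self, x):
        # x: (tokens, d_model)
        scores = F.softmax(self.gate(x), dim=-1)            # (tokens, experts)
        weights, indices = scores.topk(self.top_k, dim=-1)  # (tokens, top_k)
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = indices[:, k] == e          # tokens routed to expert e
                if mask.any():
                    out[mask] += weights[mask, k, None] * expert(x[mask])
        return out

# With 8 experts and top-1 routing, each token touches ~1/8 of the expert
# parameters, which is the sparse activation the paper relies on at scale.
layer = MoEFeedForward(d_model=64, d_hidden=256, num_experts=8)
print(layer(torch.randn(16, 64)).shape)  # torch.Size([16, 64])
```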
The paper presents several key contributions:
- Dataset Construction: The M6-Corpus is introduced as a substantial resource for Chinese multimodal research, encompassing both plain text and image-text pairs and establishing a benchmark for future studies.
- Model Architecture: M6 integrates encoder and decoder functionality into a single transformer-based framework, so the same network performs tasks such as text-to-text and image-to-text generation with shared parameters, improving training efficiency and model versatility (see the unified-transformer sketch after this list).
- Scalability: The authors build out the training infrastructure needed for large-scale models, leveraging distributed training techniques and optimized communication for the MoE layers to enable the scale-up to 100 billion parameters.
- Performance: M6 delivers competitive results across tasks, outperforming strong baselines on Visual Question Answering (VQA), image captioning, and image-text matching. Notably, the authors report an 11.8% improvement in VQA accuracy and a 10.3% improvement in image-text matching over comparable models.
- Text-to-Image Generation: A notable contribution is the application of M6 to text-to-image generation, where the model produces high-quality, detail-rich images from textual descriptions, opening new avenues for creative applications in design and e-commerce (a generation sketch follows the list).
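To illustrate the unified encoder-decoder design mentioned above, the sketch below shows one common way to realize it: a single transformer stack with a prefix-LM attention mask, in which image patches and the text prefix attend bidirectionally while target text is decoded left-to-right. The patch projection, the masking scheme, and all sizes are assumptions for exposition, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class UnifiedTransformer(nn.Module):
    def __init__(self, vocab_size=1000, d_model=64, patch_dim=3 * 16 * 16,
                 n_layers=2, n_heads=4):
        super().__init__()
        self.patch_proj = nn.Linear(patch_dim, d_model)   # image patches -> embeddings
        self.tok_emb = nn.Embedding(vocab_size, d_model)  # text tokens -> embeddings
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, n_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, patches, prefix_ids, target_ids):
        # One sequence: [image patches | text prefix | target text].
        x = torch.cat([self.patch_proj(patches),
                       self.tok_emb(prefix_ids),
                       self.tok_emb(target_ids)], dim=1)
        n_ctx = patches.size(1) + prefix_ids.size(1)  # bidirectional context
        n_tgt = target_ids.size(1)
        n = n_ctx + n_tgt
        # Prefix-LM mask: context attends to itself fully (encoder behavior);
        # target positions attend to the context and to earlier targets only
        # (decoder behavior). 0.0 = attend, -inf = blocked.
        mask = torch.full((n, n), float("-inf"))
        mask[:, :n_ctx] = 0.0
        mask[n_ctx:, n_ctx:] = torch.triu(
            torch.full((n_tgt, n_tgt), float("-inf")), diagonal=1)
        h = self.blocks(x, mask=mask)
        return self.lm_head(h[:, n_ctx:])  # logits at the target positions

# Tiny usage example with made-up sizes.
model = UnifiedTransformer()
patches = torch.randn(2, 4, 3 * 16 * 16)     # 4 flattened 16x16 RGB patches
prefix = torch.randint(0, 1000, (2, 5))      # conditioning text
target = torch.randint(0, 1000, (2, 7))      # text to generate / score
print(model(patches, prefix, target).shape)  # torch.Size([2, 7, 1000])
```

One attraction of this design is that switching between understanding and generation tasks only changes the mask and the input layout, not the network, which is what lets the shared parameters serve both roles.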
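For the text-to-image capability, systems of this kind typically generate a grid of discrete image codes autoregressively and then decode the finished grid to pixels with a separately trained vector-quantization model. The following sketch shows that sampling loop under those assumptions; the `model(text_ids, codes)` interface, the `bos_code` token, and the 16x16 grid size are hypothetical.

```python
import torch

@torch.no_grad()
def sample_image_codes(model, text_ids, bos_code=0, grid=16 * 16, temperature=1.0):
    """Autoregressively sample a grid of discrete image codes given text.

    `model(text_ids, codes)` is a hypothetical interface returning logits
    of shape (batch, len(codes), codebook_size); a real system would pair
    this with a trained VQ decoder that maps the code grid back to pixels.
    """
    batch = text_ids.size(0)
    codes = torch.full((batch, 1), bos_code, dtype=torch.long)  # begin-of-image
    for _ in range(grid):
        logits = model(text_ids, codes)[:, -1]         # next-code distribution
        probs = torch.softmax(logits / temperature, dim=-1)
        nxt = torch.multinomial(probs, num_samples=1)  # stochastic decoding
        codes = torch.cat([codes, nxt], dim=1)
    return codes[:, 1:]  # (batch, grid) codes, ready for the VQ decoder
```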
The implications of this paper are both practical and theoretical. Practically, M6 can be applied directly in industries such as e-commerce, for instance to generate product descriptions and support customer interaction. Theoretically, the work demonstrates that cross-modal pretraining can be pushed to very large scale, setting a precedent for future multimodal AI development, particularly in non-English contexts.
Future work may involve expanding the dataset, refining the pretraining tasks to better exploit cross-modal information, and improving training efficiency and model interpretability so that models of this scale can be deployed more readily in practice. M6 is a significant step toward harnessing large-scale data for multimodal AI systems and demonstrates the potential impact of such technologies across diverse applications.