M2-Encoder: Advancing Bilingual Image-Text Understanding by Large-scale Efficient Pretraining (2401.15896v2)
Abstract: Vision-language foundation models like CLIP have revolutionized the field of artificial intelligence. Nevertheless, VLMs supporting multiple languages, e.g., both Chinese and English, have lagged behind due to the relative scarcity of large-scale pretraining datasets. To this end, we introduce BM-6B, a comprehensive bilingual (Chinese-English) dataset with over 6 billion image-text pairs, aimed at enabling multimodal foundation models to understand images well in both languages. To handle a dataset of this scale, we propose a novel grouped aggregation approach for computing the image-text contrastive loss, which significantly reduces communication overhead and GPU memory demands, yielding a 60% increase in training speed. On BM-6B, we pretrain a series of bilingual image-text foundation models with enhanced fine-grained understanding ability. The resulting models, dubbed $M^2$-Encoders (pronounced "M-Square"), set new benchmarks in both languages for multimodal retrieval and classification tasks. Notably, our largest $M^2$-Encoder-10B model achieves top-1 accuracies of 88.5% on ImageNet and 80.7% on ImageNet-CN under a zero-shot classification setting, surpassing previously reported SoTA methods by 2.2% and 21.1%, respectively. The $M^2$-Encoder series represents one of the most comprehensive bilingual image-text foundation models to date, so we are making it available to the research community for further exploration and development.
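To make the grouped-aggregation idea concrete, below is a minimal, hedged sketch of a group-wise image-text contrastive (InfoNCE) loss in PyTorch. This is not the paper's implementation: the abstract does not specify how groups are formed or how cross-GPU aggregation is scheduled, and the function name `grouped_contrastive_loss` and the `group_size` parameter are assumptions for illustration. The sketch only shows the single-process aspect of the idea, namely that processing the similarity matrix in row groups avoids materializing the full logits block (and its gradients) at once.

```python
# Illustrative sketch, not the M2-Encoder implementation.
# Computes a bidirectional image-text InfoNCE loss group by group so that only a
# (group_size x N) slice of the similarity matrix is alive at any time.
import torch
import torch.nn.functional as F


def grouped_contrastive_loss(img_emb, txt_emb, temperature=0.07, group_size=1024):
    """img_emb, txt_emb: (N, D) L2-normalized embeddings, matched pairs share an index."""
    n = img_emb.size(0)
    labels = torch.arange(n, device=img_emb.device)
    losses = []
    for start in range(0, n, group_size):
        end = min(start + group_size, n)
        # Logits for this group of rows only: shape (group, N).
        logits_i2t = img_emb[start:end] @ txt_emb.t() / temperature
        logits_t2i = txt_emb[start:end] @ img_emb.t() / temperature
        # Sum per-sample losses; normalize once at the end.
        losses.append(F.cross_entropy(logits_i2t, labels[start:end], reduction="sum"))
        losses.append(F.cross_entropy(logits_t2i, labels[start:end], reduction="sum"))
    return torch.stack(losses).sum() / (2 * n)


if __name__ == "__main__":
    torch.manual_seed(0)
    img = F.normalize(torch.randn(4096, 512), dim=-1)
    txt = F.normalize(torch.randn(4096, 512), dim=-1)
    print(grouped_contrastive_loss(img, txt).item())
```

In a multi-GPU setting, the reported communication savings would additionally depend on how feature gathering and loss aggregation are grouped across devices, which this single-process sketch does not model.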
- Qingpei Guo
- Furong Xu
- Hanxiao Zhang
- Wang Ren
- Ziping Ma
- Lin Ju
- Jian Wang
- Jingdong Chen
- Ming Yang