M2-Encoder: Advancing Bilingual Image-Text Understanding by Large-scale Efficient Pretraining (2401.15896v2)

Published 29 Jan 2024 in cs.CV and cs.AI

Abstract: Vision-language foundation models like CLIP have revolutionized the field of artificial intelligence. Nevertheless, VLMs supporting multiple languages, e.g., both Chinese and English, have lagged due to the relative scarcity of large-scale pretraining datasets. Toward this end, we introduce BM-6B, a comprehensive bilingual (Chinese-English) dataset with over 6 billion image-text pairs, aimed at enhancing multimodal foundation models' understanding of images in both languages. To handle a dataset of this scale, we propose a novel grouped aggregation approach for image-text contrastive loss computation, which significantly reduces communication overhead and GPU memory demands, yielding a 60% increase in training speed. We pretrain a series of bilingual image-text foundation models with enhanced fine-grained understanding on BM-6B; the resulting models, dubbed $M^2$-Encoders (pronounced "M-Square"), set new benchmarks in both languages for multimodal retrieval and classification tasks. Notably, our largest $M^2$-Encoder-10B model achieves top-1 accuracies of 88.5% on ImageNet and 80.7% on ImageNet-CN under a zero-shot classification setting, surpassing previously reported SoTA methods by 2.2% and 21.1%, respectively. The $M^2$-Encoder series represents one of the most comprehensive bilingual image-text foundation models to date, and we are making it available to the research community for further exploration and development.
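
The grouped aggregation idea can be illustrated with a small PyTorch sketch. The snippet below is not the paper's algorithm (the abstract does not spell it out); it only shows the general pattern of computing a CLIP-style image-text contrastive loss while gathering negatives from a restricted process subgroup, which is one way to cut all-gather communication and shrink the similarity matrix each GPU must hold. The function name, signature, and the use of a process subgroup are illustrative assumptions.

```python
# Minimal sketch, assuming a CLIP-style bidirectional InfoNCE loss with
# negatives gathered only within a torch.distributed subgroup. This is an
# illustration of grouped aggregation in general, not the M2-Encoder method.
import torch
import torch.nn.functional as F
import torch.distributed as dist


def grouped_contrastive_loss(img_emb, txt_emb, temperature=0.07, group=None):
    """img_emb, txt_emb: (B, D) L2-normalized embeddings from the local GPU.
    group: optional torch.distributed process group; negatives come only
    from ranks in this group (hypothetical setup for this sketch)."""
    if dist.is_available() and dist.is_initialized():
        world = dist.get_world_size(group)
        rank = dist.get_rank(group)
        # Gather embeddings from the ranks of the (sub)group only.
        img_all = [torch.zeros_like(img_emb) for _ in range(world)]
        txt_all = [torch.zeros_like(txt_emb) for _ in range(world)]
        dist.all_gather(img_all, img_emb.contiguous(), group=group)
        dist.all_gather(txt_all, txt_emb.contiguous(), group=group)
        # all_gather does not propagate gradients, so keep the local
        # tensors in place for the backward pass.
        img_all[rank], txt_all[rank] = img_emb, txt_emb
        img_all = torch.cat(img_all, dim=0)
        txt_all = torch.cat(txt_all, dim=0)
    else:
        world, rank = 1, 0
        img_all, txt_all = img_emb, txt_emb

    # Similarity logits: each local sample against all gathered samples.
    logits_i2t = img_emb @ txt_all.t() / temperature
    logits_t2i = txt_emb @ img_all.t() / temperature

    # The positive for local sample i sits at index rank * B + i.
    batch = img_emb.size(0)
    labels = torch.arange(batch, device=img_emb.device) + rank * batch
    return 0.5 * (F.cross_entropy(logits_i2t, labels) +
                  F.cross_entropy(logits_t2i, labels))
```

In a real multi-node run the subgroups would be created with `dist.new_group(...)`, trading the number of in-group negatives against communication volume; the 60% speedup quoted in the abstract refers to the authors' own aggregation scheme, not to this sketch.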

Authors (9)
  1. Qingpei Guo (27 papers)
  2. Furong Xu (22 papers)
  3. Hanxiao Zhang (24 papers)
  4. Wang Ren (4 papers)
  5. Ziping Ma (4 papers)
  6. Lin Ju (10 papers)
  7. Jian Wang (966 papers)
  8. Jingdong Chen (61 papers)
  9. Ming Yang (289 papers)
Citations (1)