MOFI: Learning Image Representations from Noisy Entity Annotated Images (arXiv:2306.07952v3)
Abstract: We present MOFI, Manifold OF Images, a new vision foundation model designed to learn image representations from noisy entity-annotated images. MOFI differs from previous work in two key aspects: (i) pre-training data, and (ii) training recipe. On the data side, we introduce a new approach to automatically assign entity labels to images from noisy image-text pairs: a named entity recognition model extracts candidate entities from the alt-text, and a CLIP model then selects the entities that match the paired image as its labels. This simple, cost-effective method scales to billions of web-mined image-text pairs. Using it, we create Image-to-Entities (I2E), a new dataset with 1 billion images and 2 million distinct entities, covering rich visual concepts in the wild. Building on the I2E dataset, we study different training recipes, including supervised pre-training, contrastive pre-training, and multi-task learning. For contrastive pre-training, we treat entity names as free-form text and further enrich them with entity descriptions. Experiments show that supervised pre-training with large-scale fine-grained entity labels is highly effective for image retrieval tasks, and that multi-task training further improves performance. The final MOFI model achieves 86.66% mAP on the challenging GPR1200 dataset, surpassing the previous state-of-the-art of 72.19% from OpenAI's CLIP model. Further experiments on zero-shot and linear-probe image classification show that MOFI also outperforms a CLIP model trained on the original image-text data, demonstrating the effectiveness of the I2E dataset in learning strong image representations. We release our code and model weights at https://github.com/apple/ml-mofi.
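The entity-labeling pipeline described in the abstract is simple enough to sketch end to end. Below is a minimal illustration in Python of the two-stage idea (NER on alt-text, then CLIP-based entity selection), assuming an off-the-shelf spaCy NER model and the open_clip library; the model names (`en_core_web_sm`, `ViT-B-32`) and the similarity threshold are hypothetical stand-ins, not the paper's actual configuration.

```python
# Sketch of the I2E labeling pipeline: extract candidate entities from noisy
# alt-text with NER, then keep entities whose CLIP image-text similarity
# clears a threshold. All model choices and the threshold are assumptions.
import spacy
import torch
import open_clip
from PIL import Image

# Off-the-shelf NER model and pre-trained CLIP model (illustrative choices).
nlp = spacy.load("en_core_web_sm")
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion400m_e32"
)
tokenizer = open_clip.get_tokenizer("ViT-B-32")
model.eval()

def label_image(image_path: str, alt_text: str, threshold: float = 0.2):
    """Return the alt-text entities that CLIP judges to match the image."""
    # Step 1: named entity recognition on the noisy alt-text.
    candidates = [ent.text for ent in nlp(alt_text).ents]
    if not candidates:
        return []

    # Step 2: score every candidate entity against the paired image.
    image = preprocess(Image.open(image_path)).unsqueeze(0)
    texts = tokenizer(candidates)
    with torch.no_grad():
        img_emb = model.encode_image(image)
        txt_emb = model.encode_text(texts)
        img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
        txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
        sims = (img_emb @ txt_emb.T).squeeze(0)

    # Step 3: keep entities whose similarity exceeds the threshold.
    return [e for e, s in zip(candidates, sims.tolist()) if s > threshold]
```

At billion-image scale, the same logic would presumably run in batches over pre-computed CLIP embeddings rather than one image at a time, which is what makes this style of automatic labeling cost-effective compared to human annotation.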