MOFI: Learning Image Representations from Noisy Entity Annotated Images (2306.07952v3)

Published 13 Jun 2023 in cs.CV, cs.CL, and cs.LG

Abstract: We present MOFI, Manifold OF Images, a new vision foundation model designed to learn image representations from noisy entity-annotated images. MOFI differs from previous work in two key aspects: (i) pre-training data, and (ii) training recipe. Regarding data, we introduce a new approach to automatically assign entity labels to images from noisy image-text pairs. Our approach employs a named entity recognition model to extract entities from the alt-text, and then uses a CLIP model to select the correct entities as labels for the paired image. It is a simple, cost-effective method that scales to billions of web-mined image-text pairs. Through this method, we have created Image-to-Entities (I2E), a new dataset with 1 billion images and 2 million distinct entities, covering rich visual concepts in the wild. Building upon the I2E dataset, we study different training recipes, including supervised pre-training, contrastive pre-training, and multi-task learning. For contrastive pre-training, we treat entity names as free-form text and further enrich them with entity descriptions. Experiments show that supervised pre-training with large-scale fine-grained entity labels is highly effective for image retrieval tasks, and that multi-task training further improves performance. The final MOFI model achieves 86.66% mAP on the challenging GPR1200 dataset, surpassing the previous state-of-the-art of 72.19% from OpenAI's CLIP model. Further experiments on zero-shot and linear-probe image classification also show that MOFI outperforms a CLIP model trained on the original image-text data, demonstrating the effectiveness of the I2E dataset for learning strong image representations. We release our code and model weights at https://github.com/apple/ml-mofi.
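
The labeling pipeline the abstract describes, extracting candidate entities from alt-text with a named entity recognition model and keeping those a CLIP model scores highly against the paired image, can be sketched as follows. This is a minimal illustration under stated assumptions (spaCy for NER, an off-the-shelf Hugging Face CLIP checkpoint, an illustrative acceptance threshold), not the authors' released implementation.

```python
# Minimal sketch of the I2E labeling step, assuming spaCy NER and a
# Hugging Face CLIP checkpoint. The model choices and the threshold
# are illustrative, not the paper's actual configuration.
import spacy
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

nlp = spacy.load("en_core_web_sm")  # off-the-shelf NER model
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def entity_labels(image: Image.Image, alt_text: str, min_prob: float = 0.3):
    """Extract candidate entities from the alt-text, then keep those that
    CLIP scores highly against the paired image."""
    candidates = [ent.text for ent in nlp(alt_text).ents]
    if not candidates:
        return []
    inputs = processor(text=candidates, images=image,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = clip(**inputs).logits_per_image  # shape (1, num_candidates)
    probs = logits.softmax(dim=-1).squeeze(0)
    return [(c, p.item()) for c, p in zip(candidates, probs)
            if p.item() >= min_prob]

# Example on one web-mined image-text pair:
# entity_labels(Image.open("photo.jpg"),
#               "The Eiffel Tower at sunset in Paris, France")
```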

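On the training side, one plausible form of the multi-task objective, supervised classification over fine-grained entity labels combined with CLIP-style image-text contrastive learning, is sketched below. The loss weight alpha, the classifier head, and the symmetric InfoNCE formulation are assumptions; the abstract does not specify them.

```python
# Hedged sketch of a multi-task pre-training loss: entity classification
# plus CLIP-style contrastive learning. Weighting and heads are assumed.
import torch
import torch.nn.functional as F

def multitask_loss(img_emb, txt_emb, entity_logits, entity_ids,
                   logit_scale, alpha=0.5):
    """img_emb, txt_emb: (B, D) L2-normalized image/text embeddings.
    entity_logits: (B, num_entities) classifier head outputs.
    entity_ids: (B,) ground-truth entity indices from the I2E data."""
    # Supervised branch: cross-entropy over fine-grained entity labels
    cls_loss = F.cross_entropy(entity_logits, entity_ids)

    # Contrastive branch: symmetric InfoNCE over in-batch pairs
    logits = logit_scale * img_emb @ txt_emb.t()  # (B, B) similarities
    targets = torch.arange(img_emb.size(0), device=logits.device)
    con_loss = 0.5 * (F.cross_entropy(logits, targets) +
                      F.cross_entropy(logits.t(), targets))

    return alpha * cls_loss + (1 - alpha) * con_loss
```

At the scale of 2 million distinct entities, a full softmax classifier head is expensive; implementations at this scale commonly resort to sampled-softmax or similar candidate-sampling techniques rather than the dense cross-entropy shown here.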