
Data Efficient Language-supervised Zero-shot Recognition with Optimal Transport Distillation (2112.09445v3)

Published 17 Dec 2021 in cs.CV

Abstract: Traditional computer vision models are trained to predict a fixed set of predefined categories. Recently, natural language has been shown to be a broader and richer source of supervision that provides finer descriptions to visual concepts than supervised "gold" labels. Previous works, such as CLIP, use InfoNCE loss to train a model to predict the pairing between images and text captions. CLIP, however, is data hungry and requires more than 400M image-text pairs for training. The inefficiency can be partially attributed to the fact that the image-text pairs are noisy. To address this, we propose OTTER (Optimal TransporT distillation for Efficient zero-shot Recognition), which uses online entropic optimal transport to find a soft image-text match as labels for contrastive learning. Based on pretrained image and text encoders, models trained with OTTER achieve strong performance with only 3M image-text pairs. Compared with InfoNCE loss, label smoothing, and knowledge distillation, OTTER consistently outperforms these baselines in zero-shot evaluation on Google Open Images (19,958 classes) and multi-labeled ImageNet 10K (10,032 classes) from Tencent ML-Images. Over 42 evaluations on 7 different dataset/architecture settings × 6 metrics, OTTER outperforms (32) or ties (2) all baselines in 34 of them.

Data Efficient Language-Supervised Zero-Shot Recognition with OTTER

The paper "OTTER: Data Efficient Language-Supervised Zero-Shot Recognition with Optimal Transport Distillation" presents an innovative approach to enhance zero-shot learning (ZSL) in computer vision. Unlike conventional models trained to predict a fixed set of categories, OTTER leverages the richness of natural language supervision to improve visual recognition tasks. The core contribution of OTTER lies in its use of online entropic optimal transport to achieve efficient data utilization in language-supervised zero-shot learning.

Key Contributions

  1. Optimal Transport Distillation: OTTER improves upon prior methods such as CLIP, which uses the InfoNCE loss for contrastive learning and treats each image and its own caption as the only correct match. Because web-crawled image-text pairs are noisy, this hard pairing forces CLIP to train on enormous datasets of over 400 million pairs. OTTER instead uses entropic optimal transport to refine the match between images and text captions into a soft assignment, providing more accurate supervision during training (see the sketch after this list).
  2. Reduction in Data Requirements: The model achieves robust performance with significantly less data. Starting from pretrained image and text encoders and using only 3 million image-text pairs, OTTER demonstrates capabilities competitive with, or superior to, approaches that rely on far larger datasets.
  3. Zero-Shot Evaluation: OTTER's methodology was rigorously tested against InfoNCE, label smoothing, and knowledge distillation baselines on zero-shot tasks over diverse datasets, including Google Open Images (19,958 classes) and multi-labeled ImageNet 10K (10,032 classes). Across 42 evaluations spanning 7 dataset/architecture settings × 6 metrics, OTTER outperformed all baselines in 32 cases and tied in 2, for 34 of the 42.
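
To make the mechanism concrete, below is a minimal NumPy sketch of Sinkhorn-based soft targets plugged into a contrastive loss. It follows the entropic formulation above under stated assumptions (uniform marginals, cost of 1 minus cosine similarity); the function names and hyperparameter values (alpha, tau, eps) are illustrative rather than taken from the authors' released code.

```python
import numpy as np

def sinkhorn(cost, eps=0.05, n_iters=50):
    """Entropic optimal transport via Sinkhorn matrix scaling.

    cost: (N, N) matrix, e.g. 1 - cosine similarity between image i
    and caption j. Returns a transport plan whose rows and columns
    each sum (approximately) to the uniform marginal 1/N.
    """
    N = cost.shape[0]
    r = np.full(N, 1.0 / N)             # uniform image marginal
    c = np.full(N, 1.0 / N)             # uniform caption marginal
    K = np.exp(-cost / eps)             # Gibbs kernel
    u = np.ones(N)
    for _ in range(n_iters):            # alternate row/column scaling
        v = c / (K.T @ u)
        u = r / (K @ v)
    return u[:, None] * K * v[None, :]  # plan = diag(u) K diag(v)

def otter_style_loss(img_emb, txt_emb, alpha=0.5, tau=0.07, eps=0.05):
    """Contrastive loss with OT-smoothed targets (illustrative sketch).

    img_emb, txt_emb: (N, D) L2-normalized embeddings of paired data.
    alpha blends the hard identity match with the soft OT matching.
    """
    N = img_emb.shape[0]
    sim = img_emb @ txt_emb.T                    # (N, N) cosine similarities
    plan = sinkhorn(1.0 - sim, eps=eps)          # soft image-caption matching
    targets = (1 - alpha) * np.eye(N) + alpha * N * plan  # rows sum to 1
    logits = sim / tau
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_p = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -(targets * log_p).sum(axis=1).mean()  # soft cross-entropy
```

In practice one would also add the symmetric text-to-image term with the transposed plan, compute embeddings with the pretrained encoders, and stop gradients through the transport plan; OTTER's actual loss composition and hyperparameter values should be taken from the paper rather than this sketch.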

Strong Numerical Results

The numerical results underscore the efficacy of the proposed method. OTTER performs strongly across diverse dataset and architecture settings, and its use of optimal transport to align images with captions yields a substantial gain in data efficiency: trained on only 3M pairs, it delivers results competitive with models trained on datasets orders of magnitude larger.

Implications and Future Directions

The practical implications of OTTER are considerable: it offers a blueprint for building models capable of zero-shot classification from far fewer image-text pairs. Theoretically, the paper highlights optimal transport's potential for enhancing contrastive learning frameworks and opens avenues for further work on managing label noise.

Looking ahead, subsequent research could apply OTTER to broader datasets, such as those comparable in scale to CLIP's 400M pairs. Such extensions would test how entropic regularized optimal transport scales and could uncover further strategies for improving ZSL performance.

The paper's contribution to artificial intelligence, especially in the domains of computer vision and language processing, marks a notable advancement by proposing a fresh mechanism to tackle traditional bottlenecks associated with large-scale data requirements in ZSL.

Authors (6)
  1. Bichen Wu (52 papers)
  2. Ruizhe Cheng (3 papers)
  3. Peizhao Zhang (40 papers)
  4. Tianren Gao (7 papers)
  5. Peter Vajda (52 papers)
  6. Joseph E. Gonzalez (167 papers)