LightCLIP: Learning Multi-Level Interaction for Lightweight Vision-Language Models (2312.00674v1)

Published 1 Dec 2023 in cs.CV

Abstract: Vision-language pre-training like CLIP has shown promising performance on various downstream tasks such as zero-shot image classification and image-text retrieval. Most existing CLIP-like works adopt relatively large image encoders such as ResNet50 and ViT, while lightweight counterparts are rarely discussed. In this paper, we propose a multi-level interaction paradigm for training lightweight CLIP models. Firstly, to mitigate the problem that some image-text pairs are not in strict one-to-one correspondence, we improve the conventional global instance-level alignment objective by progressively softening the labels of negative samples. Secondly, a relaxed bipartite-matching-based token-level alignment objective is introduced for finer-grained alignment between image patches and textual words. Moreover, based on the observation that the accuracy of a CLIP model does not increase correspondingly as the parameters of its text encoder increase, an extra masked language modeling (MLM) objective is leveraged to maximize the potential of the shortened text encoder. In practice, an auxiliary fusion module that injects unmasked image embeddings into masked text embeddings at different network stages is proposed to enhance the MLM. Extensive experiments show that, without introducing additional computational cost during inference, the proposed method achieves higher performance on multiple downstream tasks.
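
The progressive label-softening idea from the abstract can be made concrete with a short sketch. The PyTorch snippet below is a minimal illustration, not the paper's exact objective: the linear `alpha` schedule, the `alpha_max` value, and the use of the model's own (detached) in-batch similarities as soft targets are all assumptions.

```python
import torch
import torch.nn.functional as F

def soft_clip_loss(img_emb, txt_emb, step, total_steps,
                   temperature=0.07, alpha_max=0.2):
    """Instance-level CLIP loss with progressively softened negative labels.

    img_emb, txt_emb: (B, D) L2-normalized embeddings.
    alpha ramps from 0 to alpha_max over training (an assumed schedule),
    moving label mass from the hard one-hot targets onto the model's own
    cross-modal similarities.
    """
    logits = img_emb @ txt_emb.t() / temperature             # (B, B) similarities
    hard = torch.eye(logits.size(0), device=logits.device)   # one-hot targets
    alpha = alpha_max * step / total_steps                   # progressive ramp
    # Soft targets derived from the model's own (detached) similarities.
    target_i2t = (1 - alpha) * hard + alpha * F.softmax(logits.detach(), dim=-1)
    target_t2i = (1 - alpha) * hard + alpha * F.softmax(logits.t().detach(), dim=-1)
    loss_i2t = F.cross_entropy(logits, target_i2t)           # image -> text
    loss_t2i = F.cross_entropy(logits.t(), target_t2i)       # text -> image
    return 0.5 * (loss_i2t + loss_t2i)
```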

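The token-level alignment objective can likewise be sketched as bipartite matching between image patch tokens and text word tokens. The paper describes a *relaxed* bipartite matching; the sketch below substitutes plain Hungarian matching (`scipy.optimize.linear_sum_assignment`) on cosine similarity as an illustrative stand-in, and the final loss form is assumed.

```python
import torch
import torch.nn.functional as F
from scipy.optimize import linear_sum_assignment

def token_alignment_loss(patch_tokens, word_tokens):
    """Token-level alignment via bipartite matching (illustrative sketch).

    patch_tokens: (P, D) image patch embeddings.
    word_tokens:  (W, D) text word embeddings, with W <= P assumed.
    """
    patch_tokens = F.normalize(patch_tokens, dim=-1)
    word_tokens = F.normalize(word_tokens, dim=-1)
    sim = word_tokens @ patch_tokens.t()                  # (W, P) cosine sims
    # Hungarian matching maximizes total similarity (minimize the negative).
    rows, cols = linear_sum_assignment(-sim.detach().cpu().numpy())
    matched = sim[rows, cols]                             # matched pair sims
    return (1.0 - matched).mean()                         # pull matched pairs together
```
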
Authors (7)
  1. Ying Nie (15 papers)
  2. Wei He (188 papers)
  3. Kai Han (184 papers)
  4. Yehui Tang (63 papers)
  5. Tianyu Guo (33 papers)
  6. Fanyi Du (2 papers)
  7. Yunhe Wang (145 papers)
Citations (2)