
MCAD: Multi-teacher Cross-modal Alignment Distillation for efficient image-text retrieval (2310.19654v3)

Published 30 Oct 2023 in cs.CV and cs.AI

Abstract: Due to the success of large-scale vision-language pretraining (VLP) models and the widespread industrial use of image-text retrieval, it is now critically necessary to reduce model size and streamline mobile-device deployment. Single- and dual-stream model structures are commonly used in image-text retrieval to close the semantic gap between the textual and visual modalities. While single-stream models use deep feature fusion to achieve more accurate cross-modal alignment, dual-stream models are better suited to offline indexing and fast inference. We propose a Multi-teacher Cross-modal Alignment Distillation (MCAD) technique to integrate the advantages of single- and dual-stream models. By incorporating the fused single-stream features into the image and text features of the dual-stream model, we formulate new modified teacher similarity distributions and features. We then conduct both distribution and feature distillation to boost the capability of the student dual-stream model, achieving high retrieval performance without increasing inference complexity. Extensive experiments demonstrate the remarkable performance and high efficiency of MCAD on image-text retrieval tasks. Furthermore, we implement a lightweight CLIP model on Snapdragon/Dimensity chips with only $\sim$100M running memory and $\sim$8.0ms search latency, achieving the mobile-device application of VLP models.
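The abstract describes two distillation signals: a distribution term that matches the student's image-text similarity distribution to the teacher's, and a feature term that matches embeddings directly. As a rough illustration (not the paper's actual formulation — the function names, temperature `tau`, and mixing weight `alpha` below are hypothetical), the combined loss can be sketched as:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def distill_loss(student_sim, teacher_sim, student_feat, teacher_feat,
                 tau=2.0, alpha=0.5):
    """Illustrative two-term distillation loss.

    student_sim / teacher_sim: (N, N) image-text similarity logits.
    student_feat / teacher_feat: (N, D) embeddings to match.
    tau and alpha are assumed hyperparameters, not values from the paper.
    """
    # Distribution distillation: KL between temperature-softened
    # teacher and student similarity distributions (row-wise).
    p_t = softmax(teacher_sim / tau)
    p_s = softmax(student_sim / tau)
    kl = (p_t * (np.log(p_t + 1e-8) - np.log(p_s + 1e-8))).sum(-1).mean()

    # Feature distillation: L2 match between student and teacher embeddings.
    feat = np.mean((student_feat - teacher_feat) ** 2)

    return alpha * kl * tau**2 + (1 - alpha) * feat
```

When the student exactly reproduces the teacher's similarities and features, both terms vanish; any mismatch in either the similarity distribution or the embeddings increases the loss, which is the intuition behind combining the two signals.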

Authors (7)
  1. Youbo Lei (2 papers)
  2. Feifei He (3 papers)
  3. Chen Chen (753 papers)
  4. Yingbin Mo (2 papers)
  5. Si Jia Li (1 paper)
  6. Defeng Xie (3 papers)
  7. Haonan Lu (35 papers)
