
Dynamic Self-adaptive Multiscale Distillation from Pre-trained Multimodal Large Model for Efficient Cross-modal Representation Learning (2404.10838v1)

Published 16 Apr 2024 in cs.CV, cs.CL, and cs.MM

Abstract: In recent years, pre-trained multimodal large models have attracted widespread attention due to their outstanding performance in various multimodal applications. Nonetheless, the extensive computational resources and vast datasets required for their training present significant hurdles for deployment in environments with limited computational resources. To address this challenge, we propose, for the first time, a novel dynamic self-adaptive multiscale distillation approach from a pre-trained multimodal large model for efficient cross-modal representation learning. Unlike existing distillation methods, our strategy adopts a multiscale perspective, enabling the extraction of structural knowledge from the pre-trained multimodal large model and ensuring that the student model inherits a comprehensive and nuanced understanding of the teacher's knowledge. To optimize each distillation loss in a balanced and efficient manner, we propose a dynamic self-adaptive distillation loss balancer, a novel component that eliminates the need for manual loss weight adjustments and dynamically balances each loss item during the distillation process. Our methodology streamlines pre-trained multimodal large models using only their output features and original image-level information, requiring minimal computational resources. This efficient approach is suited to various applications and allows the deployment of advanced multimodal technologies even in resource-limited settings. Extensive experiments have demonstrated that our method maintains high performance while significantly reducing model complexity and training costs. Moreover, our distilled student model utilizes only image-level information to achieve state-of-the-art performance on cross-modal retrieval tasks, surpassing previous methods that relied on region-level information.
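
The abstract does not spell out the exact multiscale loss terms or the balancing rule, so the following is only a minimal sketch under stated assumptions: the "multiscale" transfer is illustrated with three output-level terms (feature imitation, relational similarity-matrix matching, and contrastive alignment), and the dynamic self-adaptive balancer is approximated by rescaling each loss by the inverse of its detached running average so that no manual weights are needed. All class and function names here (DynamicLossBalancer, distillation_losses) are hypothetical, not the paper's implementation.

```python
# Hedged sketch of output-feature distillation with a self-adaptive loss balancer.
# Assumes student and teacher expose image/text embeddings of the same dimension;
# the teacher is frozen and only its output features are used (no region-level info).
import torch
import torch.nn.functional as F


class DynamicLossBalancer:
    """Hypothetical stand-in for the paper's balancer: keeps an exponential
    moving average of each loss and divides by it, so every term contributes
    on a comparable scale without hand-tuned weights."""

    def __init__(self, names, momentum=0.9, eps=1e-8):
        self.ema = {n: None for n in names}
        self.momentum = momentum
        self.eps = eps

    def __call__(self, losses):  # losses: dict[str, torch.Tensor]
        total = 0.0
        for name, loss in losses.items():
            val = loss.detach()
            if self.ema[name] is None:
                self.ema[name] = val
            else:
                self.ema[name] = self.momentum * self.ema[name] + (1 - self.momentum) * val
            total = total + loss / (self.ema[name] + self.eps)
        return total


def distillation_losses(student_img, student_txt, teacher_img, teacher_txt, temperature=0.05):
    """Three illustrative output-level distillation terms computed purely from
    embeddings, mirroring the abstract's claim of using output features and
    image-level information only."""
    student_img = F.normalize(student_img, dim=-1)
    student_txt = F.normalize(student_txt, dim=-1)
    teacher_img = F.normalize(teacher_img, dim=-1)
    teacher_txt = F.normalize(teacher_txt, dim=-1)

    # (1) feature-level imitation of the teacher's output embeddings
    l_feat = F.mse_loss(student_img, teacher_img) + F.mse_loss(student_txt, teacher_txt)

    # (2) relational term: match image-text similarity matrices (structural knowledge)
    s_sim = student_img @ student_txt.t() / temperature
    t_sim = teacher_img @ teacher_txt.t() / temperature
    l_rel = F.kl_div(F.log_softmax(s_sim, dim=-1), F.softmax(t_sim, dim=-1), reduction="batchmean")

    # (3) contrastive alignment between the student's own modalities
    labels = torch.arange(student_img.size(0), device=student_img.device)
    l_con = F.cross_entropy(s_sim, labels) + F.cross_entropy(s_sim.t(), labels)

    return {"feature": l_feat, "relational": l_rel, "contrastive": l_con}


balancer = DynamicLossBalancer(["feature", "relational", "contrastive"])
# Inside a training loop (teacher outputs frozen/detached):
#   losses = distillation_losses(s_img, s_txt, t_img.detach(), t_txt.detach())
#   total_loss = balancer(losses)
#   total_loss.backward()
```

Because only teacher output features enter the losses, the teacher can be run once offline and its embeddings cached, which is consistent with the abstract's emphasis on minimal computational resources.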

Authors (5)
  1. Zhengyang Liang (10 papers)
  2. Meiyu Liang (17 papers)
  3. Wei Huang (318 papers)
  4. Yawen Li (34 papers)
  5. Zhe Xue (26 papers)