A Progressive Framework of Vision-language Knowledge Distillation and Alignment for Multilingual Scene (2404.11249v1)
Abstract: Pre-trained vision-language (V-L) models such as CLIP have shown excellent performance on many downstream cross-modal tasks. However, most of them are applicable only to the English context. Subsequent research has addressed this problem with improved models, such as CN-CLIP and AltCLIP, that extend their applicability to Chinese and other languages. Nevertheless, these models suffer from high inference latency and a large memory footprint, which limits their deployment on resource-constrained edge devices. In this work, we propose a conceptually simple yet effective multilingual CLIP compression framework and train a lightweight multilingual vision-language model, called DC-CLIP, for both the Chinese and English contexts. In this framework, we collect high-quality Chinese and English image-text pairs and design two training stages: multilingual vision-language feature distillation and alignment. In the first stage, lightweight image and text student models learn robust visual and multilingual textual feature representations from their corresponding teacher models. The subsequent multilingual vision-language alignment stage then aligns the visual and multilingual textual features to further improve the model's multilingual performance. Comprehensive zero-shot image classification experiments on the ELEVATER benchmark show that DC-CLIP achieves superior performance in the English context and competitive performance in the Chinese context, even with less training data, compared to existing models of similar parameter magnitude. These results demonstrate the effectiveness of the designed training mechanism.
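The two training stages described above can be sketched with toy losses: stage one distills the (frozen) teacher's features into the student, here via MSE over L2-normalized features, and stage two aligns image and text features with a CLIP-style symmetric contrastive (InfoNCE) loss. This is an illustrative NumPy sketch under assumed loss choices and temperature, not the paper's actual implementation.

```python
import numpy as np

def l2norm(x, eps=1e-8):
    """L2-normalize feature vectors along the last axis."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def distill_loss(student_feats, teacher_feats):
    """Stage 1 (feature distillation): pull student features toward the
    frozen teacher's features. MSE over normalized features is one common
    choice; the paper's exact objective may differ."""
    return float(np.mean((l2norm(student_feats) - l2norm(teacher_feats)) ** 2))

def clip_alignment_loss(img_feats, txt_feats, temperature=0.07):
    """Stage 2 (vision-language alignment): CLIP-style symmetric contrastive
    loss. Matching image/text pairs (the diagonal of the similarity matrix)
    are positives; all other pairs in the batch are negatives."""
    logits = l2norm(img_feats) @ l2norm(txt_feats).T / temperature  # (B, B)
    labels = np.arange(len(logits))

    def xent(lg):
        lg = lg - lg.max(axis=1, keepdims=True)  # numerical stability
        logp = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()

    # Average of image-to-text and text-to-image cross-entropy.
    return float((xent(logits) + xent(logits.T)) / 2)

rng = np.random.default_rng(0)
img = rng.normal(size=(4, 8))                 # batch of image features
txt = img + 0.01 * rng.normal(size=(4, 8))    # nearly aligned text features

print(distill_loss(img, txt))       # small: student close to teacher
print(clip_alignment_loss(img, txt))  # small: paired features align
```

Well-aligned pairs drive both losses toward zero, while mismatched batches keep the contrastive loss near log(batch size); the temperature controls how sharply the diagonal must dominate.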
- Contrastive language-image pre-training for the Italian language. arXiv preprint arXiv:2108.08688.
- Food-101 – Mining discriminative components with random forests. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part VI 13, 446–461. Springer.
- Model compression. In Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining, 535–541.
- Cross-lingual and multilingual CLIP. In Proceedings of the Thirteenth Language Resources and Evaluation Conference, 6848–6854.
- Conceptual 12M: Pushing web-scale image-text pre-training to recognize long-tail visual concepts. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 3558–3568.
- Knowledge distillation with feature maps for image classification. In Computer Vision–ACCV 2018: 14th Asian Conference on Computer Vision, Perth, Australia, December 2–6, 2018, Revised Selected Papers, Part III 14, 200–215. Springer.
- AltCLIP: Altering the language encoder in CLIP for extended language capabilities. arXiv preprint arXiv:2211.06679.
- Unsupervised cross-lingual representation learning at scale. arXiv preprint arXiv:1911.02116.
- Pre-Training with Whole Word Masking for Chinese BERT.
- ImageNet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, 248–255. IEEE.
- An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929.
- Learning generative visual models from few training examples: An incremental Bayesian approach tested on 101 object categories. In 2004 conference on computer vision and pattern recognition workshop, 178–178. IEEE.
- Astraea: Deploy AI services at the edge in elegant ways. In 2020 IEEE International Conference on Edge Computing (EDGE), 49–53. IEEE.
- Girshick, R. 2015. Fast R-CNN. In Proceedings of the IEEE international conference on computer vision, 1440–1448.
- Knowledge distillation: A survey. International Journal of Computer Vision, 129(6): 1789–1819.
- Wukong: A 100 million large-scale Chinese cross-modal pre-training benchmark. Advances in Neural Information Processing Systems, 35: 26418–26431.
- Open-vocabulary object detection via vision and language knowledge distillation. arXiv preprint arXiv:2104.13921.
- Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 9729–9738.
- Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, 770–778.
- EuroSAT: A novel dataset and deep learning benchmark for land use and land cover classification. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 12(7): 2217–2226.
- Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531.
- Scaling up visual and vision-language representation learning with noisy text supervision. In International conference on machine learning, 4904–4916. PMLR.
- MaPLe: Multi-modal prompt learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 19113–19122.
- 3D object representations for fine-grained categorization. In Proceedings of the IEEE international conference on computer vision workshops, 554–561.
- Language-driven semantic segmentation. arXiv preprint arXiv:2201.03546.
- ELEVATER: A benchmark and toolkit for evaluating language-augmented visual models. Advances in Neural Information Processing Systems, 35: 9287–9301.
- Grounded language-image pre-training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 10965–10975.
- Tiny Machine Learning: Progress and Futures [Feature]. IEEE Circuits and Systems Magazine, 23(3): 8–34.
- Fine-grained visual classification of aircraft. arXiv preprint arXiv:1306.5151.
- TinyML Techniques for running Machine Learning models on Edge Devices. In Proceedings of the Second International Conference on AI-ML Systems, 1–2.
- Automated flower classification over a large number of classes. In 2008 Sixth Indian conference on computer vision, graphics & image processing, 722–729. IEEE.
- Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748.
- Cats and dogs. In 2012 IEEE conference on computer vision and pattern recognition, 3498–3505. IEEE.
- Learning transferable visual models from natural language supervision. In International conference on machine learning, 8748–8763. PMLR.
- LAION-5B: An open large-scale dataset for training next generation image-text models. Advances in Neural Information Processing Systems, 35: 25278–25294.
- FLAVA: A foundational language and vision alignment model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 15638–15650.
- UCF101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402.
- Training very deep networks. Advances in neural information processing systems, 28.
- YFCC100M: The new data in multimedia research. Communications of the ACM, 59(2): 64–73.
- MiniLMv2: Multi-head self-attention relation distillation for compressing pretrained transformers. arXiv preprint arXiv:2012.15828.
- Contrastive learning rivals masked image modeling in fine-tuning via feature distillation. arXiv preprint arXiv:2205.14141.
- TinyCLIP: CLIP distillation via affinity mimicking and weight inheritance. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 21970–21980.
- GroupViT: Semantic segmentation emerges from text supervision. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 18134–18144.
- A survey on knowledge distillation of large language models. arXiv preprint arXiv:2402.13116.
- Chinese CLIP: Contrastive vision-language pretraining in Chinese. arXiv preprint arXiv:2211.01335.
- Unified contrastive learning in image-text-label space. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 19163–19173.
- Florence: A new foundation model for computer vision. arXiv preprint arXiv:2111.11432.
- LiT: Zero-shot transfer with locked-image text tuning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 18123–18133.
- Cross-modal contrastive learning for text-to-image generation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 833–842.
- GLIPv2: Unifying localization and vision-language understanding. Advances in Neural Information Processing Systems, 35: 36067–36080.
- Fengshenbang 1.0: Being the Foundation of Chinese Cognitive Intelligence. CoRR, abs/2209.02970.
- Tip-Adapter: Training-free CLIP-Adapter for better vision-language modeling. arXiv preprint arXiv:2111.03930.
- Learning to prompt for vision-language models. International Journal of Computer Vision, 130(9): 2337–2348.
Authors: Wenbo Zhang, Yifan Zhang, Jianfeng Lin, Binqiang Huang, Jinlu Zhang, Wenhao Yu