Meta-learning For Vision-and-language Cross-lingual Transfer (2305.14843v2)
Abstract: Current pre-trained vision-language models (PVLMs) achieve excellent performance on a range of multi-modal datasets. Recent work has aimed at building multilingual models, and a range of novel multilingual multi-modal datasets have been proposed. Current PVLMs typically perform poorly on these datasets when used for multi-modal zero-shot or few-shot cross-lingual transfer, especially for low-resource languages. To alleviate this problem, we propose a novel meta-learning fine-tuning framework. Our framework makes current PVLMs rapidly adaptive to new languages in vision-language scenarios by designing MAML in a cross-lingual multi-modal manner. Experiments show that our method boosts the performance of current state-of-the-art PVLMs in both zero-shot and few-shot cross-lingual transfer on a range of vision-language understanding tasks and datasets (XVNLI, xGQA, MaRVL, xFlickr&CO).
Authors: Hanxu Hu, Frank Keller