Meta-learning For Vision-and-language Cross-lingual Transfer (2305.14843v2)

Published 24 May 2023 in cs.CL

Abstract: Current pre-trained vision-language models (PVLMs) achieve excellent performance on a range of multi-modal datasets. Recent work has aimed at building multilingual models, and a range of novel multilingual multi-modal datasets have been proposed. Current PVLMs typically perform poorly on these datasets when used for multi-modal zero-shot or few-shot cross-lingual transfer, especially for low-resource languages. To alleviate this problem, we propose a novel meta-learning fine-tuning framework. Our framework makes current PVLMs rapidly adaptable to new languages in vision-language scenarios by designing MAML in a cross-lingual multi-modal manner. Experiments show that our method boosts the performance of current state-of-the-art PVLMs in both zero-shot and few-shot cross-lingual transfer on a range of vision-language understanding tasks and datasets (XVNLI, xGQA, MaRVL, xFlickr&CO).
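
The abstract describes the method only as "designing MAML in a cross-lingual multi-modal manner". The sketch below illustrates the general recipe this points to: an inner loop that adapts a shared vision-language model on a small support set from one target language, and an outer loop that updates the shared parameters from the query-set loss of each adapted copy. This is a minimal first-order MAML sketch, not the paper's implementation; `DummyPVLM`, `sample_language_episode`, the language codes, and all hyperparameters are illustrative placeholders.

```python
# Minimal first-order MAML sketch for cross-lingual meta-fine-tuning.
# Assumption: the real framework fine-tunes a pre-trained vision-language
# model on multilingual multi-modal episodes; a tiny dummy classifier and
# random "episodes" stand in for both here.
import copy
import torch
import torch.nn as nn

class DummyPVLM(nn.Module):
    """Stand-in for a pre-trained vision-language model with a task head."""
    def __init__(self, vis_dim=16, txt_dim=16, n_classes=3):
        super().__init__()
        self.head = nn.Linear(vis_dim + txt_dim, n_classes)

    def forward(self, img, txt):
        return self.head(torch.cat([img, txt], dim=-1))

def sample_language_episode(language, vis_dim=16, txt_dim=16, n_classes=3, k=8):
    """Hypothetical sampler: in the real setup this would draw a support and a
    query batch from data in `language`; here it returns random tensors."""
    def batch():
        return (torch.randn(k, vis_dim), torch.randn(k, txt_dim),
                torch.randint(0, n_classes, (k,)))
    return batch(), batch()

def inner_adapt(model, loss_fn, support, inner_lr=1e-3, steps=1):
    """Inner loop: clone the shared model and take a few gradient steps
    on the support set of one target language."""
    adapted = copy.deepcopy(model)
    opt = torch.optim.SGD(adapted.parameters(), lr=inner_lr)
    for _ in range(steps):
        img, txt, y = support
        loss = loss_fn(adapted(img, txt), y)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return adapted

def meta_train_step(model, meta_opt, loss_fn, languages):
    """Outer loop: accumulate query-set gradients from each language-adapted
    copy into the shared parameters (first-order MAML approximation)."""
    meta_opt.zero_grad()
    for lang in languages:
        support, query = sample_language_episode(lang)
        adapted = inner_adapt(model, loss_fn, support)
        img, txt, y = query
        query_loss = loss_fn(adapted(img, txt), y)
        grads = torch.autograd.grad(query_loss, tuple(adapted.parameters()))
        for p, g in zip(model.parameters(), grads):
            p.grad = g if p.grad is None else p.grad + g
    meta_opt.step()

model = DummyPVLM()
meta_opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
for _ in range(3):  # a few meta-training steps over illustrative language codes
    meta_train_step(model, meta_opt, nn.CrossEntropyLoss(),
                    languages=["id", "sw", "ta"])
```

A full MAML variant would backpropagate through the inner-loop updates (second-order gradients); the first-order form is used here only to keep the sketch short.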

Authors (2)
  1. Hanxu Hu (9 papers)
  2. Frank Keller (45 papers)