Babel-ImageNet: Massively Multilingual Evaluation of Vision-and-Language Representations (2306.08658v2)
Abstract: Vision-and-language (VL) models with separate encoders for each modality (e.g., CLIP) have become the go-to models for zero-shot image classification and image-text retrieval. They are, however, mostly evaluated in English as multilingual benchmarks are limited in availability. We introduce Babel-ImageNet, a massively multilingual benchmark that offers (partial) translations of ImageNet labels to 100 languages, built without machine translation or manual annotation. We instead automatically obtain reliable translations by linking them, via shared WordNet synsets, to BabelNet, a massively multilingual lexico-semantic network. We evaluate 11 public multilingual CLIP models on zero-shot image classification (ZS-IC) on our benchmark, demonstrating a significant gap between English ImageNet performance and that of high-resource languages (e.g., German or Chinese), and an even bigger gap for low-resource languages (e.g., Sinhala or Lao). Crucially, we show that the models' ZS-IC performance highly correlates with their performance in image-text retrieval, validating the use of Babel-ImageNet to evaluate multilingual models for the vast majority of languages without gold image-text data. Finally, we show that the performance of multilingual CLIP can be drastically improved for low-resource languages with parameter-efficient language-specific training. We make our code and data publicly available: https://github.com/gregor-ge/Babel-ImageNet
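To make the ZS-IC setup in the abstract concrete, the following is a minimal sketch of how a dual-encoder model can classify an image against translated labels: the image embedding is compared with the text embedding of each class label, and the most similar label is predicted. The embeddings, the German labels, and the function names below are hypothetical toy stand-ins, not the authors' code; in a real evaluation the vectors would come from a multilingual CLIP model's image and text encoders.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def zero_shot_classify(image_emb, label_embs):
    """Return the label with the highest cosine similarity to the image.

    image_emb:  list of floats (assumed output of an image encoder)
    label_embs: dict mapping label text -> list of floats
                (assumed output of a text encoder for each translated label)
    """
    img = l2_normalize(image_emb)
    scores = {}
    for label, emb in label_embs.items():
        txt = l2_normalize(emb)
        scores[label] = sum(a * b for a, b in zip(img, txt))
    best = max(scores, key=scores.get)
    return best, scores


# Hypothetical toy embeddings for three German class labels.
german_labels = {
    "Hund": [0.9, 0.1, 0.0],
    "Katze": [0.1, 0.9, 0.0],
    "Auto": [0.0, 0.1, 0.9],
}
image_emb = [0.8, 0.3, 0.1]  # hypothetical embedding of a dog photo

prediction, scores = zero_shot_classify(image_emb, german_labels)
print(prediction)
```

Per-language accuracy would then be the fraction of images whose predicted label matches the gold class, computed only over the classes that have a translation in that language.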
Authors:
- Gregor Geigle
- Radu Timofte
- Goran Glavaš