What Is Missing in Multilingual Visual Reasoning and How to Fix It (2403.01404v2)

Published 3 Mar 2024 in cs.CL

Abstract: NLP models today strive to support multiple languages and modalities, improving accessibility for diverse users. In this paper, we evaluate their multilingual, multimodal capabilities by testing them on a visual reasoning task. We observe that proprietary systems like GPT-4V currently obtain the best performance on this task, while open models lag behind. Surprisingly, GPT-4V exhibits similar performance between English and other languages, indicating the potential for equitable system development across languages. Our analysis of model failures reveals three key aspects that make this task challenging: multilinguality, complex reasoning, and multimodality. To address these challenges, we propose three targeted interventions: a translate-test approach to tackle multilinguality, a visual programming approach to break down complex reasoning, and a method that leverages image captioning to address multimodality. Our interventions achieve the best open performance on this task in a zero-shot setting, boosting the open models LLaVA-v1.5-13B by 13.4%, LLaVA-v1.6-34B by 20.3%, and Qwen-VL by 16.7%, while also modestly improving GPT-4V's performance.
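
To make the interventions concrete, below is a minimal sketch of two of the three (translate-test and captioning) in a zero-shot pipeline. It is not the authors' released code: the Hugging Face model choices, the two-image true/false task format (a MaRVL/NLVR2-style setup), and the `llm` callable are all illustrative assumptions.

# Minimal sketch of the translate-test and captioning interventions.
# Assumptions (not from the paper): the Hugging Face models below, the
# two-image true/false task format, and the `llm` callable.
from transformers import pipeline

# Many-to-English translation model for the translate-test step.
translator = pipeline("translation", model="Helsinki-NLP/opus-mt-mul-en")
# Image captioner so a text-only reasoner can operate over visual content.
captioner = pipeline("image-to-text", model="Salesforce/blip-image-captioning-base")

def translate_test(statement: str) -> str:
    # Intervention for multilinguality: map the non-English statement
    # into English before the model sees it (translate-test).
    return translator(statement)[0]["translation_text"]

def caption_images(images) -> list[str]:
    # Intervention for multimodality: replace raw pixels with captions.
    return [captioner(img)[0]["generated_text"] for img in images]

def zero_shot_visual_reasoning(statement: str, images, llm) -> bool:
    # End-to-end zero-shot pipeline: translate, caption, then reason.
    # `llm` is any callable str -> str (assumed: a zero-shot chat model,
    # e.g. one of the open models named in the abstract).
    captions = caption_images(images)
    prompt = (
        f"Image 1: {captions[0]}\n"
        f"Image 2: {captions[1]}\n"
        f"True or false: {translate_test(statement)}"
    )
    return "true" in llm(prompt).lower()

The third intervention, visual programming (decomposing the reasoning into executable steps, in the spirit of VisProg or ViperGPT), is omitted from this sketch because it additionally requires a code-generating model and an execution harness.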
