Towards Cross-Lingual Explanation of Artwork in Large-scale Vision Language Models (2409.01584v2)

Published 3 Sep 2024 in cs.CL

Abstract: As the performance of Large-scale Vision Language Models (LVLMs) improves, they are increasingly capable of responding in multiple languages, and the demand for explanations generated by LVLMs is expected to grow. However, the pre-training of vision encoders and the integrated training of LLMs with vision encoders are conducted mainly on English data, leaving it uncertain whether LVLMs can fully realize their potential when generating explanations in languages other than English. In addition, multilingual QA benchmarks whose datasets are created with machine translation carry cultural differences and biases, which remain problematic when they are used as evaluation tasks. To address these challenges, this study created an extended multilingual dataset without relying on machine translation. This dataset, which takes nuances and country-specific phrases into account, was then used to evaluate the explanation generation abilities of LVLMs. Furthermore, this study examined whether Instruction-Tuning in resource-rich English improves performance in other languages. Our findings indicate that LVLMs perform worse in languages other than English than in English, and that they struggle to effectively leverage the knowledge learned from English data. Our dataset is available at https://huggingface.co/datasets/naist-nlp/MultiExpArt
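
The released dataset can be loaded directly from the Hugging Face Hub with the datasets library. The snippet below is a minimal sketch rather than code from the paper; it assumes the dataset exposes a default configuration, so consult the dataset card for the actual splits and field names.

# Minimal sketch (not from the paper): loading MultiExpArt from the
# Hugging Face Hub. Assumes a default configuration exists; see
# https://huggingface.co/datasets/naist-nlp/MultiExpArt for the actual
# splits and field names.
from datasets import load_dataset

dataset = load_dataset("naist-nlp/MultiExpArt")

# List the available splits and peek at one record to learn the schema.
print(dataset)
first_split = next(iter(dataset))
print(dataset[first_split][0])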
