Heron-Bench: A Benchmark for Evaluating Vision Language Models in Japanese (2404.07824v1)

Published 11 Apr 2024 in cs.CV and cs.CL

Abstract: Vision Language Models (VLMs) have undergone a rapid evolution, giving rise to significant advancements in the realm of multimodal understanding tasks. However, the majority of these models are trained and evaluated on English-centric datasets, leaving a gap in the development and evaluation of VLMs for other languages, such as Japanese. This gap can be attributed to the lack of methodologies for constructing VLMs and the absence of benchmarks to accurately measure their performance. To address this issue, we introduce a novel benchmark, Japanese Heron-Bench, for evaluating Japanese capabilities of VLMs. The Japanese Heron-Bench consists of a variety of image-question answer pairs tailored to the Japanese context. Additionally, we present a baseline Japanese VLM that has been trained with Japanese visual instruction tuning datasets. Our Heron-Bench reveals the strengths and limitations of the proposed VLM across various ability dimensions. Furthermore, we clarify the capability gap between strong closed models like GPT-4V and the baseline model, providing valuable insights for future research in this domain. We release the benchmark dataset and training code to facilitate further developments in Japanese VLM research.
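
The abstract describes a benchmark of image-question-answer pairs used to score a VLM across ability dimensions. Below is a minimal sketch of one common evaluation loop for a benchmark of this kind (LLaVA-Bench style, where a judge LLM scores model answers against reference answers per category). The record fields and the `query_vlm` / `query_judge` helpers are hypothetical placeholders for illustration, not the authors' released code.

```python
# Sketch of a judge-scored evaluation loop over an image-question benchmark.
# Field names ("image", "question", "reference", "category") and both query
# helpers are assumptions for illustration, not Heron-Bench's actual schema.
import json
from pathlib import Path
from statistics import mean


def query_vlm(image_path: str, question: str) -> str:
    """Hypothetical wrapper around the VLM under test."""
    raise NotImplementedError("plug in your model here")


def query_judge(question: str, reference: str, answer: str) -> float:
    """Hypothetical wrapper around a judge LLM (e.g. GPT-4) that
    returns a numeric score for `answer` against `reference`."""
    raise NotImplementedError("plug in your judge here")


def evaluate(benchmark_file: str, image_dir: str) -> dict[str, float]:
    # Each record is assumed to hold an image file name, a Japanese
    # question, a reference answer, and an ability category.
    records = json.loads(Path(benchmark_file).read_text(encoding="utf-8"))
    scores_by_category: dict[str, list[float]] = {}
    for rec in records:
        answer = query_vlm(str(Path(image_dir) / rec["image"]), rec["question"])
        score = query_judge(rec["question"], rec["reference"], answer)
        scores_by_category.setdefault(rec["category"], []).append(score)
    # Report the mean judge score per ability dimension.
    return {cat: mean(s) for cat, s in scores_by_category.items()}
```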

Authors (6)
  1. Yuichi Inoue (5 papers)
  2. Kento Sasaki (31 papers)
  3. Yuma Ochi (1 paper)
  4. Kazuki Fujii (14 papers)
  5. Kotaro Tanahashi (9 papers)
  6. Yu Yamaguchi (8 papers)
Citations (4)