MangaVQA and MangaLMM: A Benchmark and Specialized Model for Multimodal Manga Understanding
This paper addresses multimodal manga understanding through two complementary tasks: MangaVQA (Visual Question Answering) and MangaOCR (Optical Character Recognition). It proposes benchmarks for both tasks and a specialized large multimodal model (LMM) named MangaLMM to handle the inherent complexity of manga, which combines intricate visual storytelling with embedded textual content.
MangaVQA and MangaOCR Benchmarks
MangaVQA consists of 526 high-quality, manually constructed question-answer pairs designed to test how well LMMs comprehend both the visual and the textual narrative of manga. The benchmark extends beyond panel-level understanding to two-page spreads, matching the natural reading unit of manga and demanding deeper semantic comprehension.
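To make the benchmark format concrete, the sketch below shows what a single MangaVQA entry might look like as a Python record; the field names and the example question are illustrative assumptions rather than the released schema.

```python
# Illustrative (assumed) structure of one MangaVQA entry:
# a free-form question-answer pair grounded in a two-page spread.
example_entry = {
    "book": "ARMS",                 # Manga109 title (hypothetical choice)
    "image": "ARMS/018.jpg",        # image file for the two-page spread
    "question": "Why does the protagonist refuse the offer in this scene?",
    "answer": "Because he does not trust the organization making it.",
}

# A model receives the spread image and the question, produces a free-form
# answer, and that answer is compared against the reference answer.
```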
Complementing MangaVQA, MangaOCR focuses on detecting and recognizing the text embedded in manga images, such as dialogue and sound effects. It is constructed by consolidating annotations from existing datasets, Manga109 and the manga onomatopoeia dataset, yielding approximately 209K narrative text instances. The emphasis is on precise localization and recognition of text, both of which are essential for treating manga as a multimodal narrative medium.
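Conceptually, each MangaOCR example reduces to a set of (bounding box, transcription) pairs per image. The sketch below illustrates, under assumed field names and with toy coordinates, how dialogue and onomatopoeia annotations might be merged into one instance list.

```python
# Minimal sketch of consolidating text annotations into a single
# MangaOCR-style instance list. Field names are assumptions for illustration.

def merge_text_annotations(dialogue_boxes, onomatopoeia_boxes):
    """Merge two annotation sources into one list of OCR instances.

    Each instance carries a bounding box (x1, y1, x2, y2), the transcribed
    text, and a tag for its source category.
    """
    instances = []
    for box, text in dialogue_boxes:
        instances.append({"bbox": box, "text": text, "kind": "dialogue"})
    for box, text in onomatopoeia_boxes:
        instances.append({"bbox": box, "text": text, "kind": "onomatopoeia"})
    return instances

# Example with toy coordinates (not real annotations):
dialogues = [((120, 80, 260, 190), "ちょっと待って！")]
sfx = [((400, 300, 520, 460), "ドン")]
print(merge_text_annotations(dialogues, sfx))
```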
Development of MangaLMM
The paper then describes MangaLMM, a model finetuned from the open-source LMM Qwen2.5-VL. MangaLMM is trained to handle MangaOCR and MangaVQA jointly, establishing a practical baseline for manga understanding. Finetuning relies on synthetic VQA examples generated with GPT-4o alongside the MangaOCR training data, equipping the model to handle manga's combination of visual and textual information.
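One straightforward way to realize such joint training is to cast both tasks into a shared instruction-following format before finetuning. The sketch below illustrates this idea; the prompt wording and output format are assumptions for illustration, not the paper's exact templates.

```python
# Sketch: casting both tasks into one instruction-following format so a
# single model (e.g., Qwen2.5-VL) can be finetuned on them jointly.

def ocr_sample(image_path, instances):
    """OCR instances -> one supervised example (predict all boxes + texts)."""
    target = "\n".join(f"{i['bbox']} {i['text']}" for i in instances)
    return {
        "image": image_path,
        "prompt": "List every text region in this page as 'bbox text'.",
        "response": target,
    }

def vqa_sample(image_path, question, answer):
    """A synthetic or manual QA pair -> one supervised example."""
    return {
        "image": image_path,
        "prompt": question,
        "response": answer,
    }

# The joint training set is simply the union of the two converted sets.
train_set = (
    [ocr_sample("ARMS/018.jpg",
                [{"bbox": (120, 80, 260, 190), "text": "ちょっと待って！"}])]
    + [vqa_sample("ARMS/018.jpg",
                  "Who is speaking in the top-left panel?",
                  "The protagonist.")]
)
```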
Experimental Insights
The experimental results offer several insights. Even state-of-the-art proprietary models such as GPT-4o struggle with MangaOCR, achieving near-zero scores, although their partial grasp of the embedded text still allows reasonably accurate VQA responses. MangaLMM, in contrast, scores 71.5% on MangaOCR and outperforms GPT-4o on MangaVQA with a score of 6.57 versus 5.76, demonstrating its effectiveness at multimodal manga understanding.
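Scores such as 6.57 versus 5.76 point to a judge-based VQA evaluation on a numeric scale. The sketch below shows one common way to implement such scoring with GPT-4o as the judge; the rubric, the 1-10 scale, and the parsing are assumptions rather than the paper's exact protocol.

```python
# Sketch of LLM-as-judge scoring for MangaVQA-style answers, assuming a
# 1-10 rating scale. Requires the `openai` package and an API key; the
# judge prompt below is an assumption, not the paper's exact rubric.
from openai import OpenAI

client = OpenAI()

def judge_answer(question: str, reference: str, prediction: str) -> float:
    prompt = (
        "Rate how well the predicted answer matches the reference answer "
        "for the given question, on a scale from 1 (wrong) to 10 (perfect). "
        "Reply with only the number.\n"
        f"Question: {question}\nReference: {reference}\nPrediction: {prediction}"
    )
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return float(resp.choices[0].message.content.strip())

# Averaging judge_answer(...) over all 526 benchmark questions yields a
# single VQA score comparable across models.
```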
The paper also examines task interference in multi-task learning: joint training on both tasks slightly lowers OCR performance but yields a marginal gain in VQA. This suggests that strengthened OCR capability carries over usefully to VQA, although further refinement of the training recipe could reduce the interference.
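One simple lever for tuning this trade-off is the sampling ratio between the two task datasets during joint finetuning. The sketch below illustrates weighted sampling; the 0.5/0.5 split is purely illustrative and not a value reported in the paper.

```python
import random

# Sketch: weighted sampling between OCR and VQA examples when building
# joint-training batches. Adjusting ocr_weight trades OCR accuracy against
# VQA gains; the default split here is purely illustrative.
def sample_batch(ocr_data, vqa_data, batch_size=8, ocr_weight=0.5):
    batch = []
    for _ in range(batch_size):
        pool = ocr_data if random.random() < ocr_weight else vqa_data
        batch.append(random.choice(pool))
    return batch
```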
Implications and Future Directions
The implications of this research are multifaceted. Practically, MangaLMM could aid manga creators in evaluating and refining their narratives by acting as a virtual editor with human-like manga understanding. More broadly, it marks a step forward in machine comprehension of tightly coupled visual-textual content, opening the way for more nuanced AI applications in creative industries.
Theoretically, the benchmarks and methodology set forth provide a foundational framework for evaluating future models in multimodal domains, extending beyond manga to other narrative forms combining text and imagery. Furthermore, it prompts exploration into improved finetuning techniques that mitigate task interference while enhancing overall model performance.
In conclusion, this paper contributes valuable benchmarks and methodologies to the field of multimodal understanding in AI, specifically tailored to the intricacies of manga. By releasing open benchmarks, synthetic data, and the MangaLMM model, it offers a comprehensive resource that can catalyze further advancements in multimodal AI research.