MangaVQA and MangaLMM: A Benchmark and Specialized Model for Multimodal Manga Understanding
This paper addresses multimodal manga understanding through two complementary tasks: MangaVQA (Visual Question Answering) and MangaOCR (Optical Character Recognition). It proposes benchmarks for both tasks and a specialized large multimodal model (LMM) named MangaLMM to handle the inherent complexity of manga, which combines intricate visual storytelling with embedded textual content.
MangaVQA and MangaOCR Benchmarks
MangaVQA consists of 526 high-quality, manually constructed question-answer pairs designed to test how well LMMs comprehend both the visual and the textual narrative of manga. The benchmark extends beyond panel-level understanding to two-page spreads, matching the natural reading unit of manga and demanding deeper semantic comprehension.
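To make the benchmark format concrete, the sketch below shows what a single MangaVQA entry might look like as a Python record; the field names and the example question are illustrative assumptions rather than the released schema.

```python
# Illustrative (assumed) structure of one MangaVQA entry:
# a free-form question-answer pair grounded in a two-page spread.
example_entry = {
    "book": "ARMS",                 # Manga109 title (hypothetical choice)
    "image": "ARMS/018.jpg",        # image file for the two-page spread
    "question": "Why does the protagonist refuse the offer in this scene?",
    "answer": "Because he does not trust the organization making it.",
}

# A model receives the spread image and the question, produces a free-form
# answer, and that answer is compared against the reference answer.
```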
Complementing MangaVQA, MangaOCR focuses on detecting and recognizing the text embedded in manga images, such as dialogue and sound effects. It is constructed by consolidating annotations from existing datasets, Manga109 and the manga onomatopoeia dataset, yielding approximately 209K narrative text instances. The emphasis is on precise localization and recognition of text, both of which are essential for treating manga as a multimodal narrative medium.
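Conceptually, each MangaOCR example reduces to a set of (bounding box, transcription) pairs per image. The sketch below illustrates, under assumed field names and with toy coordinates, how dialogue and onomatopoeia annotations might be merged into one instance list.

```python
# Minimal sketch of consolidating text annotations into a single
# MangaOCR-style instance list. Field names are assumptions for illustration.

def merge_text_annotations(dialogue_boxes, onomatopoeia_boxes):
    """Merge two annotation sources into one list of OCR instances.

    Each instance carries a bounding box (x1, y1, x2, y2), the transcribed
    text, and a tag for its source category.
    """
    instances = []
    for box, text in dialogue_boxes:
        instances.append({"bbox": box, "text": text, "kind": "dialogue"})
    for box, text in onomatopoeia_boxes:
        instances.append({"bbox": box, "text": text, "kind": "onomatopoeia"})
    return instances

# Example with toy coordinates (not real annotations):
dialogues = [((120, 80, 260, 190), "ちょっと待って！")]
sfx = [((400, 300, 520, 460), "ドン")]
print(merge_text_annotations(dialogues, sfx))
```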
Development of MangaLMM
The paper then describes MangaLMM, a model finetuned from the open-source LMM Qwen2.5-VL. MangaLMM is trained to handle MangaOCR and MangaVQA jointly, establishing a practical baseline for manga understanding. Finetuning relies on synthetic VQA examples generated with GPT-4o alongside the MangaOCR training data, equipping the model to handle manga's combination of visual and textual information.
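One straightforward way to realize such joint training is to cast both tasks into a shared instruction-following format before finetuning. The sketch below illustrates this idea; the prompt wording and output format are assumptions for illustration, not the paper's exact templates.

```python
# Sketch: casting both tasks into one instruction-following format so a
# single model (e.g., Qwen2.5-VL) can be finetuned on them jointly.

def ocr_sample(image_path, instances):
    """OCR instances -> one supervised example (predict all boxes + texts)."""
    target = "\n".join(f"{i['bbox']} {i['text']}" for i in instances)
    return {
        "image": image_path,
        "prompt": "List every text region in this page as 'bbox text'.",
        "response": target,
    }

def vqa_sample(image_path, question, answer):
    """A synthetic or manual QA pair -> one supervised example."""
    return {
        "image": image_path,
        "prompt": question,
        "response": answer,
    }

# The joint training set is simply the union of the two converted sets.
train_set = (
    [ocr_sample("ARMS/018.jpg",
                [{"bbox": (120, 80, 260, 190), "text": "ちょっと待って！"}])]
    + [vqa_sample("ARMS/018.jpg",
                  "Who is speaking in the top-left panel?",
                  "The protagonist.")]
)
```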
Experimental Insights
The experimental results offer several insights. Even state-of-the-art proprietary models such as GPT-4o struggle with MangaOCR, achieving near-zero scores, although their partial grasp of the embedded text still allows reasonably accurate VQA responses. MangaLMM, in contrast, scores 71.5% on MangaOCR and outperforms GPT-4o on MangaVQA with a score of 6.57 versus 5.76, demonstrating its effectiveness at multimodal manga understanding.
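Scores such as 6.57 versus 5.76 point to a judge-based VQA evaluation on a numeric scale. The sketch below shows one common way to implement such scoring with GPT-4o as the judge; the rubric, the 1-10 scale, and the parsing are assumptions rather than the paper's exact protocol.

```python
# Sketch of LLM-as-judge scoring for MangaVQA-style answers, assuming a
# 1-10 rating scale. Requires the `openai` package and an API key; the
# judge prompt below is an assumption, not the paper's exact rubric.
from openai import OpenAI

client = OpenAI()

def judge_answer(question: str, reference: str, prediction: str) -> float:
    prompt = (
        "Rate how well the predicted answer matches the reference answer "
        "for the given question, on a scale from 1 (wrong) to 10 (perfect). "
        "Reply with only the number.\n"
        f"Question: {question}\nReference: {reference}\nPrediction: {prediction}"
    )
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return float(resp.choices[0].message.content.strip())

# Averaging judge_answer(...) over all 526 benchmark questions yields a
# single VQA score comparable across models.
```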
The paper also examines task interference in multi-task learning: joint training on both tasks slightly lowers OCR performance but yields a marginal gain in VQA. This suggests that strengthened OCR capability carries over usefully to VQA, although further refinement of the training recipe could reduce the interference.
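One simple lever for tuning this trade-off is the sampling ratio between the two task datasets during joint finetuning. The sketch below illustrates weighted sampling; the 0.5/0.5 split is purely illustrative and not a value reported in the paper.

```python
import random

# Sketch: weighted sampling between OCR and VQA examples when building
# joint-training batches. Adjusting ocr_weight trades OCR accuracy against
# VQA gains; the default split here is purely illustrative.
def sample_batch(ocr_data, vqa_data, batch_size=8, ocr_weight=0.5):
    batch = []
    for _ in range(batch_size):
        pool = ocr_data if random.random() < ocr_weight else vqa_data
        batch.append(random.choice(pool))
    return batch
```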
Implications and Future Directions
The implications of this research are multifaceted. Practically, MangaLMM could aid manga creators in evaluating and refining their narratives by acting as a virtual editor with human-like manga understanding. More broadly, it marks a step forward in machine comprehension of tightly coupled visual-textual content, opening the way for more nuanced AI applications in creative industries.
Theoretically, the benchmarks and methodology set forth provide a foundational framework for evaluating future models in multimodal domains, extending beyond manga to other narrative forms combining text and imagery. Furthermore, it prompts exploration into improved finetuning techniques that mitigate task interference while enhancing overall model performance.
In conclusion, this paper contributes valuable benchmarks and methodologies to the field of multimodal understanding in AI, specifically tailored to the intricacies of manga. By releasing open benchmarks, synthetic data, and the MangaLMM model, it offers a comprehensive resource that can catalyze further advancements in multimodal AI research.