Tails Tell Tales: Chapter-Wide Manga Transcriptions with Character Names (2408.00298v1)

Published 1 Aug 2024 in cs.CV

Abstract: Enabling engagement of manga by visually impaired individuals presents a significant challenge due to its inherently visual nature. With the goal of fostering accessibility, this paper aims to generate a dialogue transcript of a complete manga chapter, entirely automatically, with a particular emphasis on ensuring narrative consistency. This entails identifying (i) what is being said, i.e., detecting the texts on each page and classifying them into essential vs non-essential, and (ii) who is saying it, i.e., attributing each dialogue to its speaker, while ensuring the same characters are named consistently throughout the chapter. To this end, we introduce: (i) Magiv2, a model that is capable of generating high-quality chapter-wide manga transcripts with named characters and significantly higher precision in speaker diarisation over prior works; (ii) an extension of the PopManga evaluation dataset, which now includes annotations for speech-bubble tail boxes, associations of text to corresponding tails, classifications of text as essential or non-essential, and the identity for each character box; and (iii) a new character bank dataset, which comprises over 11K characters from 76 manga series, featuring 11.5K exemplar character images in total, as well as a list of chapters in which they appear. The code, trained model, and both datasets can be found at: https://github.com/ragavsachdeva/magi

Summary

  • The paper introduces Magiv2, which enhances manga transcription by integrating graph-based detection and constraint optimization to accurately assign character names.
  • It employs a novel approach that detects panels, text, characters, and speech bubble tails while utilizing an 11K character bank for consistent naming.
  • Evaluation metrics show Magiv2 outperforms previous models in detection precision and association accuracy, advancing transcript quality and accessibility.

An Analysis of "Tails Tell Tales: Chapter-Wide Manga Transcriptions with Character Names"

Introduction

The automatic generation of chapter-wide manga transcripts is challenging because manga is inherently visual, interleaving artwork, dialogue text, and complex character interactions. This paper by Sachdeva et al. addresses key limitations of prior work, chiefly the earlier Magi model, by introducing Magiv2, which generates complete chapter-wide transcripts with named characters and a more consistent, coherent narrative.

Methodology

The proposed Magiv2 model comprises three main components:

  1. Detection and Association: Magiv2 improves upon prior models by detecting panels, texts, characters, and speech-bubble tails, the tail being a strong cue to the speaker's identity. Detections are organized as a graph whose nodes are bounding boxes and whose edges represent associations (e.g., text-character, text-tail); a toy version of this matching appears in the first sketch after this list.
  2. Chapter-Wide Character Naming: A character bank termed "PopCharacters" is leveraged to name characters consistently throughout a chapter. The bank comprises over 11K characters with exemplar images. Naming is formulated as a constraint-optimization problem that enforces must-link and cannot-link constraints across character occurrences in a chapter; a greedy simplification is given in the second sketch after this list.
  3. Transcript Generation: Combining the detection and association outputs with optical character recognition (OCR), Magiv2 produces the final transcript, filtering out non-essential texts (e.g., sound effects) to maintain narrative coherence; see the third sketch after this list.
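
To make the graph formulation concrete, here is a minimal, runnable sketch of the association step. Everything in it is an illustrative assumption: the toy boxes, the `edge_scores` stand-in (plain center distance instead of Magiv2's learned edge scores), and the greedy matcher.

```python
import numpy as np

def center(box):
    """Center (x, y) of an (x1, y1, x2, y2) bounding box."""
    x1, y1, x2, y2 = box
    return np.array([(x1 + x2) / 2.0, (y1 + y2) / 2.0])

def edge_scores(src_boxes, dst_boxes):
    """Pairwise affinity between two detection sets.

    Magiv2 learns these edge scores; negative center distance is a
    geometric stand-in so the sketch runs end to end."""
    return -np.array([[np.linalg.norm(center(s) - center(d))
                       for d in dst_boxes] for s in src_boxes])

def greedy_match(scores):
    """Greedily pair each source with its best unused destination."""
    matches, used = {}, set()
    for idx in np.argsort(scores, axis=None)[::-1]:   # best pairs first
        i, j = np.unravel_index(idx, scores.shape)
        if i not in matches and j not in used:
            matches[int(i)] = int(j)
            used.add(int(j))
    return matches

# Toy detections for one page: (x1, y1, x2, y2) boxes.
text_boxes = [(10, 10, 60, 30), (100, 10, 150, 30)]
tail_boxes = [(30, 30, 40, 45), (120, 30, 130, 45)]
char_boxes = [(20, 50, 70, 120), (110, 50, 160, 120)]

# Text -> tail -> character: the tail anchors a bubble to its speaker.
text_to_tail = greedy_match(edge_scores(text_boxes, tail_boxes))
tail_to_char = greedy_match(edge_scores(tail_boxes, char_boxes))
speaker_of_text = {t: tail_to_char.get(tl) for t, tl in text_to_tail.items()}
print(speaker_of_text)   # {0: 0, 1: 1}
```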
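
The naming step can likewise be sketched as constrained assignment: must-linked crops share one identity, cannot-linked crops must differ, and each cluster takes its best-scoring bank name. The greedy solver, the similarity values, and the bank names below are toy assumptions, not the paper's exact optimization.

```python
import numpy as np

def find(parent, x):
    """Union-find root lookup with path halving."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

def name_characters(sims, must_link, cannot_link, names):
    """Greedy stand-in for Magiv2's constrained name assignment.

    sims:        (num_crops, num_names) similarity to bank exemplars
    must_link:   crop index pairs forced to share a name
    cannot_link: crop index pairs forced to receive different names
    """
    n = sims.shape[0]
    parent = list(range(n))
    for a, b in must_link:                      # merge must-linked crops
        parent[find(parent, a)] = find(parent, b)

    clusters = {}
    for i in range(n):
        clusters.setdefault(find(parent, i), []).append(i)
    scores = {r: sims[m].mean(axis=0) for r, m in clusters.items()}

    conflict = {r: set() for r in clusters}     # lift constraints to clusters
    for a, b in cannot_link:
        ra, rb = find(parent, a), find(parent, b)
        if ra != rb:
            conflict[ra].add(rb)
            conflict[rb].add(ra)

    assignment = {}
    for r in sorted(clusters, key=lambda c: -scores[c].max()):
        banned = {assignment[o] for o in conflict[r] if o in assignment}
        for j in np.argsort(-scores[r]):        # best name not yet banned
            if names[j] not in banned:
                assignment[r] = names[j]
                break
    return {i: assignment.get(find(parent, i)) for i in range(n)}

# Toy example: crops 0 and 1 must match, crops 1 and 2 must differ.
bank = ["Alice", "Bob", "Carol"]               # hypothetical bank names
sims = np.array([[0.9, 0.2, 0.1],
                 [0.8, 0.3, 0.2],
                 [0.2, 0.7, 0.6]])
print(name_characters(sims, [(0, 1)], [(1, 2)], bank))
# {0: 'Alice', 1: 'Alice', 2: 'Bob'}
```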
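
Transcript assembly is then mostly bookkeeping. A minimal sketch, assuming each text has already been OCR'd, classified, attributed, and placed in reading order (the entry schema here is invented for illustration):

```python
def build_transcript(entries):
    """Render "Speaker: line" output, dropping non-essential texts
    (sound effects, background signs) as the paper does."""
    lines = []
    for e in entries:
        if e["is_essential"]:
            lines.append(f'{e["speaker"] or "<unknown>"}: {e["ocr_text"]}')
    return "\n".join(lines)

page = [
    {"speaker": "Alice", "ocr_text": "Did you hear that?", "is_essential": True},
    {"speaker": None,    "ocr_text": "KRAK",               "is_essential": False},
    {"speaker": "Bob",   "ocr_text": "Stay close to me.",  "is_essential": True},
]
print(build_transcript(page))
# Alice: Did you hear that?
# Bob: Stay close to me.
```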

Evaluation Metrics

The quality of Magiv2's outputs was evaluated using several metrics:

  • Detection: Average precision for detecting characters, texts, panels, and tails.
  • Association: metrics such as Adjusted Mutual Information (AMI), Normalized Mutual Information (NMI), precision, recall, and mean average precision (mAP) were used; the clustering scores can be computed as in the sketch after this list.
  • Text Classification: Differentiation between essential and non-essential text was validated using average precision.
  • Character Naming: Accuracy of character naming across chapters was assessed, comparing constraint optimization with traditional clustering approaches.
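
For the clustering-style association metrics, off-the-shelf implementations suffice. A quick sketch using scikit-learn, with hypothetical gold and predicted identity labels for six character boxes:

```python
from sklearn.metrics import (adjusted_mutual_info_score,
                             normalized_mutual_info_score)

gold = [0, 0, 1, 1, 2, 2]   # ground-truth identity per character box
pred = [0, 0, 1, 2, 2, 2]   # model's clustering of the same boxes
print(adjusted_mutual_info_score(gold, pred))    # chance-corrected agreement
print(normalized_mutual_info_score(gold, pred))  # in [0, 1]; 1 = identical
```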

Results and Implications

Magiv2 demonstrated significant improvements over Magi and other baselines in several areas:

  1. Detection and Association:
    • Achieved higher average precision in detecting characters and texts.
    • Improved clustering and association performance, especially in speaker diarisation.
  2. Character Naming:
    • The constraint-optimization approach notably outperformed traditional clustering methods, yielding robust, consistent character naming across chapters.

Contributions and Future Work

This paper extends the capabilities of manga transcript generation, especially for applications involving visually impaired individuals. The PopManga evaluation dataset was expanded with new annotations for speech-bubble tails and character identities. The introduction of the PopCharacters dataset enriches the resource pool for future research.

Future directions include refining the per-page model to improve overall performance, training on larger datasets to reduce noise, and exploring vision-language models for better speaker attribution. LLMs could also be applied to post-correct OCR output and improve transcript coherence.

Conclusion

The advancements presented in Magiv2 mark a significant step toward making manga more accessible through automated transcription. This work not only addresses prior limitations but also provides a framework for developing more refined models in the future, potentially extending its applicability beyond manga to other forms of visual storytelling media. The datasets introduced here can serve as a cornerstone for further research and innovations in this domain.
