
The Manga Whisperer: Automatically Generating Transcriptions for Comics (2401.10224v3)

Published 18 Jan 2024 in cs.CV

Abstract: In the past few decades, Japanese comics, commonly referred to as Manga, have transcended both cultural and linguistic boundaries to become a true worldwide sensation. Yet, the inherent reliance on visual cues and illustration within manga renders it largely inaccessible to individuals with visual impairments. In this work, we seek to address this substantial barrier, with the aim of ensuring that manga can be appreciated and actively engaged with by everyone. Specifically, we tackle the problem of diarisation, i.e. generating a transcription of who said what and when, in a fully automatic way. To this end, we make the following contributions: (1) we present a unified model, Magi, that is able to (a) detect panels, text boxes and character boxes, (b) cluster characters by identity (without knowing the number of clusters a priori), and (c) associate dialogues to their speakers; (2) we propose a novel approach that is able to sort the detected text boxes in their reading order and generate a dialogue transcript; (3) we annotate an evaluation benchmark for this task using publicly available [English] manga pages. The code, evaluation datasets and the pre-trained model can be found at: https://github.com/ragavsachdeva/magi.


Summary

  • The paper introduces Magi, a unified model that automatically transcribes manga by detecting panels, text boxes, and character boxes, clustering characters by identity, and associating dialogue with its speaker.
  • It combines a CNN-backed detection transformer with a reading-order method based on topological sorting of a directed acyclic graph to handle complex comic layouts.
  • Evaluation on the PopManga benchmark demonstrates strong accuracy in detection, character clustering, and speaker association, supporting accessibility for visually impaired readers.

Overview of the Manga Whisperer

The "Manga Whisperer" is a name given to an innovative model called Magi, designed to automatically generate transcriptions for manga comics, making them more accessible to visually impaired individuals. Manga has seen a surge in global popularity, but its heavily visual medium presents a barrier to those who cannot experience the illustrations directly. The goal of the Manga Whisperer project is to remove this barrier by transcribing manga content into text, effectively narrating the visual story elements.

Addressing the Challenges

To accomplish this, the model must detect and order panels, recognize characters across widely varying art styles and poses, and associate each dialogue with the correct speaker. Manga's unconventional layouts and frequently non-human characters compound these challenges. Magi addresses them with a CNN-backed transformer that processes the manga page, paired with a graph-generation formulation that jointly detects characters and text and predicts the associations between them, as sketched below.
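For intuition, here is a minimal sketch of the two association steps the paragraph describes: grouping character crops by identity without knowing the number of clusters a priori, and assigning each text box to a speaker. It uses off-the-shelf agglomerative clustering with a distance threshold and a simple similarity argmax; both are illustrative stand-ins for Magi's learned components, not the paper's actual implementation.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def cluster_characters(char_embeddings: np.ndarray, threshold: float = 0.5):
    """Group character crops by identity without fixing the number of
    clusters a priori: keep merging until every inter-cluster distance
    exceeds the threshold. (Stand-in for Magi's learned module.)"""
    clustering = AgglomerativeClustering(
        n_clusters=None,               # number of identities is unknown
        distance_threshold=threshold,  # stop merging above this distance
        metric="cosine",
        linkage="average",
    )
    return clustering.fit_predict(char_embeddings)  # cluster id per crop

def associate_speakers(text_embeddings: np.ndarray,
                       char_embeddings: np.ndarray) -> np.ndarray:
    """Assign each text box to the character with the highest pairwise
    score; cosine similarity is a placeholder for a learned scorer."""
    text_n = text_embeddings / np.linalg.norm(text_embeddings, axis=1, keepdims=True)
    char_n = char_embeddings / np.linalg.norm(char_embeddings, axis=1, keepdims=True)
    scores = text_n @ char_n.T     # shape: (num_texts, num_characters)
    return scores.argmax(axis=1)   # speaker index per text box
```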

Technical Contributions

The paper's key contributions are threefold. First, the unified model Magi detects manga panels, text boxes, and character boxes, clusters character identities, and associates dialogues with their respective speakers. Second, a new method sorts the detected text boxes into reading order by constructing a directed acyclic graph (DAG) and applying topological sorting, which is more robust than previous approaches; a minimal sketch of this step follows. Third, to evaluate the model's performance, the research introduces PopManga, a challenging benchmark dataset sourced from over 80 popular manga series.
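To make the sorting step concrete, here is a minimal sketch of Kahn-style topological sorting over a DAG of pairwise "reads-before" constraints between text boxes. The `reads_before` edge list is a hypothetical input standing in for whatever ordering constraints are derived from the page layout.

```python
from collections import deque

def reading_order(num_texts: int, reads_before: list[tuple[int, int]]) -> list[int]:
    """Kahn's algorithm: given directed edges (a, b) meaning text box
    `a` is read before text box `b`, return one valid reading order.
    Raises if the constraints contain a cycle (i.e. are not a DAG)."""
    succ = [[] for _ in range(num_texts)]
    indegree = [0] * num_texts
    for a, b in reads_before:
        succ[a].append(b)
        indegree[b] += 1

    queue = deque(i for i in range(num_texts) if indegree[i] == 0)
    order = []
    while queue:
        node = queue.popleft()
        order.append(node)
        for nxt in succ[node]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                queue.append(nxt)

    if len(order) != num_texts:
        raise ValueError("ordering constraints contain a cycle")
    return order

# e.g. reading_order(4, [(0, 1), (0, 2), (1, 3), (2, 3)]) -> [0, 1, 2, 3]
```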

Looking Forward

The paper's results are compelling, establishing Magi as a state-of-the-art model for manga diarisation, with strong accuracy in character detection, clustering, and speaker association. Beyond its contribution to making manga accessible to the visually impaired, the work opens future research directions, such as combining Magi with LLMs to enrich the transcript by taking into account conversational context and earlier plot events.
