
The Manga Whisperer: Automatically Generating Transcriptions for Comics (2401.10224v3)

Published 18 Jan 2024 in cs.CV

Abstract: In the past few decades, Japanese comics, commonly referred to as Manga, have transcended both cultural and linguistic boundaries to become a true worldwide sensation. Yet, the inherent reliance on visual cues and illustration within manga renders it largely inaccessible to individuals with visual impairments. In this work, we seek to address this substantial barrier, with the aim of ensuring that manga can be appreciated and actively engaged with by everyone. Specifically, we tackle the problem of diarisation, i.e. generating a transcription of who said what and when, in a fully automatic way. To this end, we make the following contributions: (1) we present a unified model, Magi, that is able to (a) detect panels, text boxes and character boxes, (b) cluster characters by identity (without knowing the number of clusters a priori), and (c) associate dialogues to their speakers; (2) we propose a novel approach that is able to sort the detected text boxes in their reading order and generate a dialogue transcript; (3) we annotate an evaluation benchmark for this task using publicly available [English] manga pages. The code, evaluation datasets and the pre-trained model can be found at: https://github.com/ragavsachdeva/magi.


Summary

  • The paper introduces Magi, a unified model that automatically transcribes manga by detecting panels, text boxes, and character boxes, clustering characters by identity, and associating dialogue with its speaker.
  • It combines a CNN-backed detection transformer with a reading-order method based on topological sorting of a directed acyclic graph to handle complex comic layouts.
  • Evaluation on the PopManga benchmark demonstrates strong accuracy in detection, character clustering, and speaker association, supporting accessibility for visually impaired readers.

Overview of the Manga Whisperer

The "Manga Whisperer" is a name given to an innovative model called Magi, designed to automatically generate transcriptions for manga comics, making them more accessible to visually impaired individuals. Manga has seen a surge in global popularity, but its heavily visual medium presents a barrier to those who cannot experience the illustrations directly. The goal of the Manga Whisperer project is to remove this barrier by transcribing manga content into text, effectively narrating the visual story elements.

Addressing the Challenges

To accomplish this, the model must detect and order panels, recognize characters across widely varying art styles and poses, and associate each dialogue with the correct speaker. Manga's unconventional layouts and frequently non-human characters compound these challenges. Magi addresses them with a CNN-backed transformer that processes the manga page, paired with a graph-generation formulation that jointly detects characters and text and predicts the associations between them, as sketched below.
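For intuition, here is a minimal sketch of the two association steps the paragraph describes: grouping character crops by identity without knowing the number of clusters a priori, and assigning each text box to a speaker. It uses off-the-shelf agglomerative clustering with a distance threshold and a simple similarity argmax; both are illustrative stand-ins for Magi's learned components, not the paper's actual implementation.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def cluster_characters(char_embeddings: np.ndarray, threshold: float = 0.5):
    """Group character crops by identity without fixing the number of
    clusters a priori: keep merging until every inter-cluster distance
    exceeds the threshold. (Stand-in for Magi's learned module.)"""
    clustering = AgglomerativeClustering(
        n_clusters=None,               # number of identities is unknown
        distance_threshold=threshold,  # stop merging above this distance
        metric="cosine",
        linkage="average",
    )
    return clustering.fit_predict(char_embeddings)  # cluster id per crop

def associate_speakers(text_embeddings: np.ndarray,
                       char_embeddings: np.ndarray) -> np.ndarray:
    """Assign each text box to the character with the highest pairwise
    score; cosine similarity is a placeholder for a learned scorer."""
    text_n = text_embeddings / np.linalg.norm(text_embeddings, axis=1, keepdims=True)
    char_n = char_embeddings / np.linalg.norm(char_embeddings, axis=1, keepdims=True)
    scores = text_n @ char_n.T     # shape: (num_texts, num_characters)
    return scores.argmax(axis=1)   # speaker index per text box
```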

Technical Contributions

The paper's key contributions are threefold. First, the unified model Magi detects manga panels, text boxes, and character boxes, clusters character identities, and associates dialogues with their respective speakers. Second, a new method sorts the detected text boxes into reading order by constructing a directed acyclic graph (DAG) and applying topological sorting, which is more robust than previous approaches; a minimal sketch of this step follows. Third, to evaluate the model's performance, the research introduces PopManga, a challenging benchmark dataset sourced from over 80 popular manga series.
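To make the sorting step concrete, here is a minimal sketch of Kahn-style topological sorting over a DAG of pairwise "reads-before" constraints between text boxes. The `reads_before` edge list is a hypothetical input standing in for whatever ordering constraints are derived from the page layout.

```python
from collections import deque

def reading_order(num_texts: int, reads_before: list[tuple[int, int]]) -> list[int]:
    """Kahn's algorithm: given directed edges (a, b) meaning text box
    `a` is read before text box `b`, return one valid reading order.
    Raises if the constraints contain a cycle (i.e. are not a DAG)."""
    succ = [[] for _ in range(num_texts)]
    indegree = [0] * num_texts
    for a, b in reads_before:
        succ[a].append(b)
        indegree[b] += 1

    queue = deque(i for i in range(num_texts) if indegree[i] == 0)
    order = []
    while queue:
        node = queue.popleft()
        order.append(node)
        for nxt in succ[node]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                queue.append(nxt)

    if len(order) != num_texts:
        raise ValueError("ordering constraints contain a cycle")
    return order

# e.g. reading_order(4, [(0, 1), (0, 2), (1, 3), (2, 3)]) -> [0, 1, 2, 3]
```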

Looking Forward

The paper's results are compelling, establishing Magi as a state-of-the-art model for manga diarisation, with strong accuracy in character detection, clustering, and speaker association. Beyond its contribution to making manga accessible to the visually impaired, the work opens future research directions, such as combining Magi with LLMs to enrich the transcript by taking into account conversational context and earlier plot events.
