Manga109Dialog: A Large-scale Dialogue Dataset for Comics Speaker Detection (2306.17469v2)

Published 30 Jun 2023 in cs.CV

Abstract: The expanding market for e-comics has spurred interest in automated methods for analyzing comics. To understand comics more deeply, an automated approach is needed to link each text in a comic to the character who speaks it. Comics speaker detection has practical applications, such as automatic character assignment for audiobooks, automatic translation that reflects characters' personalities, and inference of character relationships and stories. To address the shortage of speaker-to-text annotations, we created a new annotation dataset, Manga109Dialog, based on Manga109. Manga109Dialog is the world's largest comics speaker annotation dataset, containing 132,692 speaker-to-text pairs. We further divided the dataset into levels of prediction difficulty to evaluate speaker detection methods more appropriately. Unlike existing methods, which are mainly based on distances, we propose a deep learning-based method using scene graph generation models. Because comics have unique features, we further improve the proposed model by taking the frame reading order into account. We conducted experiments on Manga109Dialog and other datasets. Experimental results demonstrate that our scene-graph-based approach outperforms existing methods, achieving a prediction accuracy of over 75%.
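
For orientation, the kind of distance-based baseline the abstract contrasts against can be sketched as follows. This is a minimal illustration under assumptions, not the paper's method or any specific prior work: the `Box` layout, the center-to-center Euclidean distance, and the function names `nearest_speaker` and `assign_speakers` are all hypothetical. Each text box is simply assigned to the character whose box center is closest.

```python
from dataclasses import dataclass
from math import hypot
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Box:
    """Axis-aligned bounding box in pixel coordinates (hypothetical layout)."""
    label: str
    xmin: float
    ymin: float
    xmax: float
    ymax: float

    @property
    def center(self) -> Tuple[float, float]:
        return ((self.xmin + self.xmax) / 2, (self.ymin + self.ymax) / 2)


def nearest_speaker(text: Box, characters: List[Box]) -> Optional[Box]:
    """Return the character whose box center is closest to the text box center."""
    if not characters:
        return None
    tx, ty = text.center
    return min(
        characters,
        key=lambda c: hypot(c.center[0] - tx, c.center[1] - ty),
    )


def assign_speakers(texts: List[Box], characters: List[Box]) -> Dict[str, Optional[str]]:
    """Map each text label to the label of its nearest character (or None)."""
    result: Dict[str, Optional[str]] = {}
    for text in texts:
        speaker = nearest_speaker(text, characters)
        result[text.label] = speaker.label if speaker is not None else None
    return result


# Toy page: two characters and two text boxes (made-up coordinates).
characters = [Box("alice", 0, 0, 10, 10), Box("bob", 100, 0, 110, 10)]
texts = [Box("t1", 12, 0, 20, 8), Box("t2", 90, 0, 98, 8)]
pairs = assign_speakers(texts, characters)
```

A purely geometric rule like this ignores frame boundaries and reading order, which is the sort of limitation the abstract gives as motivation for a learned, scene-graph-based approach.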
