
Zero-Shot Character Identification and Speaker Prediction in Comics via Iterative Multimodal Fusion (2404.13993v4)

Published 22 Apr 2024 in cs.MM and cs.CV

Abstract: Recognizing characters and predicting the speakers of dialogue are critical for comic processing tasks such as voice generation and translation. However, because characters vary by comic title, supervised learning approaches, such as training character classifiers that require annotations specific to each comic title, are infeasible. This motivates us to propose a novel zero-shot approach that allows machines to identify characters and predict speaker names based solely on unannotated comic images. Despite their importance in real-world applications, these tasks have remained largely unexplored due to challenges in story comprehension and multimodal integration. Recent LLMs have shown great capability for text understanding and reasoning, but their application to multimodal content analysis is still an open problem. To address this problem, we propose an iterative multimodal framework, the first to employ multimodal information for both character identification and speaker prediction. Our experiments demonstrate the effectiveness of the proposed framework, establishing a robust baseline for these tasks. Furthermore, since our method requires no training data or annotations, it can be used as-is on any comic series.
