JDocQA: Japanese Document Question Answering Dataset for Generative Language Models (2403.19454v1)

Published 28 Mar 2024 in cs.CL

Abstract: Document question answering is the task of answering questions about given documents such as reports, slides, pamphlets, and websites. It is a demanding task, both because paper and electronic documents are ubiquitous in society and because it requires understanding not only text but also figures and tables; visual question answering (VQA) methods are therefore often examined alongside textual approaches. We introduce Japanese Document Question Answering (JDocQA), a large-scale document-based QA dataset that essentially requires both visual and textual information to answer questions. It comprises 5,504 documents in PDF format and 11,600 annotated question-and-answer instances in Japanese. Each QA instance includes references to the document pages and bounding boxes for the answer clues. We incorporate multiple categories of questions, as well as questions unanswerable from the document, for realistic question-answering applications. We empirically evaluate the effectiveness of our dataset with text-based LLMs and multimodal models. Incorporating unanswerable questions in finetuning may help suppress so-called hallucination generation.
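For concreteness, here is a minimal sketch of what a JDocQA-style instance and the unanswerable-question finetuning setup described above might look like. The field names, the bounding-box coordinate convention, and the refusal string are illustrative assumptions, not the dataset's actual schema.

```python
from dataclasses import dataclass, field

# Hypothetical refusal target for unanswerable questions; the actual
# target string used in JDocQA finetuning is not specified here.
UNANSWERABLE_TARGET = "本文からは回答できません。"  # "Cannot be answered from the document."

@dataclass
class JDocQAInstance:
    """One QA instance, per the dataset description: a Japanese question
    and answer grounded in a PDF document, with page references and
    bounding boxes locating the answer clues. Field names are illustrative."""
    question: str
    answer: str
    question_type: str              # e.g. yes/no, factoid, numerical, open-ended
    answerable: bool                # False for the unanswerable category
    pdf_path: str                   # source document (one of 5,504 PDFs)
    evidence_pages: list[int] = field(default_factory=list)
    # One box per answer clue, as (x_min, y_min, x_max, y_max) in page
    # coordinates (assumed convention).
    evidence_boxes: list[tuple[float, float, float, float]] = field(default_factory=list)

def to_finetuning_pair(inst: JDocQAInstance) -> tuple[str, str]:
    """Map an instance to an (input, target) pair for supervised finetuning.
    Unanswerable questions are paired with a refusal target, the strategy
    the abstract suggests may suppress hallucinated answers."""
    target = inst.answer if inst.answerable else UNANSWERABLE_TARGET
    return inst.question, target
```

Pairing unanswerable questions with an explicit refusal target gives the model supervised evidence for declining to answer, rather than training it only on questions that always have an answer.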

Authors (4)
  1. Eri Onami
  2. Shuhei Kurita
  3. Taiki Miyanishi
  4. Taro Watanabe