Read and Think: An Efficient Step-wise Multimodal Language Model for Document Understanding and Reasoning (2403.00816v3)

Published 26 Feb 2024 in cs.IR, cs.AI, and cs.CV

Abstract: Understanding the contents of multimodal documents is essential to accurately extract relevant evidence and use it for reasoning. Existing document understanding models tend to generate answers with a single word or phrase directly, ignoring the source document's evidence and lacking interpretability. In this work, we address the lack of step-wise capabilities through data augmentation and extension. Specifically, we use Multi-modal LLMs (MLLMs), which have strong visual understanding and reasoning abilities, as data generators to produce step-wise question-and-answer pairs for document images, and we use a high-performance LLM as the error detector to filter out noisy data. This step-wise data generation pipeline is implemented using both template-based and few-shot methods. We then use the generated high-quality data to train a humanized document understanding and reasoning model, dubbed DocAssistant, specifically designed to solve complex questions that require reasoning or multi-hop question answering. Experimental results demonstrate the effectiveness and application value of step-wise generation, showing a 5% improvement on InfoVQA with complex layouts and a 7% improvement on ChartQA with complex reasoning, compared to directly generated answers. We hope our work highlights the potential of synthetic data and encourages further exploration of multi-modal document reasoning capabilities.
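
The abstract outlines a two-stage synthetic-data pipeline: an MLLM generates step-wise question-answer pairs for document images (via template-based or few-shot prompting), and a text LLM then acts as an error detector that filters out noisy pairs before DocAssistant is trained. The paper's actual implementation is not shown on this page, so the sketch below is only a minimal illustration of that flow under stated assumptions; the `mllm`/`llm` objects, their `generate` methods, the prompt, and the `parse_stepwise` output format are hypothetical placeholders, not the authors' code.

```python
from dataclasses import dataclass, field


@dataclass
class StepwiseQA:
    """One synthetic training example: a question, evidence-grounded
    reasoning steps, and a final answer for a single document image."""
    question: str
    steps: list[str] = field(default_factory=list)
    answer: str = ""


# Illustrative few-shot-style instruction for the MLLM generator (assumed format).
GENERATION_PROMPT = (
    "Look at the document image and write a question that requires "
    "multi-hop reasoning over its contents. Respond in the format:\n"
    "Question: ...\nStep 1: ...\nStep 2: ...\nAnswer: ..."
)


def parse_stepwise(raw: str) -> StepwiseQA:
    """Parse the generator's 'Question / Step i / Answer' text into a record."""
    qa = StepwiseQA(question="")
    for line in raw.splitlines():
        line = line.strip()
        if ":" not in line:
            continue
        lower = line.lower()
        if lower.startswith("question:"):
            qa.question = line.split(":", 1)[1].strip()
        elif lower.startswith("step"):
            qa.steps.append(line.split(":", 1)[1].strip())
        elif lower.startswith("answer:"):
            qa.answer = line.split(":", 1)[1].strip()
    return qa


def generate_candidates(mllm, image_path: str, n_samples: int = 3) -> list[StepwiseQA]:
    """Stage 1: the MLLM acts as the data generator for one document image.
    `mllm.generate` is a hypothetical API standing in for any MLLM client."""
    return [
        parse_stepwise(mllm.generate(image=image_path, prompt=GENERATION_PROMPT))
        for _ in range(n_samples)
    ]


def passes_error_check(llm, qa: StepwiseQA) -> bool:
    """Stage 2: a text LLM acts as the error detector, keeping a pair only if
    the reasoning steps are judged to support the final answer."""
    verdict = llm.generate(  # hypothetical API
        prompt=(
            f"Question: {qa.question}\n"
            f"Steps: {qa.steps}\n"
            f"Answer: {qa.answer}\n"
            "Do the steps logically support the answer? Reply YES or NO."
        )
    )
    return verdict.strip().upper().startswith("YES")


def build_training_set(mllm, llm, image_paths: list[str]) -> list[StepwiseQA]:
    """Generate step-wise QA candidates per image, then filter out noisy ones."""
    dataset = []
    for path in image_paths:
        for qa in generate_candidates(mllm, path):
            if passes_error_check(llm, qa):
                dataset.append(qa)
    return dataset
```

In this framing, the filter stage plays the abstract's "error detector" role: only pairs whose intermediate steps are judged to entail the final answer are retained for fine-tuning.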

Authors (1)
  1. Jinxu Zhang (2 papers)