HRVDA: High-Resolution Visual Document Assistant (2404.06918v1)
Abstract: Leveraging vast training data, multimodal LLMs (MLLMs) have demonstrated formidable general visual comprehension capabilities and achieved remarkable performance across various tasks. However, their performance in visual document understanding still leaves much room for improvement. This discrepancy is primarily attributed to the fact that visual document understanding is a fine-grained prediction task. In natural scenes, MLLMs typically use low-resolution images, leading to a substantial loss of visual information. Furthermore, general-purpose MLLMs do not excel in handling document-oriented instructions. In this paper, we propose a High-Resolution Visual Document Assistant (HRVDA), which bridges the gap between MLLMs and visual document understanding. This model employs a content filtering mechanism and an instruction filtering module to separately filter out the content-agnostic visual tokens and instruction-agnostic visual tokens, thereby achieving efficient model training and inference for high-resolution images. In addition, we construct a document-oriented visual instruction tuning dataset and apply a multi-stage training strategy to enhance the model's document modeling capabilities. Extensive experiments demonstrate that our model achieves state-of-the-art performance across multiple document understanding datasets, while maintaining training efficiency and inference speed comparable to low-resolution models.
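The abstract describes two filtering stages but does not say how either is computed. The following is a minimal sketch of the general idea only, not the authors' implementation. It assumes each visual token already has a content score (a hypothetical per-token value in [0, 1]) and that instruction relevance can be approximated by cosine similarity to a single instruction embedding. The function names, the threshold, and the keep ratio are all illustrative assumptions.

```python
from math import sqrt


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def content_filter(tokens, content_scores, threshold=0.5):
    """Stage 1 (assumed form): drop tokens whose content score is below the threshold.

    `content_scores` is a hypothetical per-token score; the paper's actual
    content filtering mechanism is not specified in the abstract.
    """
    return [tok for tok, score in zip(tokens, content_scores) if score >= threshold]


def instruction_filter(tokens, instruction_vec, keep_ratio=0.5):
    """Stage 2 (assumed form): keep the tokens most similar to the instruction.

    Tokens are ranked by cosine similarity to `instruction_vec`; the top
    `keep_ratio` fraction is kept, in the original token order.
    """
    k = max(1, int(len(tokens) * keep_ratio))
    ranked = sorted(
        range(len(tokens)),
        key=lambda i: cosine(tokens[i], instruction_vec),
        reverse=True,
    )
    return [tokens[i] for i in sorted(ranked[:k])]


# Toy example: four 2-d "visual tokens" and one instruction embedding.
tokens = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.2, 0.8]]
content_scores = [0.9, 0.1, 0.8, 0.7]   # hypothetical scores; token 1 is "blank"
instruction_vec = [1.0, 0.0]            # hypothetical instruction embedding

after_content = content_filter(tokens, content_scores)
after_instruction = instruction_filter(after_content, instruction_vec, keep_ratio=0.67)
print(len(tokens), len(after_content), len(after_instruction))
```

Running the two stages in sequence shrinks the token set before it reaches the language model, which is the efficiency argument the abstract makes for high-resolution inputs.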
- Chaohu Liu
- Kun Yin
- Haoyu Cao
- Xinghua Jiang
- Xin Li
- Yinsong Liu
- Deqiang Jiang
- Xing Sun
- Linli Xu