DocPedia: Unleashing the Power of Large Multimodal Model in the Frequency Domain for Versatile Document Understanding (2311.11810v4)
Abstract: This work presents DocPedia, a novel large multimodal model (LMM) for versatile OCR-free document understanding, capable of parsing images at resolutions up to 2,560$\times$2,560. Unlike existing approaches, which either struggle with high-resolution documents or forgo the large language model and thereby constrain their vision or language abilities, DocPedia processes visual input directly in the frequency domain rather than in pixel space. This characteristic enables DocPedia to capture more visual and textual information with a limited number of visual tokens. To consistently enhance both the perception and comprehension abilities of our model, we develop a dual-stage training strategy and enrich the instructions/annotations of all training tasks, covering multiple document types. Extensive quantitative and qualitative experiments on various publicly available benchmarks confirm the mutual benefit of jointly learning perception and comprehension tasks. The results provide further evidence of the effectiveness and superior performance of DocPedia over other methods.
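The abstract's core idea, trading pixel-space patches for frequency-domain coefficients so that a high-resolution page fits within a limited visual-token budget, can be illustrated with a JPEG-style blockwise DCT. The sketch below is an illustrative assumption, not DocPedia's actual encoder: the function name `dct_tokens`, the 8$\times$8 block size, and the low-frequency `keep`$\times$`keep` coefficient selection are hypothetical choices made for demonstration only.

```python
import numpy as np
from scipy.fft import dctn


def dct_tokens(image: np.ndarray, block: int = 8, keep: int = 4) -> np.ndarray:
    """Tile a grayscale image into block x block patches, apply a 2-D DCT to
    each patch, and keep only the top-left keep x keep (low-frequency)
    coefficients, yielding one compact "visual token" per patch."""
    h, w = image.shape
    h, w = h - h % block, w - w % block                # crop to block multiples
    tiles = image[:h, :w].reshape(h // block, block, w // block, block)
    tiles = tiles.transpose(0, 2, 1, 3)                # (rows, cols, block, block)
    coeffs = dctn(tiles, axes=(-2, -1), norm="ortho")  # per-tile 2-D DCT
    low = coeffs[..., :keep, :keep]                    # low-frequency corner only
    return low.reshape(-1, keep * keep)                # one row per tile


# A 2,560 x 2,560 page yields (2560/8)^2 = 102,400 tiles, each reduced from
# 64 pixel values to 16 DCT coefficients (a 4x compression per tile).
page = np.random.rand(2560, 2560).astype(np.float32)
tokens = dct_tokens(page)
print(tokens.shape)  # (102400, 16)
```

A real frequency-domain encoder would presumably aggregate or downsample these per-tile coefficients much further before handing them to the language model; the sketch only shows why discarding high-frequency content shrinks the representation of a high-resolution page.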
- Hao Feng
- Qi Liu
- Hao Liu
- Wengang Zhou
- Houqiang Li
- Can Huang
- Jingqun Tang