TextSquare: Scaling up Text-Centric Visual Instruction Tuning (2404.12803v1)

Published 19 Apr 2024 in cs.CV and cs.LG

Abstract: Text-centric visual question answering (VQA) has made great strides with the development of Multimodal LLMs (MLLMs), yet open-source models still fall short of leading models like GPT4V and Gemini, partly due to a lack of extensive, high-quality instruction tuning data. To this end, we introduce a new approach for creating a massive, high-quality instruction-tuning dataset, Square-10M, which is generated using closed-source MLLMs. The data construction process, termed Square, consists of four steps: Self-Questioning, Answering, Reasoning, and Evaluation. Our experiments with Square-10M led to three key findings: 1) Our model, TextSquare, considerably surpasses previous open-source state-of-the-art text-centric MLLMs and sets a new standard on OCRBench (62.2%). It even outperforms top-tier models like GPT4V and Gemini in 6 of 10 text-centric benchmarks. 2) Additionally, we demonstrate the critical role of VQA reasoning data in offering comprehensive contextual insights for specific questions. This not only improves accuracy but also significantly mitigates hallucinations. Specifically, TextSquare scores an average of 75.1% across four general VQA and hallucination evaluation datasets, outperforming previous state-of-the-art models. 3) Notably, the phenomenon observed in scaling text-centric VQA datasets reveals a clear pattern: the exponential increase of instruction tuning data volume is directly proportional to the improvement in model performance, thereby validating the necessity of the dataset scale and the high quality of Square-10M.

TextSquare: Leveraging a High-Quality, Large-scale Text-Centric Visual Question Answering Dataset for Enhanced Model Performance

Introduction

In the field of multimodal LLMs (MLLMs), the performance gap between open-source models and state-of-the-art closed-source counterparts such as GPT4V and Gemini has been significant. This disparity has been attributed to differences in model architecture, training strategies, and notably, the scale and quality of instruction tuning datasets utilized during model training. Addressing this gap, the paper introduces a systematic approach, dubbed "Square," for generating a substantial and high-quality text-centric visual question answering (VQA) dataset, termed Square-10M.

Dataset Construction: Square-10M

The Square-10M dataset is constructed through a novel four-step process involving self-questioning, answering, reasoning, and evaluation, based on sophisticated closed-source MLLMs. This approach not only facilitates the creation of a large, comprehensive dataset but also ensures its high quality by:

  • Generating contextually rich VQA pairs that are deeply evaluated for relevance and accuracy.
  • Providing detailed reasoning that supports the answers, thus enhancing the dataset’s utility for training robust models.
  • Employing rigorous filtering criteria during data evaluation to maintain high standards.

A diverse collection of text-rich images from varied sources like natural scenes, commerce, and academic documents has been utilized, ensuring the dataset's broad applicability across different VQA scenarios.
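
To make the four-step loop concrete, the following minimal Python sketch shows how a Square-style generation pipeline could be organized. It is an illustration under assumptions, not the authors' released code: the query_mllm helper, the prompt wording, and the 1-5 scoring scheme are hypothetical placeholders standing in for calls to a closed-source MLLM.

```python
# Hypothetical sketch of the four-step Square pipeline: Self-Questioning,
# Answering, Reasoning, Evaluation. `query_mllm` is a placeholder for a call
# to a closed-source MLLM API and is NOT part of the paper's released code.
from dataclasses import dataclass


@dataclass
class VQASample:
    image_path: str
    question: str
    answer: str
    reasoning: str


def query_mllm(image_path: str, prompt: str) -> str:
    """Placeholder for an API call to a closed-source MLLM (e.g. GPT4V)."""
    raise NotImplementedError("plug in an MLLM client here")


def square_generate(image_path: str, min_score: int = 4) -> list[VQASample]:
    """Generate self-evaluated VQA samples for one text-rich image."""
    # 1) Self-Questioning: ask the model to propose questions about the image text.
    questions = query_mllm(
        image_path,
        "Propose diverse questions about the text in this image, one per line.",
    ).splitlines()

    samples = []
    for question in filter(None, (q.strip() for q in questions)):
        # 2) Answering: answer each proposed question.
        answer = query_mllm(image_path, f"Answer concisely: {question}")
        # 3) Reasoning: collect the contextual evidence behind the answer.
        reasoning = query_mllm(
            image_path, f"Explain step by step why '{answer}' answers '{question}'."
        )
        # 4) Evaluation: score the pair and keep only high-confidence samples.
        score_text = query_mllm(
            image_path,
            f"Rate 1-5 how correct and relevant this QA pair is.\nQ: {question}\nA: {answer}",
        )
        digits = [c for c in score_text if c.isdigit()]
        if digits and int(digits[0]) >= min_score:
            samples.append(VQASample(image_path, question, answer, reasoning))
    return samples
```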

TextSquare Model Performance

Using the Square-10M dataset, the authors trained a new model, TextSquare, and benchmarked it against both open-source and closed-source models. TextSquare exhibits superior performance across several metrics:

  • Outperforms leading open-source models and performs comparably to or better than top-tier models such as GPT4V and Gemini, surpassing them on 6 of 10 text-centric benchmarks.
  • Demonstrates significant advances in VQA reasoning, with improved contextual understanding and markedly fewer hallucinations, attributable to the quality and scale of the reasoning data in Square-10M.
  • Scaling experiments across numerous benchmarks show that performance improves roughly in proportion to the logarithm of the instruction-tuning data volume, with convergence loss decreasing as the dataset grows (a minimal curve-fitting sketch of this relationship follows this list).
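
As a rough illustration of this scaling behaviour, the sketch below fits a log-linear curve (accuracy versus the logarithm of data volume) with NumPy. The (data volume, accuracy) pairs are invented placeholders for demonstration only, not numbers reported in the paper.

```python
# Illustrative log-linear scaling fit: accuracy ~ a * log10(N) + b.
# The data points below are made-up placeholders, not results from the paper.
import numpy as np

data_volume = np.array([1e5, 1e6, 1e7])   # number of instruction-tuning samples
accuracy = np.array([0.50, 0.57, 0.63])   # hypothetical benchmark accuracy

# Least-squares fit of accuracy against log10(data volume).
a, b = np.polyfit(np.log10(data_volume), accuracy, deg=1)
print(f"accuracy ≈ {a:.3f} * log10(N) + {b:.3f}")

# Extrapolate (cautiously) to a larger dataset size, here 1e8 samples.
print("predicted accuracy at 1e8 samples:", a * 8 + b)
```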

Theoretical and Practical Implications

The findings underscore the significance of both the volume and quality of training data in developing competent multimodal models. The Square method significantly advances the generation and utilization of text-centric VQA datasets, which has the following implications:

  • Theoretical: Establishes a clear correlation between data scale, data quality, and multimodal model efficacy, while suggesting a potential threshold beyond which additional data yields diminishing returns.
  • Practical: Offers a robust framework for open-source communities to generate and utilize their own large-scale datasets to train models that can rival closed-source equivalents (a minimal serialization sketch follows this list).
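
As one way such generated data could be consumed by an open-source training stack, the sketch below serializes Square-style question/answer/reasoning triples into a conversation-style JSON file of the kind read by common LLaVA-style instruction-tuning pipelines. The sample record, file names, and field layout are assumptions for illustration, not the paper's release format.

```python
# Minimal sketch (assumed format, not the authors' release): serialize
# Square-style VQA samples into conversation-style instruction-tuning records.
import json

samples = [
    {
        "image": "receipt_0001.jpg",  # hypothetical file name
        "question": "What is the total amount on the receipt?",
        "answer": "$23.40",
        "reasoning": "The line labelled 'TOTAL' at the bottom reads $23.40.",
    },
]

records = []
for i, s in enumerate(samples):
    records.append(
        {
            "id": f"square_{i:08d}",
            "image": s["image"],
            "conversations": [
                {"from": "human", "value": f"<image>\n{s['question']}"},
                # Append the reasoning so the model learns to justify its answers.
                {"from": "gpt", "value": f"{s['answer']}\n\nReasoning: {s['reasoning']}"},
            ],
        }
    )

with open("square_instruction_tuning.json", "w", encoding="utf-8") as f:
    json.dump(records, f, ensure_ascii=False, indent=2)
```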

Speculation on Future Developments

Looking ahead, the methodologies introduced by Square-10M could guide the development of even larger and more diverse datasets. There's potential for exploring automatic enhancements to the self-evaluation and reasoning components, making them more efficient and less reliant on closed-source models. Additionally, further refinement of the data collection and generation processes could enable more tailored datasets that address specific gaps in current model capabilities, potentially leading toward models that better understand complex, multimodal interactions in VQA scenarios.

Conclusion

The Square strategy for dataset creation marks a significant step toward bridging the performance gap between open-source and closed-source multimodal models, primarily through enhancements in data quality and scale. This approach not only aids in advancing current model capabilities but also sets a foundational framework for future research and development in the field of text-centric visual question answering.

Authors (16)
  1. Jingqun Tang
  2. Chunhui Lin
  3. Zhen Zhao
  4. Shu Wei
  5. Binghong Wu
  6. Qi Liu
  7. Hao Feng
  8. Yang Li
  9. Siqi Wang
  10. Lei Liao
  11. Wei Shi
  12. Yuliang Liu
  13. Hao Liu
  14. Yuan Xie
  15. Xiang Bai
  16. Can Huang