TextSquare: Leveraging a High-Quality, Large-scale Text-Centric Visual Question Answering Dataset for Enhanced Model Performance
Introduction
In the field of multimodal large language models (MLLMs), a significant performance gap persists between open-source models and state-of-the-art closed-source counterparts such as GPT4V and Gemini. This disparity has been attributed to differences in model architecture, training strategies, and, notably, the scale and quality of the instruction tuning datasets used during training. To address this gap, the paper introduces a systematic approach, dubbed "Square," for generating a large, high-quality text-centric visual question answering (VQA) dataset, termed Square-10M.
Dataset Construction: Square-10M
The Square-10M dataset is constructed through a novel four-step process of self-questioning, answering, reasoning, and evaluation, driven by sophisticated closed-source MLLMs (a minimal code sketch of this loop appears at the end of this section). This approach not only facilitates the creation of a large, comprehensive dataset but also ensures its high quality by:
- Generating contextually rich VQA pairs that are thoroughly evaluated for relevance and accuracy.
- Providing detailed reasoning that supports the answers, thus enhancing the dataset’s utility for training robust models.
- Employing rigorous filtering criteria during data evaluation to maintain high standards.
The dataset draws on a diverse collection of text-rich images from varied sources such as natural scenes, commerce, and academic documents, ensuring its broad applicability across different VQA scenarios.
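To make the four-stage loop concrete, the following is a minimal sketch of how self-questioning, answering, reasoning, and evaluation could be chained around a closed-source MLLM. The `query_mllm` helper, the prompt wording, and the data class are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal sketch of the four-stage Square pipeline. The `query_mllm` helper and
# prompt wording are placeholders standing in for a closed-source MLLM API;
# they are illustrative assumptions, not the paper's actual implementation.
from dataclasses import dataclass

@dataclass
class VQASample:
    image_path: str
    question: str
    answer: str
    reasoning: str

def query_mllm(image_path: str, prompt: str) -> str:
    """Placeholder for a call to a closed-source multimodal LLM."""
    raise NotImplementedError

def square_pipeline(image_path: str, num_questions: int = 3) -> list[VQASample]:
    samples = []
    # 1. Self-questioning: ask the MLLM to propose text-centric questions.
    questions = query_mllm(
        image_path,
        f"Propose {num_questions} questions that require reading the text in this image."
    ).splitlines()

    for question in questions[:num_questions]:
        # 2. Answering: answer each question conditioned on the image.
        answer = query_mllm(image_path, f"Answer based on the image text: {question}")
        # 3. Reasoning: elicit an explanation grounded in the image content.
        reasoning = query_mllm(
            image_path,
            f"Explain step by step why '{answer}' answers: {question}"
        )
        # 4. Evaluation: keep the pair only if the MLLM judges it correct and relevant.
        verdict = query_mllm(
            image_path,
            f"Is the answer '{answer}' to '{question}' correct and relevant? Reply yes or no."
        )
        if verdict.strip().lower().startswith("yes"):
            samples.append(VQASample(image_path, question, answer, reasoning))
    return samples
```

In the actual pipeline the evaluation stage applies stricter filtering criteria than this single yes/no check, as noted in the list above; the sketch only illustrates how the four stages feed into one another.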
TextSquare Model Performance
Employing the Square-10M dataset, a new model, TextSquare, was trained and benchmarked against both open-source and closed-source models. TextSquare exhibits superior performance across several metrics:
- Outperforms leading open-source models and performs comparably to or better than top-tier models such as GPT4V and Gemini across multiple benchmarks.
- Demonstrates significant advancements in VQA reasoning, showing improved contextual understanding and reduction in hallucinations due to the quality and scale of reasoning data within Square-10M.
- Verification across numerous benchmarks shows that scaling up the dataset size correlates directly with improved model performance and lower convergence loss.
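As a rough illustration of how such benchmark comparisons are typically run, the sketch below scores an arbitrary predict function over several test sets and reports the gap to baseline results. The file format, predict interface, and baseline dictionary are hypothetical; they do not reflect the paper's evaluation harness or reported numbers.

```python
# Illustrative sketch of benchmarking a VQA model across several test sets and
# comparing against baseline scores. The `predict` callable, file layout, and
# baseline numbers are hypothetical placeholders, not reported results.
import json
from typing import Callable

def evaluate(predict: Callable[[str, str], str], benchmark_file: str) -> float:
    """Exact-match accuracy over a JSONL file of {image, question, answer} records."""
    correct = total = 0
    with open(benchmark_file) as f:
        for line in f:
            ex = json.loads(line)
            pred = predict(ex["image"], ex["question"])
            correct += int(pred.strip().lower() == ex["answer"].strip().lower())
            total += 1
    return correct / max(total, 1)

def compare(predict: Callable[[str, str], str],
            benchmarks: dict[str, str],
            baselines: dict[str, float]) -> None:
    # Print each benchmark's accuracy and its delta against the baseline score.
    for name, path in benchmarks.items():
        acc = evaluate(predict, path)
        delta = acc - baselines.get(name, 0.0)
        print(f"{name}: {acc:.3f} ({delta:+.3f} vs. baseline)")
```

Note that many text-centric benchmarks use soft-match metrics such as ANLS rather than exact match, so the scoring function would be swapped accordingly.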
Theoretical and Practical Implications
The findings underscore the importance of both the volume and quality of training data in developing capable multimodal models. The Square method advances the generation and utilization of text-centric VQA datasets, with the following implications:
- Theoretical: Establishes a clear correlation between data scale, quality, and multimodal learning model efficacy, suggesting a potential threshold beyond which additional data yields diminishing returns (illustrated in the sketch after this list).
- Practical: Offers a robust framework for open-source communities to generate and utilize their own large-scale datasets to train models that can rival closed-source equivalents.
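One way to probe the diminishing-returns question quantitatively is to fit a saturating power law to (dataset size, benchmark score) pairs. The sketch below uses fabricated numbers purely to illustrate the procedure; they are not results from the paper.

```python
# Toy illustration of fitting a saturating power-law curve to (data scale, score)
# pairs to probe diminishing returns. The numbers below are made up for
# demonstration; they are not results from the paper.
import numpy as np
from scipy.optimize import curve_fit

def saturating_power_law(n, a, b, c):
    # Score approaches the asymptote `a` as dataset size `n` grows.
    return a - b * np.power(n, -c)

# Hypothetical (dataset size in millions, benchmark score) observations.
sizes = np.array([1.0, 2.5, 5.0, 7.5, 10.0])
scores = np.array([55.0, 60.0, 63.5, 65.0, 66.0])

params, _ = curve_fit(saturating_power_law, sizes, scores, p0=[70.0, 20.0, 0.5])
a, b, c = params
print(f"Estimated asymptote: {a:.1f}; predicted score at 20M examples: "
      f"{saturating_power_law(20.0, a, b, c):.1f}")
```

A larger fitted exponent c means the gap to the asymptote shrinks quickly, so returns diminish sooner; a small c suggests further data scaling is still worthwhile.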
Speculation on Future Developments
Looking ahead, the methodologies introduced by Square-10M could guide the development of even larger and more diverse datasets. There's potential for exploring automatic enhancements to the self-evaluation and reasoning components, making them more efficient and less reliant on closed-source models. Additionally, further refinement of the data collection and generation processes could enable more tailored datasets that address specific gaps in current model capabilities, potentially leading toward models that better understand complex, multimodal interactions in VQA scenarios.
Conclusion
The Square strategy for dataset creation marks a significant step toward bridging the performance gap between open-source and closed-source multimodal models, primarily through enhancements in data quality and scale. This approach not only aids in advancing current model capabilities but also sets a foundational framework for future research and development in the field of text-centric visual question answering.