InfiMM-WebMath-40B: Advancing Multimodal Pre-Training for Enhanced Mathematical Reasoning
This paper introduces InfiMM-WebMath-40B, a large-scale multimodal pre-training dataset designed to strengthen the mathematical reasoning capabilities of Multimodal Large Language Models (MLLMs). The dataset comprises roughly 24 million web pages, 85 million associated image URLs, and 40 billion text tokens, all sourced from CommonCrawl and filtered for mathematical content.
Dataset Construction Methodology
The dataset construction methodology integrates several systematic filtering and extraction processes to ensure the high quality and relevance of the dataset:
- Text Extraction and Initial Filtering: Text was extracted with Trafilatura and a modified version of Resiliparse to preserve mathematical content, including LaTeX equations and image URLs. An initial fastText pass, tuned for high recall, reduced the corpus from 57.2 billion to 9.5 billion web documents (see the fastText sketch after this list).
- Deduplication: Content- and URL-level deduplication, following the methodologies of RefinedWeb and FineWeb, removed redundant samples and reduced the dataset to 3.9 billion web pages (see the MinHash sketch below).
- Rule-based Filtering: Targeted rules, such as dropping documents with high punctuation ratios or NSFW content, improved content quality while retaining key mathematical elements (see the rule-filter sketch below).
- High-Precision Filtering: A second, high-precision fastText classifier was trained on quality labels produced by LLaMA3-70B-Instruct, yielding a final dataset of 24 million documents; the fastText sketch below covers both filtering passes.
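To make the two fastText passes concrete, here is a minimal sketch of how such a filter can be trained and applied. The file name, label names, and thresholds are illustrative assumptions, not the paper's settings.

```python
import fasttext

# Training file: one document per line, prefixed with a fastText label such as
# __label__math or __label__other (label names here are illustrative).
model = fasttext.train_supervised(input="math_filter_train.txt",
                                  epoch=5, wordNgrams=2)

def keep_document(doc: str, threshold: float) -> bool:
    # fastText scores a single line of text, so newlines must be flattened.
    labels, probs = model.predict(doc.replace("\n", " "))
    return labels[0] == "__label__math" and probs[0] >= threshold

# High-recall first pass: keep anything with even modest math probability.
# keep_document(text, threshold=0.1)
# High-precision final pass (classifier retrained on LLM-generated labels):
# keep_document(text, threshold=0.9)
```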
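The deduplication step can be approximated with MinHash-based locality-sensitive hashing in the spirit of the RefinedWeb and FineWeb pipelines. The sketch below uses the datasketch library; the shingle size, permutation count, and similarity threshold are illustrative choices.

```python
from datasketch import MinHash, MinHashLSH

def signature(text: str, num_perm: int = 128, shingle: int = 5) -> MinHash:
    # Hash overlapping word 5-grams so near-duplicates share most hashes.
    m = MinHash(num_perm=num_perm)
    words = text.split()
    for i in range(max(len(words) - shingle + 1, 1)):
        m.update(" ".join(words[i:i + shingle]).encode("utf-8"))
    return m

lsh = MinHashLSH(threshold=0.8, num_perm=128)  # Jaccard similarity cutoff

def is_near_duplicate(doc_id: str, text: str) -> bool:
    m = signature(text)
    if lsh.query(m):        # any previously seen document above the threshold?
        return True
    lsh.insert(doc_id, m)   # first sighting: index it and keep the document
    return False
```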
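A rule-based pass of this kind reduces to a handful of cheap checks per document. The sketch below shows the flavor of such rules; the specific thresholds and patterns are illustrative, not the paper's.

```python
import re

def passes_rules(doc: str) -> bool:
    if not doc.strip():
        return False
    # Drop punctuation-heavy boilerplate (navigation menus, tag clouds, etc.).
    punct_ratio = sum(ch in ",.;:!?" for ch in doc) / len(doc)
    if punct_ratio > 0.20:
        return False
    # Crude NSFW screen; production pipelines use curated word lists.
    if re.search(r"\b(nsfw|porn)\b", doc, re.IGNORECASE):
        return False
    # Retain short documents anyway if they carry mathematical markers,
    # so LaTeX-heavy pages survive the length rule.
    has_math = "$" in doc or "\\begin{" in doc or "\\frac" in doc
    return has_math or len(doc.split()) >= 50
```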
Evaluation and Model Architecture
The efficacy of InfiMM-WebMath-40B was tested on models that were continually pre-trained and then instruction fine-tuned on the dataset. The architecture pairs a SigLIP vision encoder for visual feature extraction with a Perceiver Resampler for modality alignment, connected to LLMs from the DeepSeek-Coder series.
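To ground the connector design, here is a minimal PyTorch sketch of a Perceiver Resampler: a fixed set of learned latent queries cross-attends to the vision encoder's patch features, compressing a variable-length patch sequence into a fixed visual-token budget for the LLM. The dimensions (1152 for a SigLIP encoder, 4096 for a 7B DeepSeek-Coder LLM), the 64-latent budget, and the depth are assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

class PerceiverResampler(nn.Module):
    def __init__(self, vis_dim=1152, llm_dim=4096,
                 num_latents=64, num_heads=8, depth=3):
        super().__init__()
        # Learned latent queries; their count fixes the visual tokens per image.
        self.latents = nn.Parameter(torch.randn(num_latents, vis_dim) * 0.02)
        self.layers = nn.ModuleList([
            nn.MultiheadAttention(vis_dim, num_heads, batch_first=True)
            for _ in range(depth)
        ])
        self.norms = nn.ModuleList([nn.LayerNorm(vis_dim) for _ in range(depth)])
        # Project the compressed visual tokens into the LLM's embedding space.
        self.proj = nn.Linear(vis_dim, llm_dim)

    def forward(self, vis_feats):  # vis_feats: (batch, num_patches, vis_dim)
        b = vis_feats.size(0)
        x = self.latents.unsqueeze(0).expand(b, -1, -1)
        for attn, norm in zip(self.layers, self.norms):
            # Latents query the image patches (cross-attention), with residual.
            out, _ = attn(query=x, key=vis_feats, value=vis_feats)
            x = norm(x + out)
        return self.proj(x)  # (batch, num_latents, llm_dim)

# Example: compress 729 patch features (e.g., a 27x27 grid) to 64 LLM tokens.
resampler = PerceiverResampler()
tokens = resampler(torch.randn(2, 729, 1152))
print(tokens.shape)  # torch.Size([2, 64, 4096])
```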
- Modality Alignment Stage: Alignment of image and text modalities was achieved using a subset of the DFN-2B dataset.
- Continued Pre-training Stage: This stage focused on injecting mathematical knowledge using the InfiMM-WebMath-40B dataset, keeping the vision encoder frozen while training the Perceiver Resampler and the LLM (see the freezing sketch after this list).
- Instruction Fine-tuning Stage: Diverse instruction datasets were used to familiarize the models with common chat templates and question-answer formats in mathematical contexts.
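A minimal sketch of the freezing schedule across these stages, under two assumptions: the module names (vision_encoder, resampler, llm) are hypothetical, and training only the resampler during modality alignment follows common practice rather than an explicit statement above.

```python
def configure_stage(model, stage: str) -> None:
    # The vision encoder stays frozen throughout all three stages.
    for p in model.vision_encoder.parameters():
        p.requires_grad = False
    # The Perceiver Resampler trains in every stage (assumed for alignment).
    for p in model.resampler.parameters():
        p.requires_grad = True
    # The LLM is updated only once mathematical pre-training begins.
    train_llm = stage in ("continued_pretrain", "instruction_finetune")
    for p in model.llm.parameters():
        p.requires_grad = train_llm
```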
Numerical Results and Comparative Performance
The paper reported substantial improvements in the mathematical reasoning capabilities of models trained on InfiMM-WebMath-40B, evaluated using benchmarks like MathVerse and We-Math:
- MathVerse: InfiMM-Math demonstrated strong performance across multiple categories. Built on DeepSeek-Coder-1.5-7B, it scored 34.5, approaching the leading proprietary model GPT-4V (39.4) while significantly surpassing open-source models such as LLaVA-NeXT and Maverick.
- We-Math: InfiMM-Math achieved an average score of 20.6, higher than many existing models, highlighting the robust multimodal understanding and reasoning enabled by the new dataset.
Implications and Future Directions
The introduction of InfiMM-WebMath-40B fills a critical gap in the open-source community by providing a comprehensive multimodal mathematical dataset for MLLMs. The performance improvements across benchmarks underscore the potential of multimodal pre-training for enhancing mathematical reasoning.
Future research could explore visual encoders better tailored to mathematical symbols and diagrams, which are essential for disciplines requiring multimodal understanding, as well as reinforcement learning techniques to further enhance reasoning. This work lays a foundation for advancing the mathematical understanding of AI models, paving the way for more nuanced and accurate problem solving in specialized scientific fields.