InfiMM-WebMath-40B: Advancing Multimodal Pre-Training for Enhanced Mathematical Reasoning (2409.12568v1)

Published 19 Sep 2024 in cs.CV and cs.MM

Abstract: Pre-training on large-scale, high-quality datasets is crucial for enhancing the reasoning capabilities of LLMs, especially in specialized domains such as mathematics. Despite the recognized importance, the Multimodal LLMs (MLLMs) field currently lacks a comprehensive open-source pre-training dataset specifically designed for mathematical reasoning. To address this gap, we introduce InfiMM-WebMath-40B, a high-quality dataset of interleaved image-text documents. It comprises 24 million web pages, 85 million associated image URLs, and 40 billion text tokens, all meticulously extracted and filtered from CommonCrawl. We provide a detailed overview of our data collection and processing pipeline. To demonstrate the robustness of InfiMM-WebMath-40B, we conducted evaluations in both text-only and multimodal settings. Our evaluations on text-only benchmarks show that, despite utilizing only 40 billion tokens, our dataset significantly enhances the performance of our 1.3B model, delivering results comparable to DeepSeekMath-1.3B, which uses 120 billion tokens for the same model size. Nevertheless, with the introduction of our multi-modal math pre-training dataset, our models set a new state-of-the-art among open-source models on multi-modal math benchmarks such as MathVerse and We-Math. We release our data at https://huggingface.co/datasets/Infi-MM/InfiMM-WebMath-40B.

InfiMM-WebMath-40B: Advancing Multimodal Pre-Training for Enhanced Mathematical Reasoning

The paper introduces InfiMM-WebMath-40B, a novel large-scale multimodal pre-training dataset specifically designed to bolster mathematical reasoning capabilities in Multimodal LLMs (MLLMs). The dataset comprises approximately 24 million web pages, 85 million associated image URLs, and 40 billion text tokens derived from CommonCrawl, with a specific focus on mathematical content.
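
The dataset is publicly released on Hugging Face (linked in the abstract). As a quick orientation, below is a minimal sketch of streaming a few records with the Hugging Face datasets library; the split name and the record fields accessed ("text", "image_urls") are assumptions rather than a documented schema.

```python
# Minimal sketch: streaming a few records from InfiMM-WebMath-40B.
# The split name and field names below are illustrative guesses, not a documented schema.
from datasets import load_dataset

# Streaming avoids downloading the full ~40B-token corpus up front.
ds = load_dataset("Infi-MM/InfiMM-WebMath-40B", split="train", streaming=True)

for i, record in enumerate(ds):
    # Inspect whatever keys the release actually provides.
    print(sorted(record.keys()))
    print(str(record.get("text", ""))[:200])      # assumed interleaved-text field
    print(record.get("image_urls", [])[:3])       # assumed image-URL field
    if i >= 2:
        break
```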

Dataset Construction Methodology

The dataset construction methodology integrates several systematic filtering and extraction processes to ensure the high quality and relevance of the dataset:

  1. Text Extraction and Initial Filtering: Text was extracted using Trafilatura and a modified version of Resiliparse to preserve mathematical content, including LaTeX equations and image URLs. Initial filtering with fastText reduced the dataset size from 57.2 billion to 9.5 billion web documents, emphasizing high recall.
  2. Deduplication: Content and URL deduplication strategies, inspired by methodologies from RefinedWeb and FineWeb, reduced redundant samples, further refining the dataset to 3.9 billion web pages.
  3. Rule-based Filtering: Targeted rules, such as removing documents with high punctuation ratios or NSFW content, enhanced content quality while retaining key mathematical elements.
  4. High-Precision Filtering: A high-precision fastText classifier was trained using LLaMA3-70B-Instruct to evaluate the mathematical quality of samples, culminating in a refined dataset of 24 million documents; a simplified sketch of the filtering logic follows this list.
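
To make the filtering stages concrete, the sketch below combines a recall-oriented fastText relevance check with a few rule-based heuristics, in the spirit of steps 1, 3, and 4. The model path, label names, thresholds, and blocklist are illustrative placeholders, not the configuration used by the authors.

```python
# Illustrative sketch of the math-relevance filtering stage (not the authors' code).
# Assumes a fastText classifier trained offline with labels "__label__math" /
# "__label__other"; thresholds and rules below are placeholder values.
import fasttext

MODEL_PATH = "math_classifier.bin"           # hypothetical model file
RECALL_THRESHOLD = 0.3                       # permissive threshold for the high-recall pass
PRECISION_THRESHOLD = 0.8                    # stricter threshold for the final pass
NSFW_TERMS = {"nsfw_term_1", "nsfw_term_2"}  # placeholder blocklist

model = fasttext.load_model(MODEL_PATH)

def passes_rules(text: str) -> bool:
    """Cheap rule-based checks: punctuation ratio and blocklisted terms."""
    if not text:
        return False
    punct_ratio = sum(ch in ".,;:!?" for ch in text) / len(text)
    if punct_ratio > 0.3:                    # drop documents with an unusually high punctuation ratio
        return False
    lowered = text.lower()
    return not any(term in lowered for term in NSFW_TERMS)

def math_score(text: str) -> float:
    """Probability that a document is math-related, per the fastText model."""
    labels, probs = model.predict(text.replace("\n", " "), k=1)
    return probs[0] if labels[0] == "__label__math" else 1.0 - probs[0]

def keep_document(text: str, high_precision: bool = False) -> bool:
    """Recall-oriented pass first; the high-precision pass reuses a stricter threshold."""
    threshold = PRECISION_THRESHOLD if high_precision else RECALL_THRESHOLD
    return passes_rules(text) and math_score(text) >= threshold
```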

Evaluation and Model Architecture

The efficacy of InfiMM-WebMath-40B was tested on models that were continually pre-trained and then instruction fine-tuned on the dataset. The architecture employs the SigLIP model for visual feature extraction and a Perceiver Resampler for modality alignment, integrated with LLMs from the DeepSeek-Coder series.

  • Modality Alignment Stage: Alignment of image and text modalities was achieved using a subset of the DFN-2B dataset.
  • Continued Pre-training Stage: The focus was on integrating mathematical knowledge using the InfiMM-WebMath-40B dataset, keeping the vision encoder frozen while training the Perceiver Resampler and the LLM (a sketch of this freezing scheme follows the list).
  • Instruction Fine-tuning Stage: Diverse instruction datasets were utilized to acclimate models to common chat templates and question-answer formats in mathematical contexts.
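
As a rough illustration of the continued pre-training recipe, the following PyTorch sketch wires toy stand-ins for the vision encoder, Perceiver Resampler, and LLM, freezes only the vision encoder, and runs a single training step. The module definitions, dimensions, and loss are placeholders, not the paper's actual implementation.

```python
# Toy PyTorch sketch of the continued pre-training freezing scheme:
# the vision encoder is frozen, while the resampler and LLM receive gradients.
# All modules are stand-ins for SigLIP, the Perceiver Resampler, and DeepSeek-Coder.
import torch
import torch.nn as nn

class TinyVisionEncoder(nn.Module):          # stand-in for SigLIP
    def __init__(self, dim=64):
        super().__init__()
        self.proj = nn.Linear(3 * 16 * 16, dim)
    def forward(self, patches):
        return self.proj(patches)

class TinyResampler(nn.Module):              # stand-in for the Perceiver Resampler
    def __init__(self, dim=64, n_queries=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(n_queries, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
    def forward(self, visual_feats):
        q = self.queries.unsqueeze(0).expand(visual_feats.size(0), -1, -1)
        out, _ = self.attn(q, visual_feats, visual_feats)
        return out

class TinyLLM(nn.Module):                    # stand-in for the DeepSeek-Coder LLM
    def __init__(self, dim=64, vocab=100):
        super().__init__()
        self.backbone = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.lm_head = nn.Linear(dim, vocab)
    def forward(self, embeddings):
        return self.lm_head(self.backbone(embeddings))

vision, resampler, llm = TinyVisionEncoder(), TinyResampler(), TinyLLM()

# Freeze the vision encoder; train only the resampler and the LLM.
for p in vision.parameters():
    p.requires_grad = False
trainable = list(resampler.parameters()) + list(llm.parameters())
optimizer = torch.optim.AdamW(trainable, lr=1e-4)

# One dummy step showing the flow: image patches -> visual tokens -> LLM logits.
patches = torch.randn(2, 5, 3 * 16 * 16)     # (batch, patches, flattened pixels)
with torch.no_grad():                        # frozen encoder needs no gradients
    feats = vision(patches)
visual_tokens = resampler(feats)
logits = llm(visual_tokens)
loss = logits.mean()                         # placeholder objective
loss.backward()
optimizer.step()
```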

Numerical Results and Comparative Performance

The paper reported substantial improvements in the mathematical reasoning capabilities of models trained on InfiMM-WebMath-40B, evaluated using benchmarks like MathVerse and We-Math:

  • MathVerse: InfiMM-Math models demonstrated superior performance across multiple categories, outperforming many open-source and proprietary models. The variant built on DeepSeek-Coder-1.5-7B achieved a notable score of 34.5, trailing the leading proprietary model GPT-4V (39.4) but significantly surpassing other open-source models such as LLaVA-NeXT and Maverick.
  • We-Math: InfiMM-Math models exhibited strong results (an average score of 20.6), higher than many existing models, highlighting the robust multimodal understanding and reasoning enabled by the new dataset.

Implications and Future Directions

The introduction of InfiMM-WebMath-40B addresses a critical void in the open-source community by providing a comprehensive multimodal mathematical dataset for MLLMs. The improvements in performance across benchmarks underscore the potential of multimodal pre-training for enhancing mathematical reasoning.

Future research could explore integrating more sophisticated visual encoders tailored to mathematical symbols and diagrams, which are essential for disciplines requiring multimodal understanding. Reinforcement learning techniques could also be applied to further enhance reasoning capabilities. The developments from this work lay a foundation for advancing the mathematical understanding of AI models, paving the way for more nuanced and accurate problem-solving in specialized scientific fields.

Authors (11)
  1. Xiaotian Han (46 papers)
  2. Yiren Jian (11 papers)
  3. Xuefeng Hu (11 papers)
  4. Haogeng Liu (8 papers)
  5. Yiqi Wang (39 papers)
  6. Qihang Fan (13 papers)
  7. Yuang Ai (7 papers)
  8. Huaibo Huang (58 papers)
  9. Ran He (172 papers)
  10. Zhenheng Yang (30 papers)
  11. Quanzeng You (41 papers)