MathCoder-VL: Bridging Vision and Code for Enhanced Multimodal Mathematical Reasoning (2505.10557v1)

Published 15 May 2025 in cs.CV, cs.AI, and cs.CL

Abstract: Natural language image-caption datasets, widely used for training Large Multimodal Models, mainly focus on natural scenarios and overlook the intricate details of mathematical figures that are critical for problem-solving, hindering the advancement of current LMMs in multimodal mathematical reasoning. To this end, we propose leveraging code as supervision for cross-modal alignment, since code inherently encodes all information needed to generate corresponding figures, establishing a precise connection between the two modalities. Specifically, we co-develop our image-to-code model and dataset with model-in-the-loop approach, resulting in an image-to-code model, FigCodifier and ImgCode-8.6M dataset, the largest image-code dataset to date. Furthermore, we utilize FigCodifier to synthesize novel mathematical figures and then construct MM-MathInstruct-3M, a high-quality multimodal math instruction fine-tuning dataset. Finally, we present MathCoder-VL, trained with ImgCode-8.6M for cross-modal alignment and subsequently fine-tuned on MM-MathInstruct-3M for multimodal math problem solving. Our model achieves a new open-source SOTA across all six metrics. Notably, it surpasses GPT-4o and Claude 3.5 Sonnet in the geometry problem-solving subset of MathVista, achieving improvements of 8.9% and 9.2%. The dataset and models will be released at https://github.com/mathLLM/MathCoder.

Summary

  • The paper introduces an innovative image-to-code method that bridges visual math figures and code for precise cross-modal alignment.
  • It details a two-stage training process with ImgCode-8.6M and MM-MathInstruct-3M to iteratively refine math reasoning on complex visual tasks.
  • The approach significantly boosts geometry problem solving and achieves state-of-the-art performance on key math benchmarks.

This paper, "MathCoder-VL: Bridging Vision and Code for Enhanced Multimodal Mathematical Reasoning" (2505.10557), introduces a novel approach to improving the multimodal mathematical reasoning capabilities of Large Multimodal Models (LMMs). The core idea is to leverage the precise relationship between mathematical figures and the code used to generate them (like TikZ or Python) for better cross-modal alignment and data synthesis.

The authors observe that existing image-caption datasets often lack the detailed, accurate information in mathematical figures that is crucial for problem-solving. Their solution is to use code as a ground truth for the visual information.
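
To make this concrete, here is a toy Matplotlib example (not from the paper): every geometric fact in the rendered figure (vertex coordinates, side labels, the right-angle mark) is stated explicitly in the code, which is the property the authors exploit for precise cross-modal alignment.

```python
import matplotlib.pyplot as plt

# A right triangle with legs 3 and 4: every geometric fact is explicit in
# the code, so the rendered figure carries no information absent from its source.
fig, ax = plt.subplots(figsize=(4, 4))
vertices = [(0, 0), (4, 0), (0, 3), (0, 0)]
xs, ys = zip(*vertices)
ax.plot(xs, ys, color="black")
ax.text(2, -0.3, "4", ha="center")                     # base
ax.text(-0.3, 1.5, "3", va="center")                   # height
ax.text(2.2, 1.7, "5", ha="left")                      # hypotenuse
ax.plot([0.3, 0.3, 0], [0, 0.3, 0.3], color="black")   # right-angle mark
ax.set_aspect("equal")
ax.axis("off")
fig.savefig("triangle.png", dpi=150)
```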

Key Components and Implementation:

  1. Image-to-Code Model and Dataset (FigCodifier and ImgCode-8.6M):
    • Problem: Existing datasets for image-to-code (specifically for math graphics) are small, and even commercial models struggle with this task.
    • Approach: The authors propose an iterative, model-in-the-loop approach to co-develop an image-to-code model (FigCodifier) and a large-scale dataset (ImgCode-8.6M).
    • Data Collection: They started by collecting 3 million math-related images from various sources:
      • DaTikZ [belouadi2024automatikz]: 119K image-TikZ pairs as seed data.
      • K12 Problem-Solving Dataset: 1.57 million images from 4.6 million problems across 19 subjects. Equation images were converted to LaTeX text using MinerU [wang2024mineru] so that problems containing only equation images could be filtered out.
      • Mathematical Textbooks: Images extracted from 8K PDF textbooks (202K images).
      • Images and TikZ code from recent arXiv papers (45K with code, 681K without code).
      • Open-Source Datasets: MathV360K [shi2024mathllava] and MultiMath [peng2024multimath].
    • Iterative Synthesis: An initial image-to-code model (based on InternVL) was trained on the seed data. This model was then used to generate code (TikZ or Python) for the collected images, and images were rendered from the generated code. Only successfully rendered ⟨image, code⟩ pairs were added to the dataset. This process was repeated iteratively to refine the model and grow the dataset (a minimal sketch of this loop appears after this list).
    • Code Conversion: They used GPT-4o mini to translate TikZ code into Python code (primarily using Matplotlib) to diversify the code types in the dataset. This yielded 3.1 million image-Python pairs.
    • Data Cleaning: A rigorous process was applied to ensure data quality, including validating code executability, removing duplicates, filtering low-quality images (e.g., nearly blank, random lines, black squares), and removing code accessing external files.
    • Result: The final dataset, ImgCode-8.6M, contains 4.3M image-TikZ pairs and 4.3M image-Python pairs. The final model, FigCodifier, is based on InternVL2-8B [chen2024internvl2].
  2. Math Instruction Fine-tuning Data (MM-MathInstruct-3M):
    • Problem: Multimodal math datasets lack diverse, newly synthesized images.
    • Approach: Leverage FigCodifier to generate diverse new images and then synthesize math problems based on them.
    • K12-2M Dataset: Structured 2 million K12 math problems (from the collected data) with images and simple solutions. Used GPT-4o mini to convert simple solutions into detailed, step-by-step Chain-of-Thought (CoT) solutions.
    • Synthetic Data with New Images:
      • Used FigCodifier with a higher temperature (0.7) on 1.57 million raw images from K12-2M to generate code that renders new, diverse images.
      • Used Qwen2.5-72B-Instruct [qwen2_5] to generate math reasoning questions based on the newly synthesized images and their corresponding code.
      • Used Qwen2.5-Math-72B-Instruct [yang2024qwen25math] and Qwen2.5-72B-Instruct to generate step-by-step solutions. Solutions were kept only if both models produced consistent answers (see the consistency-check sketch after this list).
    • Result: MM-MathInstruct-3M combines the K12-2M dataset (2M samples) with 1M new synthetic samples featuring diverse, newly generated images.
  3. MathCoder-VL Training:
    • Base Models: InternVL-Chat-2B-V1-5 [gao2024internvlmini] and InternVL2-8B [chen2024internvl2].
    • Stage 1: Image-to-Code Mid-training: Trained the model on ImgCode-8.6M. The vision encoder and MLP projector were trainable, while the LLM backbone was frozen (a parameter-freezing sketch follows this list). This enhances the vision encoder's ability to capture math-specific visual features.
    • Stage 2: Math Instruction Fine-tuning: Fine-tuned the entire model on MM-MathInstruct-3M. This improves multimodal math reasoning using the high-quality, diverse instruction data.
    • Implementation Details: Trained for one epoch per stage, using DeepSpeed ZeRO-1 [rajbhandari2020zero] and FlashAttention [dao2022flashattention].
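
As a concrete, hypothetical illustration of the model-in-the-loop procedure in item 1, the sketch below stubs out the model call, the renderer, and the retraining step as placeholder callables; none of the function names come from the paper.

```python
from __future__ import annotations
from typing import Callable

def grow_image_code_dataset(
    images: list[str],
    generate_code: Callable[[str], str],              # hypothetical: the current FigCodifier checkpoint
    render: Callable[[str], str | None],              # hypothetical: executes code, returns image path or None
    retrain: Callable[[list[tuple[str, str]]], Callable[[str], str]],  # hypothetical training step
    rounds: int = 3,
) -> list[tuple[str, str]]:
    """Model-in-the-loop growth of an <image, code> dataset: generate code
    for every collected image, keep only pairs whose code renders
    successfully, then retrain the image-to-code model on the result."""
    dataset: list[tuple[str, str]] = []
    for _ in range(rounds):
        for image_path in images:
            code = generate_code(image_path)
            rendered = render(code)               # None when the code fails to execute
            if rendered is not None:
                dataset.append((rendered, code))  # pair the re-rendered image with its code
        generate_code = retrain(dataset)          # update the model for the next round
    return dataset
```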
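
The answer-consistency filter from item 2 can be sketched as follows; extract_final_answer is a hypothetical helper (the paper does not specify how final answers are parsed), and the two solver callables stand in for Qwen2.5-Math-72B-Instruct and Qwen2.5-72B-Instruct.

```python
from typing import Callable, Optional

def extract_final_answer(solution: str) -> str:
    """Hypothetical parser: treat the last non-empty line as the final answer."""
    lines = [ln.strip() for ln in solution.splitlines() if ln.strip()]
    return lines[-1] if lines else ""

def keep_if_consistent(
    question: str,
    solve_a: Callable[[str], str],   # stands in for Qwen2.5-Math-72B-Instruct
    solve_b: Callable[[str], str],   # stands in for Qwen2.5-72B-Instruct
) -> Optional[dict]:
    """Keep a synthesized problem only when both solvers agree on the answer."""
    sol_a, sol_b = solve_a(question), solve_b(question)
    if extract_final_answer(sol_a) == extract_final_answer(sol_b):
        return {"question": question, "solution": sol_a}
    return None  # inconsistent answers: discard the sample
```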
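
A minimal PyTorch sketch of the two-stage parameter schedule in item 3, assuming a generic LMM split into a vision encoder, an MLP projector, and an LLM backbone (the attribute names are illustrative, not InternVL's actual module names):

```python
import torch.nn as nn

class ToyLMM(nn.Module):
    """Stand-in for an LMM with vision encoder, projector, and LLM submodules."""
    def __init__(self) -> None:
        super().__init__()
        self.vision_encoder = nn.Linear(16, 8)
        self.projector = nn.Linear(8, 8)
        self.llm = nn.Linear(8, 4)

def set_stage(model: ToyLMM, stage: int) -> None:
    """Stage 1: train vision encoder + MLP projector with the LLM frozen.
    Stage 2: unfreeze everything for instruction fine-tuning."""
    for p in model.parameters():
        p.requires_grad = True
    if stage == 1:
        for p in model.llm.parameters():
            p.requires_grad = False

model = ToyLMM()
set_stage(model, stage=1)   # image-to-code mid-training: LLM backbone frozen
set_stage(model, stage=2)   # math instruction fine-tuning: all parameters trainable
```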

Practical Applications and Performance:

MathCoder-VL demonstrates strong capabilities in multimodal mathematical reasoning, particularly excelling in tasks requiring detailed visual understanding like geometry.

  • Enhanced Cross-Modal Alignment: By training on image-code pairs, the model learns a precise alignment between visual elements in mathematical figures and their underlying structural description in code. This is crucial for accurately interpreting diagrams and extracting necessary information for problem-solving.
  • Improved Math Problem Solving: Fine-tuning on MM-MathInstruct-3M, which includes synthetic problems with diverse images and step-by-step solutions, enhances the model's reasoning abilities, especially for multi-step problems and complex visual inputs.
  • Geometry Expertise: The model shows outstanding performance in geometry problems, outperforming state-of-the-art models like GPT-4o on specific geometry benchmarks (e.g., MATH-Vision plane geometry subsets). This highlights the effectiveness of the image-to-code approach in teaching the model about spatial relationships and geometric properties.
  • State-of-the-Art Performance (Open-Source): MathCoder-VL models (2B and 8B) achieve SOTA results among open-source LMMs on various math benchmarks like MATH-Vision, MathVerse, MathVista (GPS), and GAOKAO-MM.
  • Competitive with Closed-Source Models: The 8B model is competitive with and sometimes surpasses large closed-source models like GPT-4V, GPT-4-turbo, Claude 3.5 Sonnet, and GPT-4o on specific math tasks, particularly in geometry.

Implementation Considerations:

  • Data Requirements: Building ImgCode-8.6M and MM-MathInstruct-3M requires significant data collection from diverse sources (DaTikZ, K12 problems, textbooks, arXiv, open datasets). Processing raw data (e.g., converting equation images to LaTeX) and implementing robust data cleaning pipelines are critical.
  • Computational Resources: The iterative training of the image-to-code model and the two-stage training of MathCoder-VL are computationally intensive, requiring substantial GPU resources (32 and 64 A800 80GB GPUs for 2B and 8B models respectively).
  • Code Generation and Rendering: Relying on external tools/models (GPT-4o mini) for code translation and requiring executable code for image rendering introduces dependencies and potential points of failure in the data generation pipeline. A robust validation system for generated code is essential (a minimal validation sketch follows this list).
  • Model Architecture: The method builds upon existing LMM architectures (InternVL). Implementing this requires integrating the vision encoder, projector, and LLM components and managing their training stages (freezing/unfreezing layers).
  • Synthetic Data Quality: The quality of synthetic problems depends heavily on the capabilities of the LLMs used for question and solution generation and the consistency filtering mechanism.
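
As one example of such validation (a sketch under assumptions, not the authors' pipeline), the snippet below executes a generated Matplotlib program in a separate interpreter with a timeout and rejects outputs that fail to run or render to a nearly blank image. It assumes the generated code leaves its figure open so that an appended savefig call captures it.

```python
import subprocess
import sys
import tempfile
from pathlib import Path

import numpy as np
from PIL import Image

def render_and_validate(code: str, out_png: str, timeout_s: int = 20) -> bool:
    """Run generated figure code in a separate interpreter; keep it only if it
    executes within the timeout and produces a non-blank image."""
    # Append a savefig call; assumes the snippet leaves its current figure open.
    script = code + f"\nimport matplotlib.pyplot as plt\nplt.savefig({out_png!r})\n"
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(script)
        script_path = f.name
    try:
        subprocess.run([sys.executable, script_path], timeout=timeout_s,
                       check=True, capture_output=True)
    except (subprocess.TimeoutExpired, subprocess.CalledProcessError):
        return False  # code did not execute cleanly
    if not Path(out_png).exists():
        return False
    pixels = np.asarray(Image.open(out_png).convert("L"))
    return pixels.std() > 5  # reject nearly blank renders
```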

Potential Limitations and Future Directions:

  • The current dataset (MM-MathInstruct-3M) is limited to mathematics and primarily English content. Expanding to other STEM subjects (physics, chemistry) and languages would broaden its applicability.
  • Training larger models could potentially yield further performance improvements.
  • Exploring advanced post-training methods like reinforcement learning could further fine-tune the model's mathematical reasoning capabilities.

Overall, MathCoder-VL presents a practical framework for enhancing LMMs' mathematical reasoning through leveraging code-based data generation and a two-stage training process, demonstrating significant improvements, especially in visually-intensive math tasks like geometry. The approach highlights the value of structured data (like code) over natural language for precise cross-modal alignment in specialized domains.
