Progressive Upward Multimodal Alignment in Mathematical Reasoning with Math-PUMA
This paper presents Math-PUMA (Progressive Upward Multimodal Alignment), a methodology for improving the performance of multimodal large language models (MLLMs) on mathematical problems that combine textual and visual information. The approach specifically targets the performance drop MLLMs exhibit when a problem shifts from a text-based to an image-based representation, a common issue attributed to their predominant training on natural scene images.
Methodology
The core proposal of Math-PUMA involves a three-stage training process designed to progressively align multimodal data representations and improve the mathematical reasoning of MLLMs:
- Stage 1: Text-based Training
- The initial phase strengthens the intrinsic mathematical reasoning abilities of the underlying LLM using a substantial dataset of text-based mathematical problems. This stage leverages the wide availability of text-based problem-solving data to build up the model's core problem-solving skills.
- Stage 2: Progressive Upward Multimodal Alignment
- This critical alignment phase constructs a multimodal dataset of 692K data pairs that vary in how information is distributed between text and images. Each pair presents the same problem content but in different modalities. Alignment is achieved by minimizing the Kullback-Leibler (KL) divergence between the next-token prediction distributions for the text-rich and vision-rich versions of a problem, encouraging the model to perform equally well regardless of which modality carries the information (see the loss sketch after this list).
- Stage 3: Multimodal Instruction Tuning
- The final stage fine-tunes the MLLM on 996K high-quality multimodal problem-solving examples, further refining its ability to handle problems presented in varying modalities.
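To make the Stage 2 objective concrete, the following is a minimal PyTorch sketch of how such an alignment loss could be implemented. It assumes the text-rich branch acts as a fixed teacher distribution and that both renderings decode the same ground-truth solution tokens; the function name, masking scheme, and loss weighting are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def stage2_alignment_loss(logits_text_rich, logits_vision_rich, labels,
                          kl_weight=1.0, ignore_index=-100):
    """Sketch of a Stage-2 objective: pull the vision-rich branch's
    next-token distribution toward the text-rich branch's, plus a
    standard cross-entropy term on the ground-truth solution tokens.

    Shapes: logits_* are (batch, seq_len, vocab); labels is
    (batch, seq_len) with ignore_index marking prompt/padding positions.
    """
    # Treat the stronger (text-rich) modality as a fixed teacher.
    teacher_probs = F.softmax(logits_text_rich.detach(), dim=-1)
    student_logp = F.log_softmax(logits_vision_rich, dim=-1)

    # Token-level KL(teacher || student), masked to answer tokens only.
    kl_per_token = F.kl_div(student_logp, teacher_probs,
                            reduction="none").sum(dim=-1)
    mask = (labels != ignore_index).float()
    kl_loss = (kl_per_token * mask).sum() / mask.sum().clamp_min(1.0)

    # Ordinary next-token cross-entropy keeps the student anchored
    # to the ground-truth solution.
    ce_loss = F.cross_entropy(
        logits_vision_rich.reshape(-1, logits_vision_rich.size(-1)),
        labels.reshape(-1), ignore_index=ignore_index)

    return ce_loss + kl_weight * kl_loss
```

In a setup like this, both forward passes would run through the same MLLM weights; only the split of information between the prompt text and the accompanying image differs between the two branches.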
Key Contributions
The paper highlights several significant contributions:
- Dataset Curation: Creation of the Math-PUMA-1M dataset, comprising 692K data pairs and 996K multimodal mathematical examples, a valuable resource for training MLLMs (an illustrative record layout follows this list).
- Alignment Methodology: Introduction of the Math-PUMA methodology, which leverages Progressive Upward Multimodal Alignment for improved mathematical reasoning.
- Performance Benchmarking: Comprehensive experiments on multiple benchmarks demonstrate that Math-PUMA-trained MLLMs outperform existing open-source models and effectively narrow the performance gap between textual and visual problem representations.
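For illustration, a single alignment pair from such a dataset might look like the following; the field names and example problem are hypothetical, not the released Math-PUMA-1M schema.

```python
# Hypothetical layout of one alignment pair; field names are illustrative
# and not taken from the released Math-PUMA-1M dataset.
alignment_pair = {
    "problem_id": "example-0001",
    # Text-rich rendering: the key quantities appear in the question text.
    "text_rich": {
        "question": "In triangle ABC, AB = AC = 5 and BC = 6. "
                    "Find the area of the triangle.",
        "image": None,
    },
    # Vision-rich rendering: the same quantities are moved into the figure.
    "vision_rich": {
        "question": "Find the area of the triangle shown in the figure.",
        "image": "figures/example-0001.png",  # side lengths labeled in the figure
    },
    # Both renderings share one ground-truth answer.
    "answer": "12",
}
```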
Experimental Validation
The experimental setup involves evaluations on three popular benchmarks: MathVerse, MathVista, and We-Math. Results indicate that Math-PUMA significantly outperforms many existing open-source MLLMs. Notably:
- MathVerse: Math-PUMA MLLMs achieve state-of-the-art results among open-source models, with a marked improvement (around 10%) over the previous best open-source model, Math-LLaVA, and performance competitive with GPT-4V.
- MathVista: Across categories like geometry problem solving, algebraic reasoning, and scientific reasoning, Math-PUMA consistently achieves superior accuracy.
- We-Math: The Math-PUMA-trained model attains high average scores and exhibits strong mathematical reasoning processes, even surpassing some closed-source models on these metrics.
Implications and Future Directions
The implications of this research are multi-faceted, impacting both theoretical frameworks and practical applications in AI. Practically, the improved capability of MLLMs to handle multimodal mathematical problems can enhance educational tools and automated tutoring systems. Theoretically, the Progressive Upward Multimodal Alignment approach provides a structured methodology to address the modality gap in multimodal learning tasks.
Future research directions may include further optimization of data augmentation techniques, development of more sophisticated automatic data generation mechanisms, and exploration of additional alignment strategies to further enhance the reasoning abilities of MLLMs. Ensuring broader applicability across diverse problem domains and closing the remaining gap to human-level performance also remain challenges worth addressing.
In summary, this paper makes a substantial advance in multimodal mathematical reasoning by introducing a novel, structured approach for aligning textual and visual data representations, significantly improving the overall performance of MLLMs.