The paper "MathGLM-Vision: Solving Mathematical Problems with Multi-Modal LLM" introduces a novel approach to enhancing mathematical reasoning in multi-modal LLMs (MLLMs). The paper addresses the limitations of existing models which predominantly focus on solving geometric problems while neglecting the diversity and complexity of visual information required in other mathematical domains.
Summary of Key Contributions
- Introduction of the MathVL Dataset: The authors construct a fine-tuning dataset, MathVL, designed to improve the mathematical reasoning abilities of MLLMs. MathVL integrates textual and visual data across a diverse range of problems, combining open-source datasets with newly curated Chinese K-12 educational content, and extends coverage beyond the typical geometry focus to arithmetic, algebra, and statistics.
- Development of the MathGLM-Vision Series: By fine-tuning on MathVL, the authors introduce a series of models called MathGLM-Vision, built on backbones of different parameter scales (GLM-4V-9B, CogVLM2, and CogVLM-32B) to improve performance on complex mathematical problems with visual components.
- Evaluation and Results: The paper reports extensive evaluations across several public benchmarks alongside a newly created test set, MathVL-test, consisting of 2,000 problems. MathGLM-Vision shows marked improvements over existing models, with large relative gains on benchmarks such as MathVista-GPS; for instance, MathGLM-Vision-9B achieved a 39.68% relative improvement over its backbone model (a worked example of this relative-gain arithmetic follows this list).
- Role of Visual Information: A key insight from the experiments is the significance of visual inputs: integrating visual information substantially improves performance on mathematical reasoning tasks, and performance declines appreciably when visual inputs are removed.
- Discussion on Limitations and Challenges: The authors bring attention to three primary challenges with current MLLMs:
- Overemphasis on geometric problems.
- Limited dataset diversity hindering model adaptability.
- Lack of capability to process multiple image inputs simultaneously.
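To make the headline numbers concrete, the gains reported above are relative, not absolute. The sketch below shows the arithmetic with hypothetical scores (the paper's exact per-benchmark accuracies are not reproduced here; only the 39.68% relative figure comes from the paper):

```python
def relative_improvement(new_score: float, baseline_score: float) -> float:
    """Relative gain of a fine-tuned model over its backbone, as a percentage."""
    return (new_score - baseline_score) / baseline_score * 100

# Hypothetical accuracies on a benchmark such as MathVista-GPS:
# a backbone scoring 50.0 and a fine-tuned model scoring 69.84
# would yield the ~39.68% relative improvement the paper reports.
print(f"{relative_improvement(69.84, 50.0):.2f}%")  # -> 39.68%
```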
Detailed Observations
- Dataset Diversity: MathVL's coverage of diverse subjects and problem types underpins the broad applicability of the MathGLM-Vision models. It also captures the step-by-step reasoning that many existing datasets lack (an illustrative record is sketched after this list).
- Model Architecture: MathGLM-Vision models pair a general-purpose LLM backbone with a specialized vision encoder, connected through multi-modal integration layers, improving joint comprehension of visual and textual information (see the fusion sketch after this list).
- Experimental Setup: The evaluations include both closed-source and open-source competitors, providing robust validation of MathGLM-Vision's effectiveness across competitive benchmarks.
- Generalizability and Robustness: Mixing general visual question answering datasets into the fine-tuning data alongside MathVL ensures that MathGLM-Vision models are not merely math specialists but retain robust general vision-language understanding.
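To illustrate the step-by-step reasoning point, a MathVL-style training record might look like the following. All field names and values here are illustrative assumptions, not the dataset's actual schema:

```python
# Hypothetical example of a MathVL-style training record; every field
# name and value is illustrative, not the dataset's real schema.
sample = {
    "subject": "statistics",              # coverage beyond geometry
    "question": "Based on the bar chart, how many more books "
                "were sold in May than in April?",
    "images": ["chart_001.png"],          # visual context the model must read
    "solution_steps": [                   # the step-by-step reasoning chain
        "Step 1: Read the May bar: 120 books.",
        "Step 2: Read the April bar: 85 books.",
        "Step 3: Subtract: 120 - 85 = 35.",
    ],
    "answer": "35",
}
```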
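The fusion design described in the Model Architecture point follows the common vision-encoder-plus-projection pattern. Below is a minimal PyTorch sketch of that generic pattern; the dimensions, module names, and use of a single linear projector are assumptions for illustration, not the authors' actual architecture:

```python
import torch
import torch.nn as nn

class VisionLanguageFusion(nn.Module):
    """Generic MLLM fusion: project vision features into the LLM's
    embedding space and prepend them to the text token embeddings.
    A sketch of the common pattern, not MathGLM-Vision's real code."""

    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096,
                 vocab_size: int = 32000):
        super().__init__()
        self.vision_encoder = nn.Identity()              # stand-in for a pre-trained ViT
        self.projector = nn.Linear(vision_dim, llm_dim)  # maps image features to LLM space
        self.text_embed = nn.Embedding(vocab_size, llm_dim)

    def forward(self, image_feats: torch.Tensor,
                input_ids: torch.Tensor) -> torch.Tensor:
        # image_feats: (batch, num_patches, vision_dim); input_ids: (batch, seq_len)
        visual_tokens = self.projector(self.vision_encoder(image_feats))
        text_tokens = self.text_embed(input_ids)
        # The fused sequence would then be fed to the LLM backbone.
        return torch.cat([visual_tokens, text_tokens], dim=1)

fusion = VisionLanguageFusion()
fused = fusion(torch.randn(1, 256, 1024), torch.randint(0, 32000, (1, 32)))
print(fused.shape)  # torch.Size([1, 288, 4096])
```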
In conclusion, the paper presents a comprehensive approach to enhancing mathematical reasoning in MLLMs through a well-curated dataset and careful fine-tuning, addressing key shortcomings of current methods and raising the bar for multi-modal mathematical problem solving.