The paper "Math-LLaVA: Bootstrapping Mathematical Reasoning for Multimodal LLMs" addresses the challenge of enhancing multimodal mathematical reasoning capabilities in Multimodal LLMs (MLLMs). The authors propose a methodology centered on the development of a robust dataset, called MathV360K, and a fine-tuned model, Math-LLaVA, based on LLaVA-1.5.
Key Contributions:
- Dataset Construction:
  - Development of the MathV360K dataset, comprising 40,000 high-quality images with question-answer (QA) pairs selected from 24 existing datasets, supplemented by 320,000 newly synthesized QA pairs. The dataset spans subjects such as algebra, geometry, logic, and science, improving both the breadth and depth of multimodal mathematical questions.
  - Emphasis on increasing multimodal question diversity by mining the selected images to synthesize new QA pairs, augmenting the existing data with more complex, logically consistent, and rephrased questions (see the QA-synthesis sketch after this list).
- Model Development:
  - Introduction of Math-LLaVA, a model built on the LLaVA-1.5 architecture and fine-tuned on MathV360K, aiming to substantially extend the multimodal mathematical reasoning capabilities of its predecessor (see the fine-tuning sketch after this list).
  - The data augmentation is designed so that the synthesized QA pairs improve generalization rather than encourage overfitting to the source datasets.
- Empirical Results:
  - Math-LLaVA demonstrated a 19-point improvement over LLaVA-1.5 on the MathVista benchmark's minitest split, achieving performance comparable to closed-source models such as GPT-4V.
  - On MathVista, Math-LLaVA surpassed other open-source models and consistently outperformed LLaVA-1.5 across task categories such as Geometry Problem Solving (GPS); it also generalized better to the multidisciplinary MMMU benchmark.
- Approach to Data Challenges:
  - The authors highlight the critical role of data selection and synthesis, particularly for improving logical consistency and addressing issues such as semantic underspecification in questions. They fine-tuned a Vision Transformer (ViT) model to select images based on clarity and comprehension complexity, yielding a training set with a balanced difficulty distribution (see the image-selection sketch after this list).
- Future Directions:
  - The authors note that the dataset lacks intermediate problem-solving steps (rationales) and plan to add them in future work to further strengthen MLLMs' reasoning capabilities.
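
To make the image-selection idea concrete, a minimal Python sketch is given below: a fine-tuned ViT classifier scores each candidate image for comprehension difficulty, and images are then sampled per difficulty level so that the final training set is balanced. The checkpoint name, the number of difficulty levels, and the per-level quota are illustrative assumptions, not values or code from the paper.

```python
# Illustrative sketch of difficulty-balanced image selection (not the authors' code).
import random
from collections import defaultdict

import torch
from transformers import ViTForImageClassification, ViTImageProcessor

# Placeholder base checkpoint; the paper fine-tunes its own scorer for clarity/complexity.
CHECKPOINT = "google/vit-base-patch16-224"
processor = ViTImageProcessor.from_pretrained(CHECKPOINT)
model = ViTForImageClassification.from_pretrained(
    CHECKPOINT, num_labels=5, ignore_mismatched_sizes=True  # 5 difficulty levels (assumed)
)
model.eval()


def score_difficulty(image):
    """Predict a difficulty level (0 = easiest) for a PIL image."""
    inputs = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    return int(logits.argmax(dim=-1))


def balanced_selection(examples, per_level=8_000):
    """Bucket examples by predicted image difficulty and cap each bucket."""
    buckets = defaultdict(list)
    for example in examples:  # each example: {"image": PIL.Image, "question": str, ...}
        buckets[score_difficulty(example["image"])].append(example)
    selected = []
    for level in sorted(buckets):
        random.shuffle(buckets[level])
        selected.extend(buckets[level][:per_level])
    return selected
```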
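
The QA-pair synthesis step can likewise be sketched as a simple prompting loop: for each selected image and its seed question, an instruction-following model is asked for a more complex variant, a logically consistent variant, and a rephrasing. The prompt templates and the `generate` callable below are hypothetical stand-ins for whatever model and prompts the authors actually used.

```python
# Hypothetical sketch of QA-pair synthesis by prompting an (M)LLM; not the paper's prompts.
AUGMENTATION_PROMPTS = {
    "complex": (
        "Write a harder question about the same image that needs at least one extra "
        "reasoning step, then give its answer.\n"
        "Original question: {question}\nOriginal answer: {answer}"
    ),
    "logical": (
        "Write a question that probes the same visual facts from a different logical "
        "angle, then give its answer.\n"
        "Original question: {question}\nOriginal answer: {answer}"
    ),
    "rephrase": (
        "Paraphrase the question without changing its meaning or its answer.\n"
        "Original question: {question}\nOriginal answer: {answer}"
    ),
}


def synthesize_qa_pairs(seed_examples, generate):
    """Expand seed {"image", "question", "answer"} records with synthetic variants.

    `generate(prompt, image)` is an assumed callable wrapping an instruction-tuned
    multimodal model; it returns the generated text for the given prompt and image.
    """
    synthetic = []
    for example in seed_examples:
        for kind, template in AUGMENTATION_PROMPTS.items():
            prompt = template.format(question=example["question"], answer=example["answer"])
            generated = generate(prompt, example["image"])
            synthetic.append(
                {
                    "image": example["image"],
                    "generated": generated,  # new question (and answer, for non-rephrase kinds)
                    "augmentation": kind,
                }
            )
    return synthetic
```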
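
The fine-tuning itself is supervised instruction tuning of LLaVA-1.5 on the resulting image-question-answer triples. The sketch below uses the community llava-hf port in Hugging Face transformers rather than the authors' original training code, keeps full-sequence labels, and omits the usual engineering (prompt masking, LoRA or DeepSpeed, batching, device placement); it is an assumption-laden illustration, not the paper's recipe.

```python
# Minimal supervised fine-tuning loop for a LLaVA-1.5 port (illustrative only).
import torch
from transformers import AutoProcessor, LlavaForConditionalGeneration

MODEL_ID = "llava-hf/llava-1.5-13b-hf"  # community port; the paper fine-tunes the original LLaVA-1.5
processor = AutoProcessor.from_pretrained(MODEL_ID)
model = LlavaForConditionalGeneration.from_pretrained(MODEL_ID, torch_dtype=torch.bfloat16)


def build_inputs(example):
    """Turn one {"image", "question", "answer"} record into model inputs with labels."""
    prompt = f"USER: <image>\n{example['question']} ASSISTANT: {example['answer']}"
    inputs = processor(images=example["image"], text=prompt, return_tensors="pt")
    # Full-sequence labels for brevity; a real run would mask the prompt tokens.
    inputs["labels"] = inputs["input_ids"].clone()
    return inputs


optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
model.train()
for example in mathv360k:  # assumed iterable of MathV360K records
    batch = build_inputs(example)
    loss = model(**batch).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```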
Overall, the paper presents a thorough study of strengthening multimodal mathematical reasoning in MLLMs, showing the effectiveness of a diverse, well-curated dataset combined with careful fine-tuning. The approach underscores the value of selected and synthesized multimodal data for improving reasoning beyond text-only processing.