MAVIS: Mathematical Visual Instruction Tuning
The paper "MAVIS: Mathematical Visual Instruction Tuning" introduces an approach for improving the visual mathematical problem-solving capabilities of Multi-modal Large Language Models (MLLMs). Despite their impressive performance across many domains, MLLMs remain notably weak at interpreting and reasoning about mathematical content presented visually. This work addresses that gap with MAVIS, a comprehensive training paradigm designed specifically for visual mathematical scenarios.
Key Contributions
The authors identify three critical weaknesses that limit current MLLMs in visual mathematics: visual encoding of mathematical diagrams, alignment of diagrams with language, and accurate mathematical reasoning. To address them, the MAVIS framework contributes both new datasets and a progressive three-stage training pipeline.
- Datasets:
- MAVIS-Caption: This dataset contains 588K diagram-caption pairs covering diverse mathematical topics such as plane geometry, analytic geometry, and functions. It is used to train a math-specific vision encoder, strengthening the visual encoding capabilities of MLLMs.
- MAVIS-Instruct: Comprising 834K visual math problems, this dataset provides structured problems with annotated chain-of-thought (CoT) rationales aimed at refining reasoning skills. It draws from multiple sources and reduces textual redundancy to emphasize visual elements, covering broad mathematical domains.
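To make the dataset description concrete, the snippet below shows a hypothetical shape for a single MAVIS-Instruct-style training example. The field names and values are illustrative assumptions, not the paper's actual schema.

```python
# Hypothetical structure of one visual math instruction example.
# Field names and contents are illustrative; the real MAVIS-Instruct
# schema may differ.
example = {
    "diagram": "diagrams/parabola_001.png",  # path to the rendered figure
    "question": "Find the y-intercept of the parabola shown.",
    "cot_rationale": [
        "The y-intercept is where the curve crosses the y-axis.",
        "From the diagram, the curve crosses the y-axis at (0, -4).",
    ],
    "answer": "-4",
}

# Textual redundancy is kept low: the question avoids restating what the
# diagram already shows, so the model must read the visual content.
assert set(example) == {"diagram", "question", "cot_rationale", "answer"}
```

Structuring the rationale as an ordered list of steps mirrors the CoT annotation style the dataset is described as providing.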
- Training Stages:
- Stage 1: CLIP-Math Encoder: The initial stage involves fine-tuning a vision encoder using MAVIS-Caption with contrastive learning, specifically aimed at improving the visual representation of mathematical diagrams.
- Stage 2: Diagram-Language Alignment: This stage aligns the enhanced vision encoder with an LLM through a projection layer, again using MAVIS-Caption to integrate diagram and language representations.
- Stage 3: Instruction Tuning: Finally, MAVIS-Instruct is used to fine-tune MLLMs for CoT reasoning, significantly enhancing problem-solving capabilities in visual mathematical contexts.
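As a rough illustration of Stage 1, the sketch below implements a symmetric CLIP-style contrastive loss over a batch of matched diagram and caption embeddings. It is a minimal NumPy rendering of generic contrastive learning, not the authors' implementation, and the temperature value is an assumption.

```python
import numpy as np

def clip_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE loss over matched diagram/caption pairs.

    img_emb, txt_emb: (batch, dim) arrays; row i of each is a matched pair.
    Minimal sketch of CLIP-style contrastive training, not the MAVIS code.
    """
    # L2-normalize so dot products are cosine similarities.
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)

    logits = img @ txt.T / temperature   # (batch, batch) similarity matrix
    labels = np.arange(len(logits))      # diagram i matches caption i

    def cross_entropy(l, y):
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(len(y)), y].mean()

    # Average the diagram-to-caption and caption-to-diagram directions.
    return 0.5 * (cross_entropy(logits, labels)
                  + cross_entropy(logits.T, labels))
```

Minimizing this loss pulls each diagram embedding toward its own caption and away from the other captions in the batch, which is how contrastive training on MAVIS-Caption would adapt a generic vision encoder toward mathematical diagrams.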
Experimental Results
The MAVIS framework demonstrates substantial improvements on mathematical benchmarks. Notably, MAVIS-7B outperforms comparable open-source 7B models by +11.0% and the second-best model, LLaVA-NeXT (110B), by +3.0%. These results underscore MAVIS's efficacy in improving diagram interpretation and reasoning accuracy in MLLMs. Performance on benchmarks such as MathVerse and datasets such as GeoQA further validates MAVIS-7B's robustness on visual mathematical challenges.
Theoretical and Practical Implications
MAVIS makes a substantive contribution to the field by demonstrating a method for significantly enhancing the mathematical reasoning capabilities of MLLMs in visual contexts. By pairing specialized datasets with a multi-stage training procedure, the paper advances both the theoretical understanding and the practical capabilities of MLLMs. These advancements hold promise for applications such as education, automated tutoring, and any domain where visual mathematical reasoning is essential.
Future Directions
The MAVIS framework opens avenues for future research, particularly in optimizing the training techniques for further scalability and applying similar methodologies to other domains requiring multimodal reasoning. Exploration into more generalized training frameworks that can be adapted to different subject matters could yield broader enhancements to the capabilities of MLLMs across disciplines.
In summary, this paper presents MAVIS as a structured and comprehensive approach to addressing the critical gap in visual mathematical reasoning within MLLMs, laying the groundwork for future exploration and application in the field of AI.