
Math-LLaVA: Multimodal Math Reasoning Model

Updated 22 February 2026
  • Math-LLaVA is a multimodal model that integrates text and visual inputs to solve complex mathematical problems with high fidelity.
  • It leverages the large, diverse MathV360K dataset along with advanced augmentation and proportional sampling techniques for robust reasoning.
  • The model achieves state-of-the-art open-source performance, reaching 46.6% overall on the MathVista minitest (with subscores up to 57.7% on geometry problem solving), closing the gap with proprietary systems such as GPT-4V.

Math-LLaVA is a multimodal LLM (MLLM) that extends the visual-language LLaVA-1.5 architecture to high-fidelity mathematical reasoning across a broad spectrum of problem types, combining both text and visual modalities. By leveraging an unprecedentedly large and diverse dataset—MathV360K—along with advanced augmentation and fine-tuning techniques, Math-LLaVA achieves state-of-the-art accuracy among open-source models on standard math reasoning benchmarks involving visual information, such as MathVista and MathVerse. It demonstrates that the bottleneck in multimodal mathematical understanding is primarily the breadth, depth, and coverage of multimodal data, rather than changes in model architecture or training objective (Shi et al., 2024).

1. Construction of the MathV360K Dataset

The foundation of Math-LLaVA is the MathV360K dataset, a curated and augmented set designed to maximize both the breadth (task/domain diversity) and depth (reasoning complexity) of multimodal (image, question, answer) triplets. The dataset was constructed as follows:

  • Source Pooling: 24 existing datasets were pooled, spanning tasks such as Figure Question Answering (FQA), Geometry Problem Solving (GPS), Math Word Problems (MWP), Textbook Question Answering (TQA), and general/medical Visual Question Answering (VQA).
  • Clarity & Complexity Annotation: A subset of 10,000 images from these datasets was labeled for clarity (sharp/blurred) and reasoning complexity (4-level scale: 0–3) via GPT-4V annotation, then used to fine-tune two ViT-Large-Patch16 classifiers.
  • Proportional Sampling: After blurry images were filtered out, a 2:3:4:1 ratio over complexity levels 0–3 was enforced, yielding 40,000 high-quality, stratified image–question–answer triplets.
  • Automated Augmentation: Approximately 320,000 new pairs were synthesized by:

    1. AskImg: GPT-4V generated novel questions for each image with multi-step reasoning, producing about 200,000 pairs.
    2. CompQ: Questions were rewritten into more complex, multi-step forms (40,000 pairs).
    3. RephQ & SimpQ: Questions were rephrased for logical equivalence (RephQ, 40,000) or systematically underspecified (SimpQ, 40,000) to force greater image reliance.

Combining all, MathV360K comprises 360,000 question–answer pairs over 40,000 distinct images. Empirical analysis demonstrated that the 2:3:4:1 complexity ratio and the full augmentation pipeline maximize reasoning accuracy (Shi et al., 2024).
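The clarity-filter-then-stratified-sample step of this pipeline can be sketched in a few lines of Python. The item format and the function name here are illustrative, not from the released codebase; only the 2:3:4:1 ratio and the 40,000-triplet target come from the paper:

```python
import random

# Ratio over complexity levels 0..3, as reported for MathV360K
COMPLEXITY_RATIO = (2, 3, 4, 1)

def stratified_sample(items, total, ratio=COMPLEXITY_RATIO, seed=0):
    """Filter blurry images, then sample per-level quotas in the given ratio.

    items: iterable of (image_id, is_sharp, complexity_level) tuples,
           a hypothetical encoding of the two classifiers' outputs.
    """
    rng = random.Random(seed)
    sharp = [it for it in items if it[1]]              # clarity filter
    quota = [total * r // sum(ratio) for r in ratio]   # per-level counts
    sample = []
    for level, n in enumerate(quota):
        pool = [it for it in sharp if it[2] == level]
        sample.extend(rng.sample(pool, min(n, len(pool))))
    return sample
```

With `total=40_000` this yields 8,000 / 12,000 / 16,000 / 4,000 triplets for levels 0–3, matching the 2:3:4:1 stratification described above.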

2. Model Architecture and Fine-Tuning Protocol

Math-LLaVA builds directly upon the LLaVA-1.5-13B pipeline, characterized by the following architectural design:

  • Visual Encoder: The pretrained CLIP ViT-L/14 vision tower from LLaVA-1.5 is retained, extracting a 24×24 grid of patch embeddings (576 visual tokens) from a 336×336 RGB image.

  • Projection Layer: A single trainable linear transformation maps the image encoder outputs into the Vicuna LLM’s embedding space.
  • Text Processing: [IMG] tokens and projected visual embeddings are prepended to the Vicuna LLM prompt, followed by the question text.
  • Backbone LLM: Vicuna-v1.5-13B serves as the language modeling core, with all weights (including projection and Vicuna) jointly fine-tuned on MathV360K.

The only learning objective is autoregressive cross-entropy over the ground-truth answer sequences, as in standard sequence-to-sequence instruction tuning; no specialized reasoning or contrastive objectives are applied. The model is trained for two epochs with batch size 16 and a learning rate of 2e-5 using AdamW.
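The answer-supervised objective can be sketched compactly. Masking the loss to answer tokens only (with prompt and image tokens excluded) is the standard instruction-tuning convention; the function name and NumPy formulation here are illustrative, not the paper's implementation:

```python
import numpy as np

def answer_only_nll(logits, targets, answer_mask):
    """Autoregressive cross-entropy averaged over answer tokens only.

    logits:      (T, V) next-token logits from the LLM
    targets:     (T,)   ground-truth token ids
    answer_mask: (T,)   1.0 where the token belongs to the answer,
                        0.0 for image/question (prompt) positions
    """
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    token_nll = -log_probs[np.arange(len(targets)), targets]
    return (token_nll * answer_mask).sum() / answer_mask.sum()
```

In a full training setup this loss would be minimized with AdamW at the hyperparameters stated above (batch size 16, learning rate 2e-5, two epochs).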

3. Data Balancing, Sampling Strategies, and Augmentation

Maximal performance gains derive not just from dataset scale, but also from the mixture and augmentation protocol:

  • Task and Complexity Balancing: Each of the five major task groups (FQA, GPS, MWP, TQA, VQA) is represented proportionally, avoiding overrepresentation of any single domain.
  • Augmentation Component Contributions: Ablation demonstrates that “AskImg” yields the largest single accuracy boost, but the full bundle of AskImg, CompQ, RephQ, and SimpQ is required for the 19 percentage point aggregate gain over the baseline model.
  • Complexity Ratio Effect: The 2:3:4:1 complexity distribution outperforms uniform or decreasing alternatives, implying that mid-to-high reasoning complexity is critical for robust multimodal problem-solving.

Augmentation serves not only to increase data quantity, but to diversify image grounding requirements, logical structure, linguistic surface form, and problem difficulty.
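The reported dataset bookkeeping is internally consistent, as a quick sanity check shows. The counts come from the construction steps described earlier; the variable names are illustrative:

```python
# Reported MathV360K composition (Shi et al., 2024), in QA pairs
seed_pairs = 40_000    # stratified originals sampled from the 24 source sets
askimg     = 200_000   # GPT-4V-generated new questions (largest single gain)
compq      = 40_000    # complexity-increased rewrites
rephq      = 40_000    # logically equivalent rephrasings
simpq      = 40_000    # deliberately underspecified variants

augmented = askimg + compq + rephq + simpq   # ~320,000 synthesized pairs
total = seed_pairs + augmented
print(total)  # → 360000
```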

4. Performance Benchmarks and Comparative Analysis

Math-LLaVA sets leading performance levels on several multimodal mathematics benchmarks:

| Model          | MathVista (ALL) | GPS   | MWP   | GEO   |
|----------------|-----------------|-------|-------|-------|
| LLaVA-1.5-13B  | 27.7%           | 22.7% | 18.3% | 22.8% |
| Math-LLaVA-DS  | 38.2%           | 47.2% | 41.4% | 45.6% |
| Math-LLaVA     | 46.6%           | 57.7% | 56.5% | 56.5% |
| GPT-4V         | 49.9%           | 50.5% | 57.5% | 51.0% |

Math-LLaVA closes the open-source–proprietary gap, reaching 46.6% on MathVista minitest—approaching GPT-4V (49.9%) and surpassing previous open-source models by ~12 points. On the general multidisciplinary MMMU benchmark, Math-LLaVA remains robust: 38.3% overall versus 36.4% for LLaVA-1.5-13B (Shi et al., 2024).

5. Model Outputs and Reasoning Capabilities

Math-LLaVA produces step-by-step, chain-of-thought (CoT) solutions with correct symbolic manipulations, algebraic calculations, and quantitative reasoning even on multi-step visual math problems. Example types include:

  • Pie Chart Arithmetic: Computing differences from segmented pie chart data.
  • Geometric Diagram Calculations: Applying Pythagorean theorem to labeled diagrams.
  • Combinatorics on Grids: Counting geometric configurations (e.g., rectangle enumeration on dot grids).

These CoTs mirror standard mathematical solution writing, generating intermediate steps and box-formatted final answers in LaTeX when prompted (Shi et al., 2024).
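As a hypothetical illustration of this output style (not a model transcript), a geometric-diagram CoT for a right triangle with labeled legs 3 and 4 might read:

```latex
% Illustrative Math-LLaVA-style chain of thought with a boxed final answer
The diagram shows a right triangle with legs $a = 3$ and $b = 4$.
By the Pythagorean theorem, $c^2 = a^2 + b^2 = 9 + 16 = 25$,
so $c = \sqrt{25} = 5$. The final answer is $\boxed{5}$.
```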

6. Limitations and Open Research Directions

Despite substantial progress, Math-LLaVA and its underlying MathV360K data expose several limitations:

  • Lack of Explicit Rationale Annotations: MathV360K lacks explicit stepwise rationales or formal proof paths; intermediate reasoning is not directly supervised.
  • Limited Symbolic Diagram Manipulation: Problems requiring auxiliary construction or formal geometric proof (e.g., with auxiliary lines) remain challenging.
  • Robustness on Underspecified/Adversarial Input: Certain image designs can trigger inconsistent outputs; underspecification was used to probe this failure mode.
  • Integration with Formal Toolchains: Future directions include combining the model with external computer algebra systems (CAS) or geometric theorem provers for formal derivation and verification tasks.

The release of MathV360K and codebase provides a foundation for advancing toward richer multimodal mathematical intelligence (Shi et al., 2024).

Math-LLaVA advances multimodal mathematical reasoning by scaling the LLaVA pipeline with large, complexity-stratified, and heavily augmented mathematical-visual datasets. It contrasts with prior geometry-specialized approaches (such as G-LLaVA, which focused on logical-form annotation and contrastive sub-QA for geometric diagrams (Gao et al., 2023)) and models integrating meta-in-context learning for solid geometry (Geo-LLaVA (Xu et al., 2024)). Unlike models that emphasize specialized architecture changes, Math-LLaVA demonstrates that dataset diversity and strategic instruction tuning can close much of the capability gap with closed-source giants such as GPT-4V.

A plausible implication is that further advances may hinge more on enriched data with reasoning rationales, formal proof annotations, and explicit symbolic tool coupling than on architectural novelty alone.
