
MathCanvas Framework: Visual Math Reasoning

Updated 1 February 2026
  • MathCanvas framework is a comprehensive methodology enabling unified diagram generation and symbolic reasoning for challenging visual mathematics tasks.
  • It employs a two-stage training process that includes visual manipulation pretraining and strategic visual-aided reasoning fine-tuning.
  • Empirical benchmarks demonstrate significant improvements in geometry, trigonometry, and algebra compared to traditional large multimodal models.

The MathCanvas framework is a comprehensive methodology and toolkit for equipping unified Large Multimodal Models (LMMs) with intrinsic visual chain-of-thought (VCoT) capabilities, specifically tuned for complex visual mathematical reasoning tasks. MathCanvas addresses the limitations of previous Visual CoT approaches, which were often constrained by reliance on external tools or limited diagram fidelity, by providing datasets, model architectures, and a benchmark to jointly learn and strategically interleave high-fidelity diagram generation with symbolic reasoning for mathematical domains such as geometry, algebra, and statistics (Shi et al., 16 Oct 2025).

1. Visual Manipulation Pretraining

The initial phase of MathCanvas pretraining, the Visual Manipulation Stage, targets diagram generation and editing, utilizing a large-scale corpus organized into two primary datasets: MathCanvas-Imagen and MathCanvas-Edit.

  • MathCanvas-Imagen comprises 10 million caption-to-diagram pairs. Sourcing is threefold: 5.4 million pairs are harvested stepwise from MathCanvas-Edit edit trajectories, 4 million from the ImgCode-8.6M code-to-diagram dataset annotated with GPT-4.1-mini captions, and 0.612 million from public resources (MAVIS, TR-CoT). All diagrams are standardized to 512×512, stored as both raw PNG and VAE-encoded latent vectors.
  • MathCanvas-Edit aggregates 5.2 million step-by-step editing trajectories. Of these, 4.2 million arise from "Competition-Level Mining" of olympiad-style geometry problems using AlphaGeometry LLM strategies (subject to symbolic constraint filtering), while 1 million – the "Foundational Structure Generation" subset – are algorithmically generated across shape primitives and geometric relations. Each edit step maps between source and target VAE latent representations conditioned on an edit instruction; random rendering seeds diversify output.
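To make the edit-trajectory structure concrete, a single MathCanvas-Edit step could be represented as below. The field names and latent sizes are hypothetical, chosen for illustration, and do not reflect the released dataset schema:

```python
from dataclasses import dataclass, field

# Hypothetical record layout for one edit step: a source/target pair of
# VAE latents conditioned on an edit instruction, plus a rendering seed.
@dataclass
class EditStep:
    instruction: str                  # natural-language edit instruction
    source_latent: list = field(default_factory=list)  # latent before the edit
    target_latent: list = field(default_factory=list)  # latent after the edit
    render_seed: int = 0              # random seed diversifying the rendering

step = EditStep(
    instruction="Draw the circumcircle of triangle ABC.",
    source_latent=[0.0] * 16,
    target_latent=[0.1] * 16,
    render_seed=42,
)
```

A full trajectory would then be a list of such steps, each mapping the previous target latent to the next.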

Pretraining is conducted on the BAGEL-7B-MoT backbone, with a distinct split between an "Understanding Expert" (frozen) and a "Generation Expert" (tuned), joined via cross-modal attention. Diagrams are reconstructed using a Rectified-Flow loss, with ground-truth diagram VAE latent $z_0$, noise schedule $\sigma(t)$, and noised latent $z_t = z_0 + \sigma(t)\epsilon$ for $\epsilon \sim \mathcal{N}(0, I)$. The Generation Expert predicts a denoised latent $\hat{z}_0$, minimizing the expected squared error:

$$L_{\mathrm{flow}} = \mathbb{E}_{t, z_0, \epsilon} \left[ \| r_\theta(z_t, t, c) - z_0 \|^2 \right].$$

No cross-entropy component is present at this stage.
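The Stage I objective can be sketched in a few lines of NumPy. Here `rectified_flow_loss`, the linear noise schedule, and the toy "perfect" model are illustrative stand-ins (the conditioning input $c$ is omitted for brevity):

```python
import numpy as np

rng = np.random.default_rng(0)

def rectified_flow_loss(r_theta, z0, t, sigma):
    """Mean squared error between the predicted denoised latent and z_0.

    Minimal sketch of the Stage I objective; `r_theta` stands in for the
    Generation Expert, and the conditioning input c is dropped here.
    """
    eps = rng.standard_normal(z0.shape)       # epsilon ~ N(0, I)
    zt = z0 + sigma(t) * eps                  # noised latent z_t
    z0_hat = r_theta(zt, t)                   # predicted denoised latent
    return float(np.mean((z0_hat - z0) ** 2))

# Sanity check: a model that returns the clean latent incurs zero loss.
z0 = rng.standard_normal((4, 16))
perfect = lambda zt, t: z0
loss = rectified_flow_loss(perfect, z0, t=0.5, sigma=lambda t: t)
# loss == 0.0
```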

2. Strategic Visual-Aided Reasoning Fine-Tuning

Following visual manipulation pretraining, MathCanvas applies a Strategic Visual-Aided Reasoning phase, leveraging the MathCanvas-Instruct dataset and fully updating all weights in the LMM.

  • MathCanvas-Instruct comprises 219,000 examples after filtering and deduplication. Each example interleaves text and visual steps: textual reasoning is explicitly separated from visual outputs using the <im_end> marker, and the <|vision_start|> token signals diagram generation from noised VAE tokens. This annotation format teaches the LMM to decide the strategic insertion of diagrammatic reasoning steps within multimodal synthetic and real math problems.

Fine-tuning is conducted with all weights trainable, jointly optimizing a weighted loss combining cross-entropy (for next-token prediction, including visual start/stop tokens) and Rectified-Flow (for diagram segments), with

$$L_{\mathrm{Stage\,II}} = \lambda_{\mathrm{CE}} L_{\mathrm{CE}} + \lambda_{\mathrm{flow}} L_{\mathrm{flow}},$$

where λCE=0.25\lambda_{\mathrm{CE}} = 0.25, λflow=1.0\lambda_{\mathrm{flow}} = 1.0. The model learns end-to-end when to emit visual outputs (diagram drawing) versus continuing textual reasoning, without fixed tool-calling policies. Dual Classifier-Free Guidance (CFG) during inference improves diagram quality.
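The weighted combination itself is trivial but worth pinning down; a sketch with the reported coefficients:

```python
def stage2_loss(l_ce, l_flow, lam_ce=0.25, lam_flow=1.0):
    # Stage II objective: L = lambda_CE * L_CE + lambda_flow * L_flow,
    # with lambda_CE = 0.25 and lambda_flow = 1.0 as reported.
    return lam_ce * l_ce + lam_flow * l_flow

total = stage2_loss(l_ce=2.0, l_flow=0.5)  # 0.25 * 2.0 + 1.0 * 0.5 = 1.0
```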

3. MathCanvas-Bench: Benchmark and Evaluation

Rigorous evaluation is facilitated by MathCanvas-Bench, a curated 3,000-problem benchmark drawn from the held-out portion of MathCanvas-Instruct after deduplication and category balancing. Only generative-answer (freeform) problems are retained; multiple-choice items are excluded. Category sampling covers Algebra, Analytic Geometry, Plane Geometry, Solid Geometry, Trigonometry, Statistics, Calculus, and Transformational Geometry, with up-sampling for rare types by a $(\mathrm{category\ proportion})^{0.7}$ weight.
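The sub-unit exponent flattens the category distribution, so rare types are relatively up-sampled. A minimal sketch, assuming the exponentiated proportions are renormalized to sum to 1 (the exact normalization is not specified in the source):

```python
def sampling_weights(category_counts, alpha=0.7):
    """Up-weight rare categories by (proportion)**alpha, then renormalize.

    Illustrative sketch of the stated balancing rule; normalization to a
    probability distribution is an assumption.
    """
    total = sum(category_counts.values())
    raw = {c: (n / total) ** alpha for c, n in category_counts.items()}
    z = sum(raw.values())
    return {c: w / z for c, w in raw.items()}

w = sampling_weights({"Algebra": 900, "Statistics": 100})
# Statistics ends up above its 10% raw share after the 0.7 exponent.
```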

Performance is assessed using the VLMEvalKit toolkit to guarantee uniform prompts and decoding conditions across 20 LMMs. Metrics include:

  • Complete Accuracy: 1 if all sub-questions correct, 0 otherwise.
  • Weighted Scoring: Assigns weights to sub-questions within a problem via

$$w_i = \frac{1.3^{i-1}}{\sum_{j=1}^{N} 1.3^{j-1}},$$

where $N$ is the number of sub-questions.
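Under this scheme, later sub-questions receive geometrically more credit while the weights still sum to 1. A direct transcription of the formula:

```python
def subquestion_weights(n):
    # w_i = 1.3**(i-1) / sum_j 1.3**(j-1); later sub-questions weigh more.
    denom = sum(1.3 ** (j - 1) for j in range(1, n + 1))
    return [1.3 ** (i - 1) / denom for i in range(1, n + 1)]

w = subquestion_weights(3)
# Weights increase monotonically: approximately [0.251, 0.326, 0.424]
```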

Automated scoring leverages GPT-4.1 in a two-stage JSON evaluation protocol.

4. Quantitative and Ablative Results

The MathCanvas-trained model, BAGEL-Canvas, achieves a 34.4% weighted score on MathCanvas-Bench, representing an 86% relative improvement over the BAGEL baseline (18.5%). Sub-domain gains are substantial: Plane Geometry (+19.2), Solid Geometry (+12.3), and Trigonometry (+27.1). Despite being an open model, BAGEL-Canvas also outperforms the closed-source GPT-4.1 (30.0) and Gemini-2.0-Flash (32.6).

Generalization is confirmed by improvements on external public benchmarks (MathVista: +10.5 points; MathVerse Text-Dominant: +16.2, Lite: +17.9; MathVision: +8.8 overall, with pronounced gains in Analytic Geometry and Algebra).

Ablation studies reveal:

  • Omitting MathCanvas-Edit pretrain leads to a drop of 2.4 points (34.4 → 32.0).
  • Further removal of MathCanvas-Imagen reduces the score to 30.8.
  • Fine-tuning without visual data ("BAGEL-Canvas-Text") scores 30.9.
  • Forcing the full model to skip images at inference ("Skip Image") yields 31.9. This suggests that the visual manipulation pretraining stages contribute significantly to downstream performance even when diagrams are not emitted during strategically interleaved reasoning.

5. Implementation Protocols and Convergence

Training is conducted on 16 NVIDIA H800 GPUs using AdamW with decoupled weight decay. Stage I utilizes a learning rate of $2 \times 10^{-5}$ (80k steps, cosine decay) with a batch size of approximately 46k tokens. The Understanding Expert is frozen, with ViT-cond dropout at 0.3, and optimization is exclusively against the Rectified-Flow loss. Stage II sets a learning rate of $1 \times 10^{-5}$ (16k steps, 500-step warmup), a batch size of 51k tokens, all weights unfrozen, ViT-cond dropout at 0.1, and a total loss combining CE (0.25) and Flow (1.0) components. The dual CFG guidance ratio is 1.0 at inference.
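The reported stage-wise hyperparameters can be collected into a single configuration; the dict layout and key names below are illustrative, only the values come from the source:

```python
# Stage-wise hyperparameters as reported; structure is illustrative.
STAGES = {
    "stage1": {"lr": 2e-5, "steps": 80_000, "schedule": "cosine",
               "batch_tokens": 46_000, "vit_cond_dropout": 0.3,
               "loss_weights": {"flow": 1.0}},
    "stage2": {"lr": 1e-5, "steps": 16_000, "warmup_steps": 500,
               "batch_tokens": 51_000, "vit_cond_dropout": 0.1,
               "loss_weights": {"ce": 0.25, "flow": 1.0}},
}
```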

Convergence is determined for Stage I by monitoring validation loss on 100,000 held-out pretrain pairs, with early stopping if plateaued for 5,000 steps. Stage II tracks the weighted score on a 10% SFT-validation split, ceasing training when improvements are less than 0.2 points over 2,000 steps.
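A plateau check of the kind described could look like the following; the bookkeeping (evaluation cadence, best-so-far comparison) is an assumption, as the exact procedure is not published:

```python
def should_stop(scores, steps_per_eval, patience_steps, min_delta):
    """Stop when the metric has improved by less than `min_delta`
    within the trailing `patience_steps` of training.

    `scores` holds one metric value per evaluation, oldest first.
    Illustrative sketch; not the released training code.
    """
    patience_evals = patience_steps // steps_per_eval
    if len(scores) <= patience_evals:
        return False                      # not enough history yet
    recent_best = max(scores[-patience_evals:])
    earlier_best = max(scores[:-patience_evals])
    return recent_best - earlier_best < min_delta

# Stage II-style check: eval every 500 steps, 2,000-step patience,
# 0.2-point minimum improvement.
stop = should_stop([30.0, 31.0, 31.05, 31.1, 31.12, 31.15], 500, 2000, 0.2)
# stop is True: only +0.15 over the last 2,000 steps.
```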

6. Visual Chain-of-Thought Interleaving: Mechanisms and Examples

MathCanvas operationalizes the intrinsic visual chain-of-thought process through specialized annotation and model architecture. Key mechanisms include <|vision_start|> to trigger diagram generation, <im_end> markers delimiting segments, and token-level gating for visual/textual mode prediction. The model is trained end-to-end to learn strategic transition points between modalities, dispensing with hard-coded toolcalls or external policies.
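Using the marker tokens named above, a decoded output stream can be grouped into alternating text and vision segments. The grouping logic here is an illustrative assumption, not the released parser:

```python
def split_segments(tokens, vision_start="<|vision_start|>", im_end="<im_end>"):
    """Group a decoded token stream into ('text', ...) / ('vision', ...) segments.

    <|vision_start|> switches the stream into diagram-generation mode;
    <im_end> closes the current segment and returns to text mode.
    """
    segments, current, mode = [], [], "text"
    for tok in tokens:
        if tok == vision_start:
            if current:
                segments.append((mode, current))
            current, mode = [], "vision"
        elif tok == im_end:
            if current:
                segments.append((mode, current))
            current, mode = [], "text"
        else:
            current.append(tok)
    if current:
        segments.append((mode, current))
    return segments

segs = split_segments(["Draw", "AB", "<im_end>", "<|vision_start|>", "img_tok"])
# -> [('text', ['Draw', 'AB']), ('vision', ['img_tok'])]
```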

Qualitative examples demonstrate this interleaving: for instance,

  1. "Start with triangle ABC."
  2. "Draw the circumcircle of triangle ABC." | <|vision_start|> → diagram generated.
  3. "Mark point D where circle meets AB again." | subsequent diagram emitted.

Similarly, in fine-tuned MathCanvas-Instruct outputs:

  • Text: "Now drop the altitude from A to BC." <im_end> triggers <|vision_start|> and diagram.
  • Further text resumes: "Thus triangles ABD and ACD are right, so…."

These behaviors demonstrate the model’s acquired capacity to alternate between symbolic and diagrammatic reasoning steps, supporting complex, human-like problem-solving workflows in visual mathematics domains.

7. Significance and Context

MathCanvas establishes a unified approach to multimodal mathematical reasoning, tackling the bottleneck left by text-only LLMs in geometry and related fields. Its intrinsic VCoT capability is built through alignment of high-capacity LMMs, rich visual datasets, systematic annotation strategies, and domain-specific benchmarks. The empirical results indicate not only substantial absolute and relative improvement over strong baselines but also robust generalization to diverse public benchmarks. A plausible implication is that strategic visual-textual interleaving—underpinned by large-scale, interleaved training—constitutes a core ingredient for advancing machine mathematical reasoning beyond the textual regime (Shi et al., 16 Oct 2025).
