Chain-of-Thought Reasoning in AI
- Chain-of-thought reasoning is a technique where models generate intermediate steps to decompose complex problems, enhancing transparency and structured decision-making.
- It is applied across domains—including multimodal mathematical tasks—where integrating visual and textual cues is critical despite modest accuracy gains.
- Recent innovations such as intrinsic visual CoT, process-level supervision, and token-level alignment strategies are driving improvements in stepwise reasoning performance.
Multimodal mathematical reasoning refers to the use of models and algorithms that jointly process and integrate visual, textual, and sometimes auditory modalities to achieve accurate, interpretable, and compositional mathematical reasoning. The field aims to close the gap between machine and human mathematical proficiency on problems where diagrams, natural language, algebraic expressions, and, increasingly, other modalities (e.g., spoken instructions or video) must be understood and reasoned over in combination. The domain is driven by the emergence of large vision-language models (VLMs/MLLMs) and by systematic benchmarking, which reveal both the promise and the present inadequacy of current systems for truly vision-grounded mathematical reasoning.
1. Problem Formulation and Benchmark Paradigms
Multimodal mathematical reasoning is generally defined as the task of producing a mathematically valid answer a to a question (Q, V), where Q is a natural-language (and possibly symbolic) problem statement and V is a set of one or more associated visual signals (diagrams, graphs, photos, video frames). This is formalized as a function f: (Q, V) → a. Benchmarks are designed to span an array of mathematical topics (geometry, algebra, graph theory, statistics, logic) and tasks that require deductive, spatial, and quantitative reasoning.
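The mapping f: (Q, V) → a can be sketched as a typed interface; the names below are illustrative, not taken from any cited benchmark:

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class MultimodalMathProblem:
    question: str        # natural-language / symbolic statement Q
    images: List[bytes]  # visual signals V (diagrams, charts, video frames)

# f : (Q, V) -> a, typed here as a callable returning the answer string
Solver = Callable[[MultimodalMathProblem], str]

def trivial_solver(p: MultimodalMathProblem) -> str:
    # placeholder: a real solver would run a VLM over (question, images)
    return "unknown"

prob = MultimodalMathProblem("What is 2+2?", images=[])
print(trivial_solver(prob))  # -> unknown
```

Framing solvers behind a single signature like this is what lets benchmarks swap in text-only, image-masked, or image-variant versions of the same problem set.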
Benchmarking paradigms include:
- Single-image grounding: Traditional tasks pairing a question with one diagram, probing geometric, algebraic, or chart interpretation skills (Wang et al., 2024).
- Multi-image/multi-visual fusion: Scenarios with multiple images, requiring cross-image relational or sequential reasoning (Wang et al., 28 Feb 2025, Wang et al., 24 Apr 2025).
- Image-variant controls: Comparison of visually similar but semantically distinct diagrams to force reliance on perception over language priors (Liu et al., 6 Mar 2025, Wang et al., 28 Nov 2025).
- Video-based mathematical reasoning: Dynamic problems extending grounding to temporally evolving multimodal streams (Rasheed et al., 5 Jun 2025).
A key distinction is made between "knowledge-centric" benchmarks (emphasizing domain content) and "perceptual-reliant" benchmarks (explicitly requiring visual integration for correctness).
2. Dataset Structures and Modalities
Modern benchmarks are constructed to maximize mathematical, visual, and contextual diversity:
- Visual context types: Vector and raster diagrams, hand-drawn sketches, photo captures, scanned worksheets, chart images, and video frames (Wang et al., 2024, Rasheed et al., 5 Jun 2025).
- Task types: Multiple-choice, fill-in-the-blank, open-ended generation, proof construction, chain-of-thought (CoT) annotation.
- Granularity of annotation: Many datasets provide step-by-step solutions, allowing for both outcome and process-based evaluation (Sun et al., 2024, Zhang et al., 6 Aug 2025, Xiang et al., 2024).
- Examples:
- VCBench: 1,720 elementary-level problems, each with 2–18 images, six cognitive domains (calendar, spatial, geometric, etc.), emphasizing explicit cross-image dependencies (Wang et al., 24 Apr 2025).
- MathSight: Each university-level problem is rendered as original, hand-drawn, photo-captured, and text-only variants, enabling controlled studies of visual robustness (Wang et al., 28 Nov 2025).
- MM-MATH: 5,929 open-ended geometry problems with step-level error categorization and outcome/process scoring (Sun et al., 2024).
A summary table of representative dataset properties follows:
| Benchmark | Scale | Visual Type(s) | Domain Diversity | Stepwise Annotations |
|---|---|---|---|---|
| VCBench (Wang et al., 24 Apr 2025) | 1,720 | Multi-image, photo | Broad (6 domains) | No |
| HC-M3D (Liu et al., 6 Mar 2025) | 1,851 | Diagram, controlled var | Geometry, logic | Yes (pairwise) |
| MATH-V (Wang et al., 2024) | 3,040 | Contest diagrams | 16 subjects | Yes |
| MathSight (Wang et al., 28 Nov 2025) | 661+1,387 | 3 image variants+text | University level | Yes (confidence) |
| MV-MATH (Wang et al., 28 Feb 2025) | 2,009 | 2–8 interleaved images | K–12/11 subjects | Yes |
| VideoMathQA (Rasheed et al., 5 Jun 2025) | 420 | Video+audio+text | 10 domains | Yes (multi-step) |
3. Empirical Findings and Core Challenges
Extensive experiments reveal the following empirical phenomena:
Dominance of Textual Cues over Visual Information:
Across benchmarks, the impact of visual input on accuracy is minimal except in highly controlled settings. For instance, in MathSight, Qwen3-VL outperforms even GPT-5 when all images are withheld—indicating models often solve multimodal math problems primarily through their linguistic priors (Wang et al., 28 Nov 2025). In HC-M3D, shuffling or masking diagrams during training causes only a 0–4 percentage point drop, compared to 20–60 points on general VQA tasks (Liu et al., 6 Mar 2025). VCBench shows that collapsing all images onto a single canvas boosts model accuracy by 42%, since existing architectures better exploit layout salience than true compositional integration (Wang et al., 24 Apr 2025).
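The masking and shuffling ablations described above reduce to comparing accuracy with and without visual input. A minimal harness, with a placeholder solver and toy dataset, might look like this:

```python
from typing import Callable, List, Optional, Tuple

# A solver takes (question, images); images=None simulates withheld visuals.
Solver = Callable[[str, Optional[List[bytes]]], str]

def visual_reliance_gap(solver: Solver,
                        data: List[Tuple[str, List[bytes], str]]) -> float:
    """Accuracy(full input) - Accuracy(images masked), in percentage points.
    A near-zero gap indicates the model answers from text priors alone."""
    full = sum(solver(q, imgs) == gold for q, imgs, gold in data)
    masked = sum(solver(q, None) == gold for q, _, gold in data)
    return 100.0 * (full - masked) / len(data)

# Toy solver that ignores its images entirely -> gap of 0 points
toy = lambda q, imgs: "4" if "2+2" in q else "?"
data = [("2+2?", [b"img"], "4"), ("3+3?", [b"img"], "6")]
print(visual_reliance_gap(toy, data))  # -> 0.0
```

The 0–4 point drops reported for math benchmarks, versus 20–60 points on general VQA, are exactly this statistic computed over real model outputs.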
Insufficient Visual Granularity:
Current vision backbones (e.g., CLIP, ViT) do not reliably distinguish subtle geometric modifications (e.g., point swaps, line moves) necessary for fine-grained mathematical inference (Liu et al., 6 Mar 2025, Li et al., 7 Jun 2025). In VisioMath, model accuracy falls rapidly when answer options are visually similar, exposing the inability to resolve small but semantically crucial visual distinctions (Li et al., 7 Jun 2025).
Process-wise Bottlenecks in Diagram Comprehension:
Error analysis shows that the majority of first-step failures arise from diagram misinterpretation (e.g., mislocating a midpoint, confusing parallel lines), accounting for over 60% of errors in MM-MATH and similar proportions in other process-level studies (Sun et al., 2024).
Deterioration on Problem Complexity and Visual Degradation:
The effect of visual input decreases as task complexity rises. On MathSight, as difficulty moves from undergraduate to graduate level, models increasingly ignore images, and text-only accuracy surpasses image-assisted performance (Wang et al., 28 Nov 2025). When diagrams are degraded from typeset to hand-drawn or photo, accuracy drops even further.
Limited Gains from Enhanced Visual Encoders and CoT Prompting:
Stacking multiple vision encoders (e.g., CLIP + DINO + SIGLIP) improves generic VQA but has negligible or negative effect on math accuracy (Liu et al., 6 Mar 2025). Chain-of-thought prompting leads to small, inconsistent improvements on complex multimodal tasks but does not yield the stepwise gains observed in textual domains (Wang et al., 2024, Wang et al., 28 Nov 2025).
4. Architectural Innovations and Training Methodologies
Recent methodological advances target both architectural and data-centric obstacles:
- Reason Chunking and Critical Reasoning Units (CRUs): ViRC segments reasoning into intermediate propositions, switching visual context only at chunk boundaries. This chunked approach, supported by the CRUX dataset, mimics human visual reasoning and demonstrably increases accuracy and generalization (Wang et al., 16 Dec 2025).
- Progressive Multimodal Alignment: Math-PUMA enforces token-level alignment between vision-rich and text-rich modalities using Kullback-Leibler divergence on next-token distributions, eliminating the "accuracy pyramid" that favors text-only over vision-only inputs. A three-stage regime—textual bootstrapping, KL-based alignment, then multimodal instruction tuning—achieves balanced performance across modality variants (Zhuang et al., 2024).
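The KL-based alignment term can be sketched as follows. This is a simplified NumPy illustration of the loss shape, not Math-PUMA's actual training code; the [seq_len, vocab] logit layout and the direction of the divergence are assumptions:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def kl_alignment_loss(text_logits: np.ndarray,
                      vision_logits: np.ndarray) -> float:
    """KL(p_text || p_vision) averaged over token positions.
    Penalizes the vision-input branch for drifting from the stronger
    text-input branch's next-token distribution (shapes: [seq_len, vocab])."""
    p = softmax(text_logits)
    q = softmax(vision_logits)
    return float(np.mean(
        np.sum(p * (np.log(p + 1e-12) - np.log(q + 1e-12)), axis=-1)))

# identical logits on both branches -> zero divergence
z = np.random.randn(4, 10)
print(kl_alignment_loss(z, z.copy()))  # -> 0.0
```

Driving this term to zero makes the model's next-token behavior invariant to whether the problem arrives as text or as an image, which is precisely what flattens the accuracy pyramid.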
- Intrinsic Visual Chain-of-Thought (VCoT): MathCanvas enables end-to-end visual reasoning within a unified LMM via explicit diagram generation/editing at each deduction step, outperforming prior visual CoT methods on interleaved visual-text benchmarks (Shi et al., 16 Oct 2025).
- Generative Step-level Critique and Correction (GM-PRM): GM-PRM moves beyond binary step verification, training a model to interpret, critique, and correct each reasoning step, producing refined outputs that combine interpretability with data-efficient accuracy gains (Zhang et al., 6 Aug 2025).
- CoT Diversity Supervision and Reinforcement Learning: Qwen-VL-DP models, trained on MathV-DP's diverse solution trajectories with GRPO RL, learn to represent and discriminate among multiple valid mathematical strategies, resulting in improved accuracy and effective semantic diversity (Shi et al., 3 Jul 2025).
- Atomic Step “Slow Thinking”: AtomThink decomposes reasoning into atomic minimal inferences, supervised with PRM-guided search, achieving substantial gains on both MathVista and MathVerse (Xiang et al., 2024).
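PRM-guided selection over atomic steps can be illustrated with a greedy sketch; AtomThink uses richer search, so treat the structure below (per-step candidate lists, a trace-scoring PRM) as an assumed simplification:

```python
from typing import Callable, List

# A PRM scores a partial reasoning trace; higher = more promising.
ProcessReward = Callable[[List[str]], float]

def greedy_atomic_search(candidates_per_step: List[List[str]],
                         prm: ProcessReward) -> List[str]:
    """At each atomic step, keep the candidate continuation that the
    process reward model scores highest given the trace so far."""
    trace: List[str] = []
    for candidates in candidates_per_step:
        best = max(candidates, key=lambda c: prm(trace + [c]))
        trace.append(best)
    return trace

# toy PRM: prefers steps containing the token "valid"
prm = lambda trace: float(sum("valid" in s for s in trace))
steps = [["valid: AB = 3", "guess: AB = 5"],
         ["valid: area = 6", "skip"]]
print(greedy_atomic_search(steps, prm))
# -> ['valid: AB = 3', 'valid: area = 6']
```

Replacing the greedy choice with beam search over the same PRM recovers the search-based variant.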
- Describe-then-Reason Training: The VCAR pipeline decouples visual description (comprehension) from mathematical reasoning, boosting performance especially on problems demanding precise figure understanding (Jia et al., 2024).
5. Evaluation Metrics, Process Evaluation, and Failure Taxonomy
Process-level and outcome-level evaluation are now standard:
- Outcome accuracy: Fraction of items for which the final answer is exactly correct (with symbolic or numeric matching as appropriate).
- Stepwise/process evaluation: Models' reasoning traces are compared to annotated chains; failures classified into: diagram misinterpretation, logic slips, calculation errors, and text misreading (Sun et al., 2024).
- Visual reliance metrics: the change in accuracy when images are masked, shuffled, or withheld, quantifying how much a model's answers actually depend on visual input (Liu et al., 6 Mar 2025).
- Diversity and alignment metrics: Semantic diversity of generated solutions, alignment of internal logits across modalities (Shi et al., 3 Jul 2025, Zhuang et al., 2024).
- Human-vs-model gap: Even the best models cap at 50–70% on easier K–12 benchmarks and often remain below 25–35% on university-level, multi-image, or video-based settings, compared to human performance of 76–93% (Wang et al., 28 Feb 2025, Wang et al., 24 Apr 2025, Gupta et al., 2024, Wang et al., 2024, Rasheed et al., 5 Jun 2025).
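Outcome accuracy with numeric matching, as listed above, can be sketched as follows; the tolerance and the fallback order are illustrative choices, and a fuller grader would also attempt symbolic equivalence (e.g., via a CAS) before declaring a mismatch:

```python
import math

def answers_match(pred: str, gold: str, rel_tol: float = 1e-6) -> bool:
    """Exact string match, falling back to numeric comparison with
    relative tolerance; non-numeric mismatches are counted wrong."""
    if pred.strip() == gold.strip():
        return True
    try:
        return math.isclose(float(pred), float(gold), rel_tol=rel_tol)
    except ValueError:
        return False

def outcome_accuracy(preds, golds) -> float:
    return sum(answers_match(p, g) for p, g in zip(preds, golds)) / len(golds)

# "4" matches "4.0" numerically; "0.5000001" is within tolerance of "0.5";
# "x+1" vs "x + 2" cannot be parsed as floats and counts as wrong
print(outcome_accuracy(["4", "0.5000001", "x+1"], ["4.0", "0.5", "x + 2"]))
```

Process-level metrics then replace this single end-of-trace check with a judgment at every annotated step.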
Frequently observed failure types include:
- Over-reliance on text or answer choices: Models “shortcut” using distributional biases in answer formatting or repeated question types (Liu et al., 6 Mar 2025).
- Positional and layout bias: Preference for certain answer positions, especially in image-choice tasks (Li et al., 7 Jun 2025).
- Spatial relation errors: Misunderstanding adjacency, overlap, or geometric relationships in diagrams (Gupta et al., 2024, Li et al., 7 Jun 2025).
6. Future Directions and Open Research Problems
Key avenues identified by recent works:
- Dataset-level innovations:
- Generation of adversarial pairs and multi-modal variants that force models to discriminate based on visual content (Liu et al., 6 Mar 2025, Zhou et al., 2024, Wang et al., 28 Nov 2025).
- Process-level annotation for step trace evaluation and error diagnosis (Sun et al., 2024, Zhang et al., 6 Aug 2025).
- Scaling to dynamic video-based and real-world scenarios to probe temporal and narrative mathematical reasoning (Rasheed et al., 5 Jun 2025).
- Modeling strategies:
- Neuro-symbolic fusion pipelines integrating vision modules for primitive/object detection and symbolic math engines for validation and inference (Zhou et al., 2024, Li et al., 7 Jun 2025).
- Explicit attention supervisions and graph-based reasoning architectures for geometric and combinatorial tasks (Liu et al., 6 Mar 2025, Wang et al., 16 Dec 2025).
- Visual grounding and chain-of-thought gating mechanisms deciding when, what, and how to invoke visual operations (Shi et al., 16 Oct 2025, Wang et al., 16 Dec 2025).
- Instruction and prompting advancements:
- Structured multi-step chain-of-thought templates and few-shot exemplars illustrating visual-text integration (Wang et al., 24 Apr 2025, Xiang et al., 2024).
- Prompting models to verbalize visual observations as an explicit stage before reasoning (Jia et al., 2024).
- Evaluation paradigm shifts:
- Balanced modality benchmarking (text-rich, vision-rich, interleaved) with fine-grained process metrics (Zhuang et al., 2024).
- Introduction of partial-credit and atomic-step stepwise accuracy measures (Sun et al., 2024, Xiang et al., 2024).
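The partial-credit, atomic-step measures mentioned above can be sketched in two common variants; the per-step correctness labels are assumed to come from an external step judge (human or PRM):

```python
from typing import List

def stepwise_accuracy(step_labels: List[bool]) -> float:
    """Partial credit: fraction of atomic steps judged correct,
    rather than all-or-nothing outcome scoring."""
    return sum(step_labels) / len(step_labels) if step_labels else 0.0

def prefix_accuracy(step_labels: List[bool]) -> float:
    """Stricter variant: credit only the longest correct prefix,
    since an early error typically invalidates later steps."""
    n = 0
    for ok in step_labels:
        if not ok:
            break
        n += 1
    return n / len(step_labels) if step_labels else 0.0

labels = [True, True, False, True]
print(stepwise_accuracy(labels), prefix_accuracy(labels))  # -> 0.75 0.5
```

The gap between the two variants is itself diagnostic: a large gap means errors cluster early in the trace, consistent with the first-step diagram-misinterpretation failures reported above.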
A plausible implication is that only through systematic measurement of vision dependence, adversarial control for text-leakage pathways, and explicit process-level supervision can progress toward human-level multimodal mathematical reasoning be realized. This suggests that benchmark and algorithm design must work in tandem, with future research focusing on both architectural alignment and challenge-driven evaluation.
References
- Liu et al., 6 Mar 2025
- Wang et al., 28 Nov 2025
- Sun et al., 2024
- Wang et al., 16 Dec 2025
- Wang et al., 24 Apr 2025
- Li et al., 7 Jun 2025
- Wang et al., 2024
- Wang et al., 28 Feb 2025
- Zhuang et al., 2024
- Jia et al., 2024
- Shi et al., 16 Oct 2025
- Shi et al., 3 Jul 2025
- Xiang et al., 2024
- Rasheed et al., 5 Jun 2025
- Zhou et al., 2024