Chain-of-Thought Reasoning in AI

Updated 28 December 2025
  • Chain-of-thought reasoning is a technique where models generate intermediate steps to decompose complex problems, enhancing transparency and structured decision-making.
  • It is applied across domains—including multimodal mathematical tasks—where integrating visual and textual cues is critical despite modest accuracy gains.
  • Recent innovations such as intrinsic visual CoT, process-level supervision, and token-level alignment strategies are driving improvements in stepwise reasoning performance.

Multimodal mathematical reasoning refers to the use of models and algorithms that jointly process and integrate visual, textual, and sometimes auditory modalities to achieve accurate, interpretable, and compositional mathematical reasoning. The field aims to close the gap between machine and human mathematical proficiency on problems where diagrams, natural language, algebraic expressions, and, increasingly, other modalities (e.g., spoken instructions or video) must be understood and reasoned over in combination. The domain is driven by the emergence of large vision-language models (VLMs/MLLMs) and systematic benchmarking, which reveal both the promise and the present inadequacy of current systems for truly vision-grounded mathematical reasoning.

1. Problem Formulation and Benchmark Paradigms

Multimodal mathematical reasoning is generally defined as the task of producing a mathematically valid answer $a$ to a question $(Q, \{I_i\})$, where $Q$ is a natural-language (and possibly symbolic) problem statement and $\{I_i\}$ is a set of one or more associated visual signals (diagrams, graphs, photos, video frames). This is formalized as a function $R: (Q, \{I_i\}) \longrightarrow A$. Benchmarks are designed to span an array of mathematical topics (geometry, algebra, graph theory, statistics, logic) and tasks that require deductive, spatial, and quantitative reasoning.
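The formalization above can be sketched as a minimal evaluation interface. This is an illustrative sketch: the type and field names (`MathProblem`, `Reasoner`, `accuracy`) are assumptions for exposition, not taken from any cited benchmark.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass(frozen=True)
class MathProblem:
    """One item (Q, {I_i}) with its gold answer a. Field names are illustrative."""
    question: str                 # Q: natural-language / symbolic statement
    images: Tuple[bytes, ...] = ()  # {I_i}: associated visual signals
    answer: str = ""              # gold answer a in A

# The reasoner R: (Q, {I_i}) -> A, e.g. a wrapped VLM call.
Reasoner = Callable[[MathProblem], str]

def accuracy(reasoner: Reasoner, problems: List[MathProblem]) -> float:
    """Outcome accuracy: fraction of items whose predicted answer
    exactly matches the gold answer (string match for simplicity)."""
    hits = sum(reasoner(p).strip() == p.answer.strip() for p in problems)
    return hits / len(problems)
```

A real harness would replace exact string matching with symbolic/numeric equivalence checking, as the benchmarks discussed below do.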

Benchmarking paradigms vary, but a key distinction is made between "knowledge-centric" benchmarks (emphasizing domain content) and "perceptual-reliant" benchmarks (explicitly requiring visual integration for correctness).

2. Dataset Structures and Modalities

Modern benchmarks are constructed to maximize mathematical, visual, and contextual diversity:

  • Visual context types: Vector and raster diagrams, hand-drawn sketches, photo captures, scanned worksheets, chart images, and video frames (Wang et al., 2024, Rasheed et al., 5 Jun 2025).
  • Task types: Multiple-choice, fill-in-the-blank, open-ended generation, proof construction, chain-of-thought (CoT) annotation.
  • Granularity of annotation: Many datasets provide step-by-step solutions, allowing for both outcome and process-based evaluation (Sun et al., 2024, Zhang et al., 6 Aug 2025, Xiang et al., 2024).
  • Examples:
    • VCBench: 1,720 elementary-level problems, each with 2–18 images, six cognitive domains (calendar, spatial, geometric, etc.), emphasizing explicit cross-image dependencies (Wang et al., 24 Apr 2025).
    • MathSight: Each university-level problem is rendered as original, hand-drawn, photo-captured, and text-only variants, enabling controlled studies of visual robustness (Wang et al., 28 Nov 2025).
    • MM-MATH: 5,929 open-ended geometry problems with step-level error categorization and outcome/process scoring (Sun et al., 2024).
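The step-level annotations these datasets provide can be pictured with a small sketch. The record layout, field names, and error labels below are hypothetical, merely mirroring the kind of process-level format datasets such as MM-MATH describe; they are not the actual schema.

```python
from collections import Counter

# Hypothetical step-annotated reasoning traces (schema is illustrative).
records = [
    {"steps": [
        {"text": "Misreads M as lying on AB.", "error": "diagram_misinterpretation"},
        {"text": "Concludes BM = AM.", "error": "logic"},
    ]},
    {"steps": [
        {"text": "BM = MC since M is the midpoint of BC.", "error": None},
    ]},
]

def first_step_errors(records) -> Counter:
    """Tally the error type of each trace's first step -- the kind of
    process-based analysis that step annotations make possible."""
    return Counter(r["steps"][0]["error"] for r in records)
```

Tallies like this underlie the first-step failure statistics reported in Section 3.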

A summary table of representative dataset properties follows:

| Benchmark | Scale | Visual Type(s) | Domain Diversity | Stepwise Annotations |
| --- | --- | --- | --- | --- |
| VCBench (Wang et al., 24 Apr 2025) | 1,720 | Multi-image, photo | Broad (6 domains) | No |
| HC-M3D (Liu et al., 6 Mar 2025) | 1,851 | Diagram, controlled variants | Geometry, logic | Yes (pairwise) |
| MATH-V (Wang et al., 2024) | 3,040 | Contest diagrams | 16 subjects | Yes |
| MathSight (Wang et al., 28 Nov 2025) | 661+1,387 | 3 image variants + text | University level | Yes (confidence) |
| MV-MATH (Wang et al., 28 Feb 2025) | 2,009 | 2–8 interleaved images | K–12, 11 subjects | Yes |
| VideoMathQA (Rasheed et al., 5 Jun 2025) | 420 | Video + audio + text | 10 domains | Yes (multi-step) |

3. Empirical Findings and Core Challenges

Extensive experiments reveal the following empirical phenomena:

Dominance of Textual Cues over Visual Information:

Across benchmarks, the impact of visual input on accuracy is minimal except in highly controlled settings. For instance, in MathSight, Qwen3-VL outperforms even GPT-5 when all images are withheld—indicating models often solve multimodal math problems primarily through their linguistic priors (Wang et al., 28 Nov 2025). In HC-M3D, shuffling or masking diagrams during training causes only a 0–4 percentage point drop, compared to 20–60 points on general VQA tasks (Liu et al., 6 Mar 2025). VCBench shows that collapsing all images onto a single canvas boosts model accuracy by 42%, since existing architectures better exploit layout salience than true compositional integration (Wang et al., 24 Apr 2025).
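The image-masking ablations described above can be sketched as a small harness that compares accuracy with intact versus withheld images. This is an assumed setup for exposition (triple-based problem records, a `reasoner(question, images)` callable), not any benchmark's official protocol.

```python
def ablation_drop(reasoner, problems):
    """Accuracy with images intact vs. images withheld.

    `problems` is a list of (question, images, gold_answer) triples and
    `reasoner(question, images)` returns an answer string. A small drop
    between the two conditions suggests the model is leaning on textual
    priors rather than genuine visual grounding.
    """
    def acc(mask_images: bool) -> float:
        hits = sum(
            reasoner(q, [] if mask_images else imgs) == gold
            for q, imgs, gold in problems
        )
        return hits / len(problems)

    full, masked = acc(False), acc(True)
    return full, masked, full - masked
```

For a model that ignores its images entirely, the measured drop is zero, which is the pathological pattern HC-M3D and MathSight document.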

Insufficient Visual Granularity:

Current vision backbones (e.g., CLIP, ViT) do not reliably distinguish subtle geometric modifications (e.g., point swaps, line moves) necessary for fine-grained mathematical inference (Liu et al., 6 Mar 2025, Li et al., 7 Jun 2025). In VisioMath, model accuracy falls rapidly when answer options are visually similar, exposing the inability to resolve small but semantically crucial visual distinctions (Li et al., 7 Jun 2025).

Process-wise Bottlenecks in Diagram Comprehension:

Error analysis emphasizes that the majority of first-step failures arise in diagram misinterpretation (e.g., mislocating a midpoint, confusing parallel lines), accounting for over 60% of errors in MM-MATH and similar proportions in other process-level studies (Sun et al., 2024).

Deterioration on Problem Complexity and Visual Degradation:

The effect of visual input decreases as task complexity rises. On MathSight, as difficulty moves from undergraduate to graduate level, models increasingly ignore images, and text-only accuracy surpasses image-assisted performance (Wang et al., 28 Nov 2025). When diagrams are degraded from typeset to hand-drawn or photo, accuracy drops even further.

Limited Gains from Enhanced Visual Encoders and CoT Prompting:

Stacking multiple vision encoders (e.g., CLIP + DINO + SIGLIP) improves generic VQA but has negligible or negative effect on math accuracy (Liu et al., 6 Mar 2025). Chain-of-thought prompting leads to small, inconsistent improvements on complex multimodal tasks but does not yield the stepwise gains observed in textual domains (Wang et al., 2024, Wang et al., 28 Nov 2025).

4. Architectural Innovations and Training Methodologies

Recent methodological advances target both architectural and data-centric obstacles:

  • Reason Chunking and Critical Reasoning Units (CRUs): ViRC segments reasoning into intermediate propositions, switching visual context only at chunk boundaries. This chunked approach, supported by the CRUX dataset, mimics human visual reasoning and demonstrably increases accuracy and generalization (Wang et al., 16 Dec 2025).
  • Progressive Multimodal Alignment: Math-PUMA enforces token-level alignment between vision-rich and text-rich modalities using Kullback-Leibler divergence on next-token distributions, eliminating the "accuracy pyramid" in which performance degrades from text-only to vision-only inputs. A three-stage regime—textual bootstrapping, KL-based alignment, then multimodal instruction tuning—achieves balanced performance across modality variants (Zhuang et al., 2024).
  • Intrinsic Visual Chain-of-Thought (VCoT): MathCanvas enables end-to-end visual reasoning within a unified LMM via explicit diagram generation/editing at each deduction step, outperforming prior visual CoT methods on interleaved visual-text benchmarks (Shi et al., 16 Oct 2025).
  • Generative Step-level Critique and Correction (GM-PRM): GM-PRM moves beyond binary step verification, training a model to interpret, critique, and correct each reasoning step, producing refined outputs that combine interpretability with data-efficient accuracy gains (Zhang et al., 6 Aug 2025).
  • CoT Diversity Supervision and Reinforcement Learning: Qwen-VL-DP models, trained on MathV-DP's diverse solution trajectories with GRPO RL, learn to represent and discriminate among multiple valid mathematical strategies, resulting in improved accuracy and effective semantic diversity (Shi et al., 3 Jul 2025).
  • Atomic Step “Slow Thinking”: AtomThink decomposes reasoning into atomic minimal inferences, supervised with PRM-guided search, achieving substantial gains on both MathVista and MathVerse (Xiang et al., 2024).
  • Describe-then-Reason Training: The VCAR pipeline decouples visual description (comprehension) from mathematical reasoning, boosting performance especially on problems demanding precise figure understanding (Jia et al., 2024).
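Of the methods above, the KL-based alignment objective is the easiest to make concrete. The sketch below computes a mean token-level KL divergence between next-token distributions from two modality variants; it is a pure-Python illustration under assumed names (`kl_divergence`, `alignment_loss`), not Math-PUMA's actual implementation, which operates on model logits with a specific alignment direction.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for two discrete next-token distributions
    (lists of probabilities over the same vocabulary)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def alignment_loss(text_dists, vision_dists):
    """Mean token-level KL between the text-input and vision-input
    variants' next-token distributions: the loss is zero only when
    the two modalities predict identically at every position."""
    kls = [kl_divergence(t, v) for t, v in zip(text_dists, vision_dists)]
    return sum(kls) / len(kls)
```

Driving this loss toward zero is what removes the text-over-vision accuracy gap that the alignment stage targets.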

5. Evaluation Metrics, Process Evaluation, and Failure Taxonomy

Process-level and outcome-level evaluation are now standard:

  • Outcome accuracy: Fraction of items for which the final answer is exactly correct (with symbolic/numeric matching as appropriate).
  • Stepwise/process evaluation: Models' reasoning traces are compared to annotated chains, with failures classified into diagram misinterpretation, logic slips, calculation errors, and text misreading (Sun et al., 2024).
  • Visual reliance metrics: $R_v = \frac{\mathrm{Acc}(I,T) - \mathrm{Acc}(T)}{\mathrm{Acc}(I,T)} \times 100\%$ (Liu et al., 6 Mar 2025).
  • Diversity and alignment metrics: Semantic diversity of generated solutions, alignment of internal logits across modalities (Shi et al., 3 Jul 2025, Zhuang et al., 2024).
  • Human-vs-model gap: Even the best models cap at 50–70% on easier K–12 benchmarks and often remain below 25–35% on university-level, multi-image, or video-based settings, compared to human performance of 76–93% (Wang et al., 28 Feb 2025, Wang et al., 24 Apr 2025, Gupta et al., 2024, Wang et al., 2024, Rasheed et al., 5 Jun 2025).
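The visual-reliance metric is simple to compute from the two accuracy figures. A direct transcription of the formula (function name assumed):

```python
def visual_reliance(acc_image_text: float, acc_text_only: float) -> float:
    """R_v = (Acc(I,T) - Acc(T)) / Acc(I,T) * 100.

    Near-zero values mean images contribute almost nothing beyond the
    text, the pattern reported for current models on math benchmarks."""
    return (acc_image_text - acc_text_only) / acc_image_text * 100.0
```

For example, image+text accuracy of 0.50 against text-only accuracy of 0.40 yields a visual reliance of 20%.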

Frequently observed failure types include:

  • Over-reliance on text or answer choices: Models “shortcut” using distributional biases in answer formatting or repeated question types (Liu et al., 6 Mar 2025).
  • Positional and layout bias: Preference for certain answer positions, especially in image-choice tasks (Li et al., 7 Jun 2025).
  • Spatial relation errors: Misunderstanding adjacency, overlap, or geometric relationships in diagrams (Gupta et al., 2024, Li et al., 7 Jun 2025).

6. Future Directions and Open Research Problems

Key avenues identified by recent works converge on a common implication: only through systematic measurement of vision dependence, adversarial control of text-leakage pathways, and explicit process-level supervision can progress toward human-level multimodal mathematical reasoning be realized. This suggests that benchmark and algorithm design must work in tandem, with future research focusing on both architectural alignment and challenge-driven evaluation.
