
CodePlot-CoT Paradigm

Updated 9 December 2025
  • CodePlot-CoT is a code-driven Chain-of-Thought paradigm that integrates executable plotting code to generate visual 'thoughts' for precise mathematical reasoning.
  • It interleaves natural language and code generation, executing Python plotting code to produce graphical representations that inform subsequent inference.
  • Empirical results on the Math-VR benchmark indicate significant gains in process scores and answer correctness, validating its practical effectiveness.

CodePlot-CoT is a code-driven Chain-of-Thought paradigm for mathematical visual reasoning in large vision–language models (VLMs). Rather than relying solely on textual inference or pixel-based image generation, CodePlot-CoT integrates executable plotting code into the reasoning process, generating compact, precise graphical representations at each step. These rendered images act as intermediate "visual thoughts," informing further model reasoning and enabling the resolution of problems that require geometric or functional visualization (Duan et al., 13 Oct 2025).

1. Motivation and Conceptual Foundations

Traditional LLMs perform mathematical reasoning via purely textual Chain-of-Thought (CoT), which becomes inadequate for visually grounded tasks such as inserting auxiliary lines, analyzing diagram features, or constructing plots of y = f(x). Existing multimodal VLMs capable of generating images often lack geometric fidelity: pixel-level synthesis (diffusion models or autoregressive visual-token approaches) is imprecise in angle, length, and collinearity, leading to errors in structured mathematical diagrams. The CodePlot-CoT paradigm addresses these issues by treating executable code (primarily Python with matplotlib) as the medium for generating intermediate images, thereby leveraging the syntactic rigor and expressive power of code for mathematical visualization. This shifts the difficulty from vision-centric image synthesis to code generation, an area where LLMs exhibit strong capabilities.

In the paradigm, each reasoning stage may prompt the model to emit plotting code, which is programmatically executed to produce images. These images are re-embedded and provided back to the model, serving as ground-truth visual hints for subsequent inference processes.

2. Architecture and Inference Workflow

The CodePlot-CoT inference process interleaves three key components:

  • Text Reasoner: A VLM generates natural-language reasoning steps.
  • Code Generator: The same VLM emits Python plotting code when visual assistance is required.
  • Code Executor and Renderer: This component executes the code and returns a rendered PNG image.

These modules are orchestrated as follows:

Question + (optional) Input Figure
    ↓
[VLM] → NATURAL LANGUAGE STEP_1
    ↓
If “visual needed” → CODE BLOCK_1 → [Executor] → IMAGE_1 → Vision Encoder Embedding
    ↓
[VLM] → NATURAL LANGUAGE STEP_2
    ↓
... iterate until final answer

Pseudocode outlining the multi-stage inference:

Algorithm 1: CodePlot-CoT Inference
Input: question Q, optional diagram I₀
context ← [Q, I₀]
for t = 1…T do
  token ← VLM.generate(context)
  if token is a code-block delimiter then
    code_block ← VLM.generate_code(context)
    image_t ← EXECUTOR.render(code_block)
    context.append(image_t)
  else
    context.append(token)
  end if
  if <END_OF_REASONING> ∈ context then break
end for
answer ← VLM.extract_answer(context)
return answer
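
The paper does not detail the executor component; a minimal sketch of the Code Executor and Renderer, assuming the generated code saves its figure to fig.png (as the example below does), could look like:

import pathlib
import subprocess
import sys
import tempfile

def render_plot(code_block: str, timeout: float = 10.0) -> bytes:
    """Execute model-emitted plotting code in a subprocess; return the PNG bytes."""
    workdir = pathlib.Path(tempfile.mkdtemp())
    script = workdir / "plot.py"
    # Force a headless backend before the model's own imports run.
    script.write_text("import matplotlib; matplotlib.use('Agg')\n" + code_block)
    subprocess.run([sys.executable, str(script)], cwd=workdir,
                   timeout=timeout, check=True)
    return (workdir / "fig.png").read_bytes()   # assumed output filename

Running the generated code in a separate process also isolates the main inference loop from crashes or hangs in model-written code.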

Example plotting code generated for visual reasoning (circle of radius 2):

import numpy as np
import matplotlib.pyplot as plt

# Parametrize a circle of radius 2 centered at the origin.
theta = np.linspace(0, 2*np.pi, 200)
x = 2*np.cos(theta); y = 2*np.sin(theta)

plt.figure(figsize=(4, 4))
plt.plot(x, y, 'k-')
plt.axis('equal')  # equal aspect ratio so the circle is not distorted
plt.axis('off')    # hide axes; only the figure itself is the visual thought
plt.savefig('fig.png', dpi=100)

3. Math-VR Dataset: Construction and Statistics

The Math-VR dataset underpins CodePlot-CoT training and evaluation. Initially, 900K high-school mathematical problems featuring at least one image in the solution were collected. After filtering and standardization via GPT-4.1, Math-VR comprises 178,150 bilingual (English and Chinese) question–answer pairs, with a split of approximately 173K training and 5K testing instances.

Coverage by modality and domain includes:

  • Modalities: 29% text-only; 71% multimodal (figures required in reasoning).
  • Domains: Geometry (81%), Algebra (13%), Calculus (4%), Statistics (2%).
  • Geometry Subtypes: Triangle, Circle, Quadrilateral, Area, Perimeter, etc.

Table: Key Dataset Statistics

Metric                     Minimum   Maximum   Average
Question length (tokens)         9       602     144.2
Solution length (tokens)        46      2753     591.1
Images per Question              -         4      1.04
Images per Solution              -         7      1.24

Example Problem (Geometry):

  • Q: “AB is tangent to ⊙O at B; extension of AO meets ⊙O at C. If ∠A=45° and AB=2, find AC.”
  • Reasoning process includes generating a right-triangle plot and measuring relevant sides.
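
One consistent solution path for this example (reconstructed here; the dataset's reference solution may differ): since AB is tangent at B, OB ⊥ AB, so triangle OAB is right-angled at B; with ∠A = 45° it is isosceles, giving OB = AB = 2, OA = 2√2, and hence AC = OA + OC = 2√2 + 2.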

4. Image-to-Code Conversion: MatplotCode

MatplotCode is a specialized image-to-code converter that transforms existing mathematical figures (PNG/JPG) into executable plotting code. Its base dataset, ImgCode-8.6M (MathCoder-VL), is filtered to ~1M high-fidelity geometric figures paired with Python code.

Training proceeds in two stages using Qwen2.5VL-32B:

  1. Vision-Encoder Alignment: Train Vision Transformer (ViT) + MLP projector for one epoch, freezing the LM head.
  2. Full Fine-Tuning: Unfreeze and train all weights for two additional epochs.
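
In PyTorch terms, this staged schedule amounts to toggling requires_grad flags; the attribute names below (vision_tower, projector) are illustrative placeholders, not the actual Qwen2.5VL module names:

def set_stage(model, stage: int) -> None:
    """Stage 1: train only the ViT and the MLP projector; Stage 2: train everything."""
    for p in model.parameters():
        p.requires_grad = (stage == 2)   # Stage 2: all weights trainable
    if stage == 1:
        # Unfreeze only the vision backbone and the multimodal adapter.
        for module in (model.vision_tower, model.projector):
            for p in module.parameters():
                p.requires_grad = True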

Architecture components:

  • Vision backbone: ViT
  • Multimodal adapter: projects visual features to LM latent space
  • LM head: autoregressive text+code generation

Training objective—cross-entropy loss over code tokens:

\mathcal{L}_{\text{code}} = -\sum_{i=1}^{N} \log p(c_i \mid c_{<i},\, \mathrm{img})

where c_i denotes the i-th code token and img the input figure.

5. Training Pipeline and Optimization Methodology

CodePlot-CoT is initialized from MatplotCode after vision alignment. A supervised fine-tuning (SFT) set is curated by applying MatplotCode to Math-VR images and filtering with GPT-4.1. Training proceeds for 5,000 steps:

  • Batch size: 256
  • Learning rate: 3 × 10⁻⁵
  • Hardware: 32 × NVIDIA H200 GPUs, ~36 h runtime

Multi-task loss at each decoding step:

\mathcal{L} = \mathcal{L}_{\text{text}} + \lambda\, \mathcal{L}_{\text{code}}

where

\mathcal{L}_{\text{text}} = -\sum_{t} \log p_{\theta}(w_t \mid w_{<t},\, \mathrm{ctx}), \qquad \mathcal{L}_{\text{code}} = -\sum_{k} \log p_{\theta}(c_k \mid c_{<k},\, \mathrm{ctx})

The default value of λ is 1.
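
As a concrete illustration (a sketch, not the paper's implementation), the combined objective can be computed from per-token cross-entropy using a boolean mask that marks code tokens; the mask and tensor shapes below are assumptions:

import torch
import torch.nn.functional as F

def multitask_loss(logits: torch.Tensor, targets: torch.Tensor,
                   is_code: torch.Tensor, lam: float = 1.0) -> torch.Tensor:
    """L = L_text + lam * L_code, each term a summed token cross-entropy.
    logits: (T, V); targets: (T,); is_code: (T,) bool, True on code tokens."""
    per_token = F.cross_entropy(logits, targets, reduction="none")  # (T,)
    l_text = per_token[~is_code].sum()
    l_code = per_token[is_code].sum()
    return l_text + lam * l_code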

6. Evaluation and Empirical Performance

On 2,500 English Math-VR test questions, CodePlot-CoT (32B) demonstrates notable improvements versus its base model (Qwen2.5-VL-32B):

Metric                  Base Model   CodePlot-CoT        Δ
Process Score (PS)            33.7           47.0    +13.3
Answer Correctness           10.0%          22.1%  +12.1 pp

A qualitative example: for the problem "isosceles triangle with perimeter 36 cm and altitude to the base 12 cm → cos B?", the system emits code to draw the triangle, executes the plot, augments its context with the rendered image, and then completes the algebraic reasoning, ultimately producing the answer 5/13. CodePlot-CoT narrows the multimodal reasoning gap with Gemini-2.5-Pro by providing verifiable visual steps.

Example generated plotting code:

import matplotlib.pyplot as plt
# Leg length x solves x^2 = 12^2 + (18 - x)^2, giving x = 13 and base 36 - 2x = 10.
A, B, C = (0, 0), (10, 0), (5, 12)   # base AB on the x-axis, apex C above its midpoint
plt.plot(*zip(A, B, C, A)); plt.axis('equal'); plt.savefig('fig.png')

7. Limitations and Prospective Directions

MatplotCode's generated diagrams do not achieve perfect geometric fidelity; constructs such as midpoint positioning or collinearity may be slightly misaligned. CodePlot-CoT inherits these limitations, as seen when points drift off their intended loci in particular figures.

Identified future research:

  • Scaling image-code training data
  • Integrating geometric constraint feedback, e.g., enforcing collinearity explicitly via loss functions (see the sketch after this list)
  • Extending capabilities to 3D plots and symbolic code manipulation
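
As a hypothetical sketch of such constraint feedback (not from the paper), a collinearity penalty over three plotted points can be built from the cross product of the edge vectors:

import numpy as np

def collinearity_penalty(p1, p2, p3) -> float:
    """Zero iff the three 2-D points are collinear; grows with deviation."""
    v1 = np.asarray(p2, dtype=float) - np.asarray(p1, dtype=float)
    v2 = np.asarray(p3, dtype=float) - np.asarray(p1, dtype=float)
    cross = v1[0] * v2[1] - v1[1] * v2[0]   # signed parallelogram area
    return float(cross ** 2)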

A plausible implication is that improvements in geometric fidelity and expanded dataset coverage may yield further gains in reasoning accuracy and applicability.

Overall, CodePlot-CoT establishes a rigorous framework for "code-as-visual-thought" reasoning in mathematical problem solving, effectively bridging structured code synthesis and visual inference for multimodal learning and substantially improving solution correctness and reasoning process scores on the Math-VR benchmark (Duan et al., 13 Oct 2025).
