CodePlot-CoT Paradigm
- CodePlot-CoT is a code-driven Chain-of-Thought paradigm that integrates executable plotting code to generate visual 'thoughts' for precise mathematical reasoning.
- It interleaves natural language and code generation, executing Python plotting code to produce graphical representations that inform subsequent inference.
- Empirical results on the Math-VR benchmark indicate significant gains in process scores and answer correctness, validating its practical effectiveness.
CodePlot-CoT is a code-driven Chain-of-Thought paradigm for mathematical visual reasoning in vision-language models (VLMs). Rather than relying solely on textual inference or pixel-based image generation, CodePlot-CoT integrates executable plotting code into the reasoning process, generating compact, precise graphical representations at each step. These rendered images act as intermediate "visual thoughts," informing further model reasoning and enabling the resolution of problems requiring geometric or functional visualization (Duan et al., 13 Oct 2025).
1. Motivation and Conceptual Foundations
Traditional LLMs perform mathematical reasoning via purely textual Chain-of-Thought (CoT), which becomes inadequate for visually grounded tasks such as inserting auxiliary lines, analyzing diagram features, or constructing function plots. Existing VLMs capable of generating images often lack geometric fidelity: pixel-level synthesis (diffusion models or auto-regressive visual-token approaches) is imprecise in angle, length, and colinearity, leading to errors in structured mathematical diagrams. The CodePlot-CoT paradigm addresses these issues by treating executable code (primarily Python with matplotlib) as the medium for generating intermediate images, thereby leveraging the syntactic rigor and expressive power of code for mathematical visualization. This shifts the difficulty from vision-centric image synthesis to code generation, an area where LLMs exhibit strong capabilities.
In the paradigm, each reasoning stage may prompt the model to emit plotting code, which is programmatically executed to produce images. These images are re-embedded and provided back to the model, serving as ground-truth visual hints for subsequent inference processes.
2. Architecture and Inference Workflow
The CodePlot-CoT inference process interleaves three key components:
- Text Reasoner: A VLM generates natural-language reasoning steps.
- Code Generator: The same VLM emits Python plotting code when visual assistance is required.
- Code Executor and Renderer: This component executes the code and returns a rendered PNG image.
These modules are orchestrated as follows:
```text
Question + (optional) Input Figure
        ↓
[VLM] → NATURAL LANGUAGE STEP_1
        ↓
If "visual needed" → CODE BLOCK_1 → [Executor] → IMAGE_1 → Vision Encoder Embedding
        ↓
[VLM] → NATURAL LANGUAGE STEP_2
        ↓
... iterate until final answer
```
Pseudocode outlining the multi-stage inference:
```text
Algorithm 1: CodePlot-CoT Inference
Input: Question Q, optional diagram I₀
context ← [Q, I₀]
for t = 1…T do
    token ← VLM.generate(context)
    if token starts a code delimiter then
        code_block ← VLM.generate_code(context)
        image_t ← EXECUTOR.render(code_block)
        context.append(code_block)
        context.append(image_t)
    else
        context.append(token)
    end
    if "<END_OF_REASONING>" in context then break
end
answer ← VLM.extract_answer(context)
return answer
```
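The Code Executor component can be realized as a small headless renderer. Below is a minimal Python sketch, not the authors' implementation; the function `render_plot_code`, the restricted `exec` namespace, and the PNG round-trip are illustrative assumptions:

```python
import contextlib
import io

import matplotlib
matplotlib.use('Agg')  # headless backend: render to a buffer, never a window
import matplotlib.pyplot as plt


def render_plot_code(code_block: str) -> bytes:
    """Execute a model-emitted matplotlib snippet and return PNG bytes.

    Hypothetical sketch: a production executor would add sandboxing,
    an import whitelist, and timeouts before running untrusted code.
    """
    plt.close('all')                 # start from a clean figure state
    namespace = {'plt': plt}         # restrict what the snippet sees
    with contextlib.redirect_stdout(io.StringIO()):
        exec(code_block, namespace)  # run the generated plotting code
    buf = io.BytesIO()
    plt.gcf().savefig(buf, format='png', dpi=100)
    return buf.getvalue()


# The returned bytes are what the pipeline would feed to the vision
# encoder so IMAGE_t can be re-embedded into the model context.
png_bytes = render_plot_code("plt.plot([0, 1], [0, 1], 'k-')\nplt.axis('equal')")
```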
Example plotting code generated for visual reasoning (circle of radius 2):
```python
import numpy as np
import matplotlib.pyplot as plt

theta = np.linspace(0, 2*np.pi, 200)
x = 2*np.cos(theta); y = 2*np.sin(theta)
plt.figure(figsize=(4, 4))
plt.plot(x, y, 'k-')
plt.axis('equal')
plt.axis('off')
plt.savefig('fig.png', dpi=100)
```
3. Math-VR Dataset: Construction and Statistics
The Math-VR dataset underpins CodePlot-CoT training and evaluation. Initially, 900K high-school mathematical problems featuring at least one image in the solution were collected. After filtering and standardization via GPT-4.1, Math-VR comprises 178,150 bilingual (English and Chinese) question–answer pairs, with a split of approximately 173K training and 5K testing instances.
Coverage by modality and domain includes:
- Modalities: 29% text-only; 71% multimodal (figures required in reasoning).
- Domains: Geometry (81%), Algebra (13%), Calculus (4%), Statistics (2%).
- Geometry Subtypes: Triangle, Circle, Quadrilateral, Area, Perimeter, etc.
Table: Key Dataset Statistics
| Metric | Minimum | Maximum | Average |
|---|---|---|---|
| Question length (tokens) | 9 | 602 | 144.2 |
| Solution length (tokens) | 46 | 2753 | 591.1 |
| Images per Question | - | 4 | 1.04 |
| Images per Solution | - | 7 | 1.24 |
Example Problem (Geometry):
- Q: “AB is tangent to ⊙O at B; extension of AO meets ⊙O at C. If ∠A=45° and AB=2, find AC.”
- Reasoning process includes generating a right-triangle plot and measuring relevant sides; a hypothetical record layout for such an entry is sketched below.
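For concreteness, one Math-VR record could be laid out as follows; the field names and file names are assumptions for illustration, not the published schema:

```python
# Hypothetical Math-VR record layout (all field and file names assumed).
record = {
    "question": ("AB is tangent to ⊙O at B; extension of AO meets ⊙O at C. "
                 "If ∠A=45° and AB=2, find AC."),
    "question_images": ["q_0001.png"],       # up to 4 images per question
    "solution": "...",                       # worked solution text (elided)
    "solution_images": ["s_0001_fig1.png"],  # up to 7 images per solution
    "language": "en",                        # bilingual corpus: en / zh
    "domain": "geometry",
    "split": "train",
}
```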
4. Image-to-Code Conversion: MatplotCode
MatplotCode is a specialized image-to-code converter that transforms existing mathematical figures (PNG/JPG) into executable plotting code. Its base dataset, ImgCode-8.6M (from MathCoder-VL), is filtered to ~1M high-fidelity geometric figures paired with Python code.
Training proceeds in two stages using Qwen2.5-VL-32B (a sketch of the stage-wise freezing pattern follows the architecture list below):
- Vision-Encoder Alignment: Train Vision Transformer (ViT) + MLP projector for one epoch, freezing the LM head.
- Full Fine-Tuning: Unfreeze and train all weights for two additional epochs.
Architecture components:
- Vision backbone: ViT
- Multimodal adapter: projects visual features to LM latent space
- LM head: autoregressive text+code generation
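A PyTorch-style sketch of the two-stage freezing pattern implied above; the toy model and its attribute names (`vit`, `projector`, `llm`) are illustrative assumptions, not the actual architecture code:

```python
import torch.nn as nn

class ToyVLM(nn.Module):
    """Stand-in with the three components named in the text."""
    def __init__(self):
        super().__init__()
        self.vit = nn.Linear(768, 768)         # stands in for the ViT backbone
        self.projector = nn.Linear(768, 4096)  # MLP adapter into LM latent space
        self.llm = nn.Linear(4096, 32000)      # stands in for the language model

def set_trainable(module: nn.Module, trainable: bool) -> None:
    for p in module.parameters():
        p.requires_grad = trainable

model = ToyVLM()

# Stage 1: vision-encoder alignment (1 epoch). Train ViT + projector,
# keep the language-model weights frozen.
set_trainable(model.vit, True)
set_trainable(model.projector, True)
set_trainable(model.llm, False)

# Stage 2: full fine-tuning (2 additional epochs). Unfreeze everything.
set_trainable(model, True)
```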
Training objective: cross-entropy loss over code tokens,

$$\mathcal{L}_{\text{code}} = -\sum_{t=1}^{T} \log p_\theta\left(c_t \mid c_{<t}, I\right),$$

where $c_1, \dots, c_T$ are the code tokens and $I$ is the input figure.
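A minimal PyTorch rendering of this objective, assuming per-position logits and a boolean mask marking code tokens (the names and shapes are illustrative):

```python
import torch
import torch.nn.functional as F

def code_token_loss(logits: torch.Tensor,
                    targets: torch.Tensor,
                    code_mask: torch.Tensor) -> torch.Tensor:
    """Cross-entropy summed over code tokens only.

    logits:    (seq_len, vocab) next-token predictions
    targets:   (seq_len,) ground-truth token ids
    code_mask: (seq_len,) bool, True where the target is a code token
    """
    per_token = F.cross_entropy(logits, targets, reduction='none')
    return per_token[code_mask].sum()

# Toy usage with random tensors.
logits = torch.randn(10, 100)
targets = torch.randint(0, 100, (10,))
code_mask = torch.tensor([False] * 4 + [True] * 6)
loss = code_token_loss(logits, targets, code_mask)
```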
5. Training Pipeline and Optimization Methodology
CodePlot-CoT is initialized from MatplotCode after its vision-alignment stage. A supervised fine-tuning (SFT) set is curated by applying MatplotCode to Math-VR images and filtering via GPT-4.1. Training proceeds for 5,000 steps:
- Batch size: 256
- Learning rate:
- Hardware: 32 × NVIDIA H200 GPUs, ~36 h runtime
Multi-task loss at each decoding step:

$$\mathcal{L} = \mathcal{L}_{\text{text}} + \lambda\,\mathcal{L}_{\text{code}},$$

where $\mathcal{L}_{\text{text}}$ and $\mathcal{L}_{\text{code}}$ are cross-entropy losses over natural-language and code tokens, respectively. The default $\lambda$ is 1.
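Continuing the sketch from Section 4, the combined objective splits the per-token loss by mask (again an assumption about how the masking is realized):

```python
import torch
import torch.nn.functional as F

def multitask_loss(logits: torch.Tensor,
                   targets: torch.Tensor,
                   code_mask: torch.Tensor,
                   lam: float = 1.0) -> torch.Tensor:
    """L = L_text + lam * L_code, split by the code-token mask."""
    per_token = F.cross_entropy(logits, targets, reduction='none')
    l_text = per_token[~code_mask].sum()
    l_code = per_token[code_mask].sum()
    return l_text + lam * l_code
```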
6. Evaluation and Empirical Performance
On 2,500 English Math-VR test questions, CodePlot-CoT (32B) demonstrates notable improvements versus its base model (Qwen2.5-VL-32B):
| Metric | Base Model | CodePlot-CoT | Δ |
|---|---|---|---|
| Process Score (PS) | 33.7 | 47.0 | +13.3 |
| Answer Correctness | 10.0% | 22.1% | +12.1pp |
A qualitative example: for the problem "isosceles triangle, perimeter 36 cm, altitude to the base 12 cm → cos B?", the system emits code to draw the triangle, executes the plot, augments its context with the rendered image, and then completes the algebraic reasoning, ultimately generating the answer $\cos B = 5/13$. CodePlot-CoT narrows the multimodal reasoning gap with Gemini-2.5-Pro by providing verifiable visual steps.
Example generated plotting code:
```python
import matplotlib.pyplot as plt
x = 13  # x solves x^2 = 12^2 + (18 - x)^2: the equal side length
A = (0, 0); B = (2 * (18 - x), 0); C = (18 - x, 12)  # base 10, apex over midpoint
plt.plot(*zip(A, B, C, A), 'k-'); plt.axis('equal'); plt.savefig('fig.png')
```
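For completeness, the algebra behind the comment in the snippet (a standard derivation, not quoted from the paper): with half-base $a$ and equal side $b$, the perimeter gives $2a + 2b = 36$, so $a + b = 18$; the altitude gives $b^2 = a^2 + 12^2$. Substituting $b = 18 - a$ yields $324 - 36a = 144$, hence $a = 5$, $b = 13$, and $\cos B = a/b = 5/13$.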
7. Limitations and Prospective Directions
MatplotCode's generated diagrams do not achieve perfect geometric fidelity—certain constructs such as midpoint positioning or colinearity may be slightly misaligned. CodePlot-CoT inherits these limitations, as seen when points drift off intended loci in particular figures.
Identified directions for future research:
- Scaling image-code training data
- Integrating geometric constraint feedback (e.g., enforcing colinearity explicitly via loss functions)
- Extending capabilities to 3D plots and symbolic code manipulation
A plausible implication is that further augmentations in fidelity and expanded dataset coverage may yield additional improvements in model reasoning accuracy and applicability.
Overall, CodePlot-CoT establishes a rigorous framework for "code-as-visual-thought" reasoning in mathematical problem solving, effectively bridging structured code synthesis and visual inference for multimodal learning and substantially improving solution correctness and reasoning process scores on the Math-VR benchmark (Duan et al., 13 Oct 2025).