
CodePlot-CoT: Code-Driven Visual Math Reasoning

Updated 14 October 2025
  • CodePlot-CoT is a code-driven chain-of-thought paradigm that integrates text, executable plotting code, and precise visual outputs for mathematical reasoning.
  • It interleaves natural language, mathematical formulas, and plotting code to generate interpretable visual figures, addressing limitations of text-only and direct image-generation models.
  • The approach enables fine-grained control and robust error diagnosis, and demonstrates up to a 21% gain in problem-solving accuracy over baseline methods.

CodePlot-CoT is a code-driven chain-of-thought paradigm for mathematical visual reasoning that enables large language models (LLMs) and vision-language models (VLMs) to solve problems by interleaving text, precise plotting code, and visual outputs. This approach supplements traditional text-based reasoning with executable code blocks that render formal geometric constructions or function plots, tightly integrating structured visual representations into the multimodal reasoning pipeline. CodePlot-CoT provides fine-grained control and verification over the visual reasoning process, addressing limitations of both purely text-based and direct image-generation models on complex mathematical problems that require manipulating or interpreting visual elements.

1. CodePlot-CoT Paradigm and Motivation

CodePlot-CoT was developed to address persistent bottlenecks in mathematical problem-solving with LLMs and VLMs, particularly those problems demanding formal visual reasoning—such as geometry (auxiliary lines, loci), calculus (function plots), and other domains requiring explicit visualizations. Text-only chain-of-thought (CoT) reasoning, while effective for symbolic and logical steps, cannot represent geometric or graphical relationships with adequate precision. Existing multimodal models that synthesize images in the reasoning chain often lack the controllability and semantic alignment necessary for formal mathematics, frequently generating imprecise figures incompatible with the stepwise logic of mathematics.

The central principle of CodePlot-CoT is to augment the reasoning chain by allowing the model to emit executable plotting code at critical stages. These code blocks are immediately rendered into precise, interpretable figures that form part of the model's "visual thoughts." Subsequent reasoning steps can then condition on both the rendered image and the prior chain, enabling the system to iteratively "think with images" in a controlled, verifiable manner.

2. Methodology and System Architecture

The CodePlot-CoT architecture is realized via an interleaved natural-language and code-generation pipeline. At each reasoning step, the model may output (i) natural language explanations, (ii) mathematical formulas (typically in LaTeX), and/or (iii) plotting code (e.g., Python with matplotlib). When visual assistance is needed, a code block is synthesized and executed, and the resulting image is fed back as part of the context for downstream reasoning.

A representative methodology includes the following steps (a minimal code sketch follows the list):

  • Given a mathematical problem, the system first parses the input and initiates a chain-of-thought generation.
  • At decision points requiring visualization (e.g., after constructing geometric elements or before analytical steps), the LLM emits a code snippet for plotting.
  • The code is executed in a controlled environment, producing a figure that is appended (as tokenized pixel data or an object reference) to the reasoning context.
  • Subsequent steps of the CoT may reference either the textual history, the plotted figure, or both.
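
The loop above can be made concrete in a short Python sketch. This is a minimal illustration under assumed interfaces: `model.generate_step` and the step dictionaries are hypothetical names, as the paper does not prescribe an exact API; the matplotlib execution path is one straightforward realization.

    import io
    import matplotlib
    matplotlib.use("Agg")  # render headlessly
    import matplotlib.pyplot as plt

    def execute_plotting_code(code: str) -> bytes:
        """Run a model-emitted matplotlib snippet and return PNG bytes.
        In practice this would run inside a sandboxed environment."""
        plt.figure()
        exec(code, {"plt": plt})
        buf = io.BytesIO()
        plt.savefig(buf, format="png")
        plt.close()
        return buf.getvalue()

    def codeplot_cot(model, problem: str, max_steps: int = 16) -> list:
        """Interleaved reasoning loop: text and formula steps accumulate in
        the context; code steps are executed, and the rendered figure is fed
        back so later steps can condition on it."""
        context = [{"type": "text", "content": problem}]
        for _ in range(max_steps):
            step = model.generate_step(context)  # hypothetical interface
            context.append(step)
            if step["type"] == "code":
                image = execute_plotting_code(step["content"])
                context.append({"type": "image", "content": image})
            if step.get("final"):
                break
        return context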

An example chain could include:

  1. [Text] "Let $AD$ and $BC$ be the two segments to compare; we draw auxiliary line $A'B$."
  2. [Code] (see the illustrative sketch after this list)
    plt.plot(...)
  3. [Image] [Rendered figure]
  4. [Text] "From the construction in the previous image, $AB + BC + CD \geq A'B + BC + CD'$, and $A'B + BC + CD' \geq A'D'$."
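
For illustration, the elided code block in step 2 might resemble the following matplotlib snippet. The coordinates and the exact construction are invented for this sketch; they are not taken from the paper.

    import matplotlib.pyplot as plt

    # Invented coordinates for an illustrative auxiliary-line construction.
    A, B, C, D = (0, 0), (2, 1), (4, 1), (6, 0)
    A_prime, D_prime = (0.5, -0.5), (5.5, -0.5)  # assumed auxiliary points

    xs, ys = zip(A, B, C, D)
    plt.plot(xs, ys, "o-", label="path A-B-C-D")             # segments to compare
    plt.plot(*zip(A_prime, B), "--", label="auxiliary A'B")  # auxiliary line

    for name, p in [("A", A), ("B", B), ("C", C), ("D", D),
                    ("A'", A_prime), ("D'", D_prime)]:
        plt.annotate(name, p)
    plt.axis("equal")
    plt.legend()
    plt.show()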

This bidirectional melding of code and image within the reasoning loop is a defining feature of CodePlot-CoT.

3. Math-VR Dataset and Benchmark

A crucial enabler for CodePlot-CoT training and benchmarking is the Math-VR dataset, which is the first large-scale, bilingual resource targeting mathematics problems requiring visual reasoning. It comprises 178,150 carefully annotated question–solution pairs in English and Chinese, with approximately 71% of samples necessitating both textual and visual reasoning. The dataset covers a wide range of mathematical domains, with a concentration on geometry (81% of samples), but also spans algebra, calculus, and statistics.

Each sample provides not just the textual prompt and answer, but also the exact plotting code and its rendered figure at appropriate CoT steps. This structure allows for precise supervision, robust automatic evaluation, and supports the development of models capable of end-to-end code-driven visual reasoning.
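
Concretely, a single Math-VR sample could be organized along the following lines. This is a hypothetical schema for illustration; the field names are assumptions, not the dataset's published format.

    # Hypothetical shape of one Math-VR sample; field names are assumed.
    sample = {
        "language": "en",       # the dataset is bilingual (English/Chinese)
        "domain": "geometry",   # geometry dominates (~81% of samples)
        "question": "Compare the lengths of AD and BC ...",
        "solution_steps": [
            {"type": "text",  "content": "Draw the auxiliary line A'B."},
            {"type": "code",  "content": "plt.plot(...)"},      # exact plotting code
            {"type": "image", "content": "figures/step_2.png"},  # rendered figure
            {"type": "text",  "content": "Hence A'B + BC + CD' >= A'D'."},
        ],
        "answer": "AD >= BC",
        "needs_visual_reasoning": True,  # true for ~71% of samples
    }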

4. Image-to-Code Converter: MatPlotCode

To support both dataset creation and model training, CodePlot-CoT introduces MatPlotCode, a high-fidelity, domain-specialized image-to-code converter. Its core function is to parse complex mathematical figures—whether geometric diagrams or function plots—and recover the underlying plotting code. This converter is essential for:

  • Curating clean, code-aligned visual reasoning data automatically and at scale.
  • Guaranteeing round-trip consistency between code and image—a requirement for verifiable, executable visual thoughts (a consistency check is sketched after this list).
  • Enabling effective learning of the mapping between images and symbolic code in the model.
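
A round-trip consistency check of the kind this requirement implies could look like the following. Here `image_to_code` is a stand-in for MatPlotCode's actual interface, which this summary does not specify, and mean pixel difference is just one simple comparison criterion.

    import io
    import numpy as np
    import matplotlib
    matplotlib.use("Agg")
    import matplotlib.pyplot as plt
    from PIL import Image

    def render(code: str) -> np.ndarray:
        """Execute a plotting snippet and return the figure as an RGB array."""
        plt.figure(figsize=(4, 4), dpi=100)
        exec(code, {"plt": plt})
        buf = io.BytesIO()
        plt.savefig(buf, format="png")
        plt.close()
        buf.seek(0)
        return np.asarray(Image.open(buf).convert("RGB"), dtype=np.float32)

    def round_trip_consistent(code: str, image_to_code, tol: float = 5.0) -> bool:
        """Render the reference code, convert the image back to code with the
        converter (image_to_code stands in for MatPlotCode), re-render, and
        compare the mean absolute pixel difference against a tolerance."""
        original = render(code)
        recovered = render(image_to_code(original))
        return float(np.abs(original - recovered).mean()) <= tol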

The accuracy of MatPlotCode directly affects the alignment and quality of multimodal reasoning in CodePlot-CoT; the paper notes continued work is needed to achieve perfect fidelity.

5. Experimental Evaluation and Performance

Comprehensive experiments using the Math-VR benchmark demonstrate that CodePlot-CoT delivers up to a 21% improvement in answer correctness over baseline models. Performance is measured both by final answer accuracy (AC) and process score (PS), which quantifies the correctness of intermediate reasoning steps. Comparisons include advanced vision-language and unified multimodal systems (e.g., Gemini-2.5-Pro), with CodePlot-CoT consistently outperforming both text-only and direct image-generation approaches on tasks necessitating formal visual reasoning.
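
As a plausible formalization (an assumption here; this summary does not reproduce the paper's exact definitions), the two metrics over $N$ problems can be written as

    \mathrm{AC} = \frac{1}{N} \sum_{i=1}^{N} \mathbf{1}\left[\hat{a}_i = a_i\right],
    \qquad
    \mathrm{PS} = \frac{1}{N} \sum_{i=1}^{N} \frac{c_i}{s_i},

where $\hat{a}_i$ and $a_i$ are the predicted and reference final answers, and $c_i$ of the $s_i$ intermediate reasoning steps in solution $i$ are judged correct.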

Ablation studies reveal that the code-driven visual thought paradigm not only increases reasoning precision but also facilitates error diagnosis and interpretation during evaluation, which is not possible with black-box image generators.

6. Significance, Applications, and Future Directions

CodePlot-CoT inaugurates a new methodology for “thinking with images” using code as the representational glue between vision and language. Its significance lies in:

  • Bridging linguistic and visual representations with controllable, executable, and interpretable constructs.
  • Enabling stepwise, verifiable, and correction-friendly reasoning in AI systems, especially for complex, structured mathematical domains.
  • Providing a blueprint for multimodal reasoning pipelines in other fields where structured diagrams or process illustrations are critical.

Future research is anticipated to improve image-to-code conversion quality, extend the methodology to additional STEM domains and languages, and address multimodal tasks beyond mathematics. Potential developments include richer annotation protocols, expanded benchmark coverage, and broader real-world applicability in education, research, and scientific discovery.

7. Open Resources and Community Impact

The CodePlot-CoT project includes open access to datasets, the codebase, benchmarking utilities, and pretrained models at https://github.com/HKU-MMLab/Math-VR-CodePlot-CoT. These resources are intended to foster reproducibility, encourage further innovation in multimodal CoT research, and accelerate advances in the integration of language, code, and visual reasoning systems.
