
WizardMath: Math LLMs & CAS Interface

Updated 13 November 2025
  • WizardMath is a family of open-source large language models and an interactive symbolic system designed for advanced mathematical problem solving and stepwise reasoning.
  • It utilizes a three-phase training process with supervised fine-tuning, reward model training, and PPO-driven active instruction evolution to optimize chain-of-thought reasoning.
  • The platform also features a user-centric interface layer on traditional CAS engines, enabling dynamic, granular symbolic manipulation for transformation-based math tasks.

WizardMath is a family of open-source LLMs and interactive symbolic systems targeting advanced mathematical problem solving and reasoning. The term encompasses both (a) transformer-based LLMs fine-tuned for mathematical chain-of-thought (CoT) reasoning via reinforcement learning from evolved instructions, and (b) an interactive computer algebra "wizard" interface for guided symbolic manipulation. The distinguishing feature of the WizardMath LLMs is the Reinforcement Learning from Evol-Instruct Feedback (RLEIF) procedure, which couples synthetic instruction evolution with process-level reward modeling to systematically improve stepwise reasoning without external calculators. The WizardMath interface layer, built on CAS engines, provides a user-centric environment for granular, transformation-based manipulation.

1. Model Architecture and Core Methodology

WizardMath LLMs are built atop the LLaMA-2 transformer backbone, provided in 7B, 13B, and 70B parameter versions. No architectural changes are made to the core self-attention or feedforward blocks; instead, a lightweight policy/value head is added for reinforcement updates. The WizardMath-7B, for example, uses 32 transformer layers, hidden dimension $d = 4096$, and 32 attention heads.
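
For reference, these 7B dimensions can be captured in a small configuration object. This is an illustrative sketch, not code from the WizardMath release; the field names are hypothetical and the vocabulary size is the standard LLaMA-2 value.

```python
from dataclasses import dataclass

@dataclass
class WizardMath7BConfig:
    """Illustrative LLaMA-2-7B-style dimensions quoted above (names are hypothetical)."""
    n_layers: int = 32          # transformer blocks
    d_model: int = 4096         # hidden dimension
    n_heads: int = 32           # attention heads
    d_head: int = 4096 // 32    # 128-dimensional heads
    vocab_size: int = 32000     # standard LLaMA-2 tokenizer vocabulary

print(WizardMath7BConfig())
```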

The signature RLEIF pipeline comprises three phases:

  1. Supervised Fine-Tuning (SFT):
    • Re-generation of stepwise math solutions for GSM8k and MATH datasets using WizardLM-70B, filtering to ensure correctness (~15k math chains).
    • Inclusion of 1.5k open-domain, instruction-style samples for diversity.
  2. Reward Model Training:
    • Instruction Reward Model (IRM): Ranks the quality of evolved math instructions ($r^I = \mathrm{IRM}(i)$), using ChatGPT+Wizard-E annotation and pairwise ranking.
    • Process Reward Model (PRM): Ranks stepwise correctness within chains of thought ($r^A = \mathrm{PRM}(\mathrm{answer})$), trained on per-step correctness labels from ChatGPT.
  3. Active Instruction Evolution and PPO:
    • For 8 rounds, Wizard-E (an SFT'd LLaMA model) generates 2–4 upward (harder) or downward (easier) variants per instruction. IRM ranks these variants, with the best added to the dataset, expanding from ~15k to ~96k tasks.
    • The fine-tuned LLM policy is updated with Proximal Policy Optimization (PPO) using the reward $r_j = r^I_j \times r^A_j$ on each instruction-answer batch.

The process can be formally summarized by the following loss: $$L_{\mathrm{PPO}}(\theta) = \mathbb{E}_t \Big[ \min\!\big(\rho_t(\theta)\, A_t,\; \mathrm{clip}\!\big(\rho_t(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\, A_t\big) \Big]$$ where $\rho_t(\theta) = \frac{\pi_\theta(a_t \mid i_t)}{\pi_{\text{old}}(a_t \mid i_t)}$ and $A_t$ is the advantage (here, the discounted sum of rewards minus a baseline).
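
A minimal PyTorch-style sketch of this objective is given below. It assumes per-sample instruction and process rewards are already available and that `logp_new`/`logp_old` are the summed log-probabilities of each sampled answer; it is illustrative, not the released training code.

```python
import torch

def rleif_reward(r_instruction: torch.Tensor, r_process: torch.Tensor) -> torch.Tensor:
    # Joint RLEIF reward r_j = r^I_j * r^A_j for each instruction-answer pair.
    return r_instruction * r_process

def ppo_clipped_loss(logp_new: torch.Tensor,
                     logp_old: torch.Tensor,
                     advantages: torch.Tensor,
                     eps: float = 0.2) -> torch.Tensor:
    # rho_t(theta) = pi_theta(a_t | i_t) / pi_old(a_t | i_t), from log-probabilities.
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    # Negate because optimizers minimize: this maximizes the clipped surrogate objective.
    return -torch.min(unclipped, clipped).mean()
```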

Instruction Evolution ("Evol-Instruct") systematically grades task complexity: downward evolution simplifies instructions, while upward evolution compounds constraints. PRM-driven process supervision is integral to producing syntactically and logically coherent intermediate reasoning steps.
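
A schematic of one evolution round is sketched below, assuming hypothetical `wizard_e.generate_variants` and `irm.score` interfaces; the real pipeline's prompting and filtering details are richer than this.

```python
import random

def evolve_round(instructions, wizard_e, irm, n_variants=3):
    """One Evol-Instruct round: Wizard-E proposes harder ('upward') or easier
    ('downward') variants of each instruction, and the IRM keeps the best one."""
    kept = []
    for inst in instructions:
        direction = random.choice(["upward", "downward"])
        variants = wizard_e.generate_variants(inst, direction=direction, n=n_variants)
        kept.append(max(variants, key=irm.score))  # IRM returns a scalar quality score
    return instructions + kept  # the instruction corpus grows each round
```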

2. Training Datasets, Supervision, and Fine-Tuning Strategies

  • Seed Data: GSM8k (7.5k train/1.3k test) and MATH (7.5k train/5k test), supplemented with open-domain instructions.
  • Instruction Expansion: Evol-Instruct process yields a 96k instruction-answer corpus.
  • Reward Model Supervision: Both IRM and PRM leverage ChatGPT for initial pairwise and per-step supervision; however, model training and inference rely solely on open-source LLaMA-style architectures.
  • Optimization: AdamW is used in SFT phases; PPO in RL phases. In Mathify (Anand et al., 19 Apr 2024), QLoRA (4-bit quantization + LoRA adapters) is adopted for scaling (a configuration sketch follows this list).
  • Curriculum Learning (Anand et al., 24 Dec 2024): WizardMath-7B employs a two-stage schedule, fine-tuning first on "easy" items, then on "medium/hard" in succession; problem difficulties are human-annotated (Fleiss' $\kappa = 0.58$).
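
The QLoRA adaptation mentioned above can be sketched with the Hugging Face `transformers`/`peft`/`bitsandbytes` stack; the hyperparameters, target modules, and checkpoint name below are illustrative choices, not the ones reported in Mathify.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# 4-bit NF4 quantization of the frozen base model.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "WizardLMTeam/WizardMath-13B-V1.0",  # any causal-LM checkpoint works here
    quantization_config=bnb_config,
    device_map="auto",
)
# Low-rank adapters on the attention projections are the only trained parameters.
lora = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()
```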

Supervised fine-tuning objective: $\mathcal{L}_{CE} = -\sum_{t=1}^{T} y_t \log p_\theta(y_t \mid y_{<t}, x)$
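
In PyTorch terms, with prompt tokens masked out of the labels, this is the usual shifted token-level cross-entropy (a sketch, not the actual training script):

```python
import torch
import torch.nn.functional as F

def sft_loss(logits: torch.Tensor, labels: torch.Tensor, ignore_index: int = -100) -> torch.Tensor:
    """L_CE = -sum_t log p_theta(y_t | y_<t, x); logits: (batch, seq, vocab),
    labels: (batch, seq) with prompt positions set to `ignore_index`."""
    # Shift so that position t predicts token t+1 (causal-LM convention).
    shifted_logits = logits[:, :-1, :].contiguous()
    shifted_labels = labels[:, 1:].contiguous()
    return F.cross_entropy(
        shifted_logits.view(-1, shifted_logits.size(-1)),
        shifted_labels.view(-1),
        ignore_index=ignore_index,
    )
```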

Chain-of-thought prompting is standard (“Let’s think step by step.” for zero-shot; few-shot mixes 4–8 exemplars).
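
A minimal prompt-construction helper illustrating both regimes (the exact templates used in the papers may differ):

```python
def build_cot_prompt(question: str, exemplars: list[tuple[str, str]] | None = None) -> str:
    """Zero-shot CoT uses the trigger phrase; few-shot prepends worked exemplars."""
    parts = [f"Question: {q}\nAnswer: {a}\n" for q, a in (exemplars or [])]
    parts.append(f"Question: {question}\nAnswer: Let's think step by step.")
    return "\n".join(parts)
```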

3. Quantitative Evaluation and Comparative Performance

WizardMath LLMs are evaluated on GSM8k (grade-school arithmetic) and MATH (competition problems, 7 topics), using pass@1 (exact answer match with CoT under greedy decoding) as the principal metric.
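
A sketch of this metric is given below, assuming the common convention that each chain of thought ends with an explicit "The answer is ..." statement; answer-extraction details vary across evaluation harnesses.

```python
import re

def extract_final_answer(completion: str) -> str | None:
    """Pull the final numeric answer from a chain-of-thought completion."""
    match = re.search(r"answer is\s*\$?(-?[\d,\.]+)", completion, re.IGNORECASE)
    return match.group(1).replace(",", "") if match else None

def pass_at_1(completions: list[str], references: list[str]) -> float:
    """pass@1 under greedy decoding: one completion per problem, exact answer match."""
    correct = sum(
        extract_final_answer(c) == r.replace(",", "")
        for c, r in zip(completions, references)
    )
    return correct / len(references)
```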

Table: pass@1 (%) on GSM8k and MATH (Luo et al., 2023)

Model | Params | GSM8k | MATH
GPT-4 | – | 92.0 | 42.5
Claude 2 | – | 88.0 | –
PaLM 2 | 540B | 84.7 | 33.2
ChatGPT (3.5) | – | 80.8 | 34.1
Llama 2 70B | 70B | 56.8 | 13.5
Falcon 40B | 40B | 19.6 | 2.5
MPT 30B | 30B | 15.2 | 3.1
WizardMath 70B | 70B | 81.6 | 22.7
WizardMath 13B | 13B | 63.9 | 14.0
WizardMath 7B | 7B | 54.9 | 10.7

MATH subtopic breakdown (70B):

Topic | Accuracy (%)
Prealgebra | 41.7
Algebra | 33.3
Counting & Probability | 17.3
Number Theory | 16.3
Geometry | 15.7
Precalculus | 12.6
Intermediate Algebra | 7.1
Overall | 22.7

On English GSM8k, WizardMath-70B (81.6%) matches or exceeds ChatGPT-3.5 and Claude Instant, with substantial improvements over vanilla open-source LLMs at similar scale. Across MATH, WizardMath-70B (22.7%) surpasses text-davinci-002 and is competitive given the absence of external tool-calls at inference. In “Mathify” (Anand et al., 19 Apr 2024), fine-tuned WizardMath-13B achieves 20.1% on MathQuest (NCERT test set), outperformed by MAmmoTH-13B (24.0%) but showing substantial improvement post-fine-tuning.

In multilingual benchmarks (Anand et al., 24 Dec 2024), WizardMath-7B achieves 61% average accuracy in English (EMKB) and 54% in Hindi (HMKB), outperforming Gemini Pro (by +6 pp in English) and matching closed-source results in Hindi at a significantly lower parameter budget. GSM8k results: WizardMath (80%) vs. Gemini Pro (75%); MATH: 45% vs. 39%.

4. Instruction Evolution, Process Supervision, and Reward Composition

Instruction Evolution:

  • Core to RLEIF. Upward evolution makes tasks harder by adding constraints or steps; downward evolution simplifies instructions.
  • Each instruction $i$ spawns 2–4 variants per round via Wizard-E, with IRM ranking.
  • Eight rounds yield multi-level instructional complexity.

Process Supervision:

  • Instead of rewarding only final-answer correctness, PRM encourages each intermediate reasoning step to be correct.
  • This enforces logical stepwise consistency and reduces hallucinated or incoherent chains.

Reward Structure:

  • Joint reward: $r = r^I \cdot r^A$, combining instruction (interpretation/understanding) and answer (execution) fidelity.
  • PPO updates with the above, promoting both task comprehension and execution accuracy.

Ablation Behavior:

  • Eliminating IRM (instruction reward) leads to overfitting to trivial or edge-case variants.
  • Without PRM (process reward), chains drift—final answer correctness is decoupled from stepwise validity.
  • Each RLEIF component yields incremental gains; their combination produces the observed ~25 pp improvement on GSM8k over Llama 2 70B (Luo et al., 2023).

5. Multilingual and Curriculum Learning Frameworks

Multilingual Capabilities:

  • WizardMath-7B is adapted for both English and Hindi by binning datasets (EMKB = English; HMKB = Hindi) into Easy/Medium/Hard problems.
  • English ↔ Hindi translation is done via LLAMA-3 (405B) and human curation.
  • A 1:1 parallel corpus is constructed and augmented via GPT-4 expansion, with verification.

Curriculum Learning:

  • Problems are staged for training: the model is first fine-tuned on 70% of the Easy items, then trained on the Medium items, and finally evaluated on Hard test items (a schematic of this schedule follows the list).
  • Each problem is annotated for difficulty; this staged approach shows improvements in arithmetic ability and generalization.
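
The staged schedule can be sketched as follows, with a hypothetical `trainer` helper standing in for one supervised pass or evaluation over a difficulty split:

```python
def curriculum_finetune(model, splits, trainer):
    """Staged curriculum sketch: fine-tune on a subset of Easy items, continue
    on Medium items, and hold out Hard items for evaluation (illustrative only;
    `trainer.fit`/`trainer.evaluate` are hypothetical helpers)."""
    model = trainer.fit(model, splits["easy"])                # e.g. 70% of the Easy bin
    model = trainer.fit(model, splits["medium"])              # continue on harder problems
    hard_accuracy = trainer.evaluate(model, splits["hard"])   # held-out Hard items
    return model, hard_accuracy
```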

Novel Decomposition Strategy:

  • Multi-digit arithmetic is decomposed by place value: $a = \sum_i d_i 10^i$, so multiplication becomes $a \times b = \sum_i (d_i 10^i) \times b$ and division $a / b = \sum_i \frac{d_i 10^i}{b}$ (a minimal sketch of the multiplicative case follows).
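
A minimal check of the multiplicative case:

```python
def digits_by_place(a: int) -> list[tuple[int, int]]:
    """Decompose a = sum_i d_i * 10^i into (digit, place-value) pairs."""
    return [(int(d), 10 ** i) for i, d in enumerate(str(a)[::-1])]

def decomposed_multiply(a: int, b: int) -> int:
    """a * b = sum_i (d_i * 10^i) * b, the place-value expansion described above."""
    return sum(d * place * b for d, place in digits_by_place(a))

assert decomposed_multiply(473, 28) == 473 * 28  # both give 13244
```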

Structured Solution Design:

  • WizardMath-7B (in multilingual contexts) is fine-tuned to produce solutions in six phases: Data Identification, Problem Analysis, Theoretical Framework, Methodology Development, Computation, Answer.

Illustrative example for a hyperbola problem:

$\text{Foci} = (0, \pm 3)$, $\ \text{Vertices} = \left(0, \pm\tfrac{\sqrt{11}}{2}\right) \;\rightarrow\; \frac{y^2}{11/4} - \frac{x^2}{25/4} = 1 \;\implies\; 100y^2 - 44x^2 = 275$
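
A quick numeric check of this example using exact fractions:

```python
from fractions import Fraction

a2 = Fraction(11, 4)   # a^2 from the vertices (0, ±sqrt(11)/2)
c2 = Fraction(9)       # c^2 from the foci (0, ±3)
b2 = c2 - a2           # for a hyperbola, b^2 = c^2 - a^2
assert b2 == Fraction(25, 4)
# Multiplying y^2/a^2 - x^2/b^2 = 1 through by 275 clears the denominators:
assert 275 / a2 == 100 and 275 / b2 == 44   # i.e. 100 y^2 - 44 x^2 = 275
```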

6. User Interface Layer: Interactive Symbolic Wizard

Separate from the LLM pipeline, a “WizardMath” interface is proposed as an interactive layer on top of traditional computer algebra system (CAS) engines (Stoutemyer, 2013). This interface is characterized by:

  • Model–View–Controller Design: The controller computes applicability predicates, schedules lightweight previews, builds a tree of transformation alternatives, and handles undo/redo.
  • Dynamic Dialogs: Context-sensitive popups for subexpression selection, transformation choice (e.g., factor, expand, partial-fraction), and output style (input-result, derivation step, or in-situ replacement).
  • Direct Manipulation: Highlighting, dragging for subexpression grouping and partitioning, and sliders/toggles for intermediate result navigation.
  • Transformation Organization: Alternatives ranked by estimated benefit–cost heuristics; categories partially-ordered by a small DAG; previews given in LaTeX with ellipsis elisions for long subterms.
  • Efficient Exploration: By interleaving variable choice and top-K transformation choices, the interface avoids combinatorial explosion, visiting at most $\sum_{i=0}^{|G|} K^i$ nodes per session for $|G|$ variables and $K$ alternatives per choice (see the sketch after this list).
  • Exploration Features: Unlimited backtracking, session trees, result accumulation, and click-to-expand for elided terms.
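
The node bound above is simply a geometric series; a quick illustration:

```python
def max_visited_nodes(num_vars: int, k: int) -> int:
    """Upper bound sum_{i=0}^{|G|} K^i on nodes visited in one session,
    with |G| variables and K top-ranked alternatives at each choice point."""
    return sum(k ** i for i in range(num_vars + 1))

print(max_visited_nodes(3, 4))  # 1 + 4 + 16 + 64 = 85 nodes at most
```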

7. Limitations and Future Directions

WizardMath, despite its substantial improvement over comparable open-source models, does not fully close the gap with proprietary giants on the most challenging benchmarks: on GSM8k, WizardMath 70B (81.6%) vs. GPT-4 (92%) and Claude 2 (88%). Performance on NCERT-style (MathQuest) problems tops out at 20.1% for WizardMath-13B, trailing MAmmoTH-13B (24.0%) (Anand et al., 19 Apr 2024). Limitations are ascribed to:

  • Dependence on ChatGPT for IRM/PRM supervision; further gains may require human-annotated process signals.
  • Potentially suboptimal curriculum or instructional coverage—not all rounds of evolution may be equally informative; scheduling remains an open area.
  • Limited coverage of domains such as symbolic integration/proof or advanced calculus.

Proposed directions include enhanced curriculum learning, incorporation of symbolic engines for arithmetic, and extension to non-English or multi-modal mathematical domains (Anand et al., 24 Dec 2024). The combination of diverse, difficulty-calibrated instruction generation and rigorous process supervision remains central to WizardMath’s effectiveness in LLM mathematical reasoning.
