MalruleLib: Modeling Math Misconceptions
- MalruleLib is a Python framework that models 101 distinct mathematical malrules as executable procedures, capturing systematic student misconceptions.
- It generates over one million paired instances using parameterized templates and dual-path traces to support scalable evaluation and remediation of errors.
- The framework introduces the Malrule Reasoning Accuracy (MRA) benchmark, enabling precise diagnosis and prediction of misconception patterns in diverse math categories.
MalruleLib is a Python-based framework and research infrastructure for modeling systematic student misconceptions—referred to as malrules—in mathematics. Drawing on 67 learning-science and mathematics education sources, it operationalizes 101 distinct malrules as executable procedures, aligning them to parameterized problem templates and generating paired traces for both correct and misconception-driven student reasoning. MalruleLib supports controlled evaluation and scalable data generation for the diagnosis, prediction, and remediation of mathematical misconceptions using both educational AI systems and LLMs (Chen et al., 6 Jan 2026).
1. Conceptual Foundations: Malrules and Systematic Error
MalruleLib is grounded in empirical studies from mathematics education, which establish that student mistakes often follow coherent, repeatable procedures rather than being random slips. These systematic patterns—malrules, procedural bugs, or misconceptions—are documented in sources such as Brown and Burton (1978) and Siegler et al. (2012). For example, a student might repeatedly add numerators and denominators when adding fractions, exhibiting this erroneous procedure consistently across symbolic, contextualized, and word-problem formats.
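For concreteness, this add-across malrule can be expressed as a short executable procedure alongside its correct counterpart. The sketch below is purely illustrative and does not reflect MalruleLib's actual API:

```python
from fractions import Fraction

def correct_add(a, b, c, d):
    """Correct fraction addition: a/b + c/d."""
    return Fraction(a, b) + Fraction(c, d)

def malrule_add(a, b, c, d):
    """Add-across malrule: add numerators and denominators separately."""
    return Fraction(a + c, b + d)

# The malrule is systematic: it applies the same wrong procedure on every instance.
print(correct_add(1, 2, 1, 3))  # 5/6
print(malrule_add(1, 2, 1, 3))  # 2/5
```

Because the erroneous procedure is deterministic, the same pair of functions can be evaluated on any instance, which is exactly what makes malrules simulable rather than merely describable.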
This observation motivates the design of intelligent tutoring systems and educational AI models that can infer which malrule a student applies and predict the manifestation of that malrule on novel problems. Conventional error catalogs describe what mistakes occur but lack operational, procedure-driven representations that enable simulation, diagnosis, or prediction at scale. MalruleLib addresses this gap by encoding malrules as Python modules, allowing them to be “run” on arbitrary instances and traced step-by-step.
2. Formal Structure and Executable Representation
MalruleLib implements malrules as modular Python packages, each comprising four components:
- problem_generator.py: Defines a set of parameterized templates per malrule.
- correct_algorithm.py: Outputs the step-by-step correct solution and final answer for a given instance.
- malrule_algorithm.py: Outputs the step-by-step malrule-consistent solution, including the reasoning trace and the final malrule answer.
- test_malrule.py: Contains unit tests to ensure correctness under all templates.
Templates parameterize families of problems with variables (numbers, word-problem particulars, etc.), sampling instances that satisfy constraints specific to the malrule (e.g., subtraction templates requiring borrowing). This structure allows precise control over the elicitation of malrules.
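A single malrule package might look like the following condensed sketch, here for the classic “smaller-from-larger” subtraction bug on borrowing instances. All names and interfaces are assumptions for illustration; MalruleLib splits these roles across the four files listed above:

```python
import random

def problem_generator(seed=None):
    """Sample a two-digit subtraction instance that requires borrowing,
    i.e. the ones digit of the minuend is smaller than the subtrahend's."""
    rng = random.Random(seed)
    while True:
        a, b = rng.randint(10, 99), rng.randint(10, 99)
        if a > b and a % 10 < b % 10:  # borrowing needed -> malrule elicited
            return a, b

def correct_algorithm(a, b):
    """Step-by-step correct subtraction with borrowing."""
    steps = [
        f"{a} - {b}: borrow to compute the ones column",
        f"ones: {a % 10 + 10} - {b % 10} = {a % 10 + 10 - b % 10}",
        f"tens: {a // 10 - 1} - {b // 10} = {a // 10 - 1 - b // 10}",
    ]
    return steps, a - b

def malrule_algorithm(a, b):
    """'Smaller-from-larger' bug: in each column, subtract the smaller
    digit from the larger, ignoring place value (Brown & Burton, 1978)."""
    ones = abs(a % 10 - b % 10)
    tens = abs(a // 10 - b // 10)
    steps = [f"ones: |{a % 10} - {b % 10}| = {ones}",
             f"tens: |{a // 10} - {b // 10}| = {tens}"]
    return steps, 10 * tens + ones

# test_malrule-style check: the two paths must disagree on every sampled instance.
a, b = problem_generator(seed=0)
(_, y), (_, y_hat) = correct_algorithm(a, b), malrule_algorithm(a, b)
assert y != y_hat
```

The generator's constraint (`a % 10 < b % 10`) mirrors the template constraints described above: only instances that require borrowing can elicit this particular malrule.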
Notation for core entities:
- $x_s$, $x_t$: source and target instances.
- $y(x)$: correct answer on instance $x$.
- $\hat{y}_m(x)$: malrule answer under malrule $m$.
- $\tau_m(x)$: sequence of malrule-consistent reasoning steps on $x$.
3. Data Generation: Templates, Dual-Path Traces, and Scalability
MalruleLib spans 498 parameterized templates distributed across 22 mathematical categories, with coverage including:
- 54 malrules in Number Operations (whole numbers, fractions, decimals, signed numbers),
- 37 in Algebra (exponents, radicals, expressions, equations, functions),
- 8 in Geometry & Measurement,
- 4 in Data & Modeling (statistics, word problems).
Each malrule averages 4.9 templates, classified across scaffold levels (basic formula 18.5%, structural variants 50.8%, contextualized 6.2%, word problems 24.5%) and context domains (e.g., symbolic: 63%, plus domains such as measurement, money, science, sports).
The trace-generation engine iterates over malrules, templates, and admissible parameter settings, producing for each instance both the correct solution steps and the malrule-consistent steps, together with the two final answers. With thousands of parameter combinations per template, MalruleLib produces over one million distinct paired instances, each containing both the correct and the misconception-driven solution path. This dual-path data enables targeted supervision and benchmarking for both model training and evaluation.
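A minimal version of such a generation loop, using a toy malrule specification rather than MalruleLib's actual schema (all field names here are invented for illustration), could look like:

```python
from fractions import Fraction
from itertools import product

# Toy malrule spec: fraction addition with the add-across malrule.
# The structure and field names are illustrative, not MalruleLib's schema.
toy_malrule = {
    "id": "FRAC-ADD-ACROSS",
    "templates": [{
        "id": "symbolic-basic",
        "params": list(product(range(1, 5), range(2, 6), range(1, 5), range(2, 6))),
        "correct": lambda a, b, c, d: Fraction(a, b) + Fraction(c, d),
        "malrule": lambda a, b, c, d: Fraction(a + c, b + d),
    }],
}

def generate_paired_instances(malrule):
    """Iterate templates x parameter settings, yielding one dual-path
    record per instance (correct answer and malrule answer side by side)."""
    for template in malrule["templates"]:
        for params in template["params"]:
            yield {
                "malrule_id": malrule["id"],
                "template_id": template["id"],
                "instance": params,
                "correct_answer": template["correct"](*params),
                "malrule_answer": template["malrule"](*params),
            }

records = list(generate_paired_instances(toy_malrule))
print(len(records))  # 4 * 4 * 4 * 4 = 256 paired instances from one template
```

Even this toy grid yields hundreds of paired instances from a single template, which illustrates how 498 templates with richer parameter spaces scale past one million pairs.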
4. Benchmark Definition: Malrule Reasoning Accuracy (MRA) and Evaluation Tasks
MalruleLib formalizes the student-modeling challenge as Malrule Reasoning Accuracy (MRA), defined as follows:
- Given: An unknown malrule $m$, a source instance $x_s$ drawn from a template (with either its final malrule answer $\hat{y}_m(x_s)$ or its full trace $\tau_m(x_s)$), and a target instance $x_t$ (from the same or a different template).
- Task: Predict the malrule answer $\hat{y}_m(x_t)$.
- Metric: Accuracy—the fraction of predictions matching $\hat{y}_m(x_t)$ via normalized algebraic or numeric comparison.
Variants include:
- Forward MRA (FMRA): Given a natural-language description of the malrule $m$ and a target instance $x_t$, predict $\hat{y}_m(x_t)$.
- Correct Reasoning Accuracy (CRA): Baseline prediction of the correct answer $y(x_t)$.
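The normalized comparison underlying these accuracy metrics can be sketched as follows; this is a simplified, purely numeric stand-in for whatever algebraic normalization the framework actually uses:

```python
from fractions import Fraction

def normalize(ans):
    """Map an answer to a canonical value: '1/2', ' 2/4', and '0.5' all
    normalize to Fraction(1, 2); non-numeric answers fall back to a
    whitespace-stripped lowercase string."""
    s = str(ans).strip()
    try:
        return Fraction(s)  # Fraction parses both '3/7' and decimal strings
    except (ValueError, ZeroDivisionError):
        return s.lower().replace(" ", "")

def mra(predictions, references):
    """Fraction of predictions matching the malrule answers after normalization."""
    assert len(predictions) == len(references)
    hits = sum(normalize(p) == normalize(r) for p, r in zip(predictions, references))
    return hits / len(references)

print(mra(["1/2", "0.75", "x+1"], ["2/4", "3/4", "X + 1"]))  # 1.0
```

Normalizing before comparison ensures that a model is not penalized for surface-form differences (e.g., unreduced fractions or decimal equivalents) when its underlying prediction matches the malrule answer.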
Evaluation involves sampling within templates (same-template) and across templates (cross-template), including both answer-only and trace-based prompts. The scale of experimentation includes 4,991–7,700 pairs per task and ∼35,000 inference calls per model, totaling ∼320,000 across nine LLMs (ranging from 1.3B to 120B parameters).
5. Results: Model Performance, Trace Supervision, and Transfer Challenges
Empirical results establish quantitative gaps between direct problem solving, described misconception simulation, and cross-template malrule modeling:
| Metric | Accuracy (%) | Degradation vs. CRA (%) |
|---|---|---|
| CRA (correct solving) | 65.7 | 0 |
| FMRA | 32.3 | –33.5 |
| MRA cross-template, answer | 40.5 | –25.3 |
| MRA cross-template, steps | 46.5 | –19.2 |
| Same-template MRA, steps | 64.6 | –1.1 |
- Providing malrule traces in the prompt improves cross-template MRA by 6 points on average, with model-specific lifts between +3 and +15 points.
- Accuracy drop from same-template to cross-template averages 15.6 points (answer-only) and 18.1 points (with steps), indicating that current LLMs rely heavily on template cues and underperform at abstract malrule generalization.
- Domain-wise, Functions are easiest (MRA ≈ 82%) and Coordinate Geometry hardest (≈ 29%), with substantial difficulty in Signed Numbers (35%), Expressions & Equations (45%), Fractions & Ratios (44%), and Statistics (45%).
This suggests a systematic limitation in cross-context misconception prediction for current language-model architectures.
6. Contributions, Applications, and Research Impact
MalruleLib’s principal contributions are:
- Learning-Science-Grounded Misconception Library—encoding 101 malrules traceable to 67 peer-reviewed sources, with coverage across 22 mathematical categories and 498 templates.
- Dual-Path Traces at Million-Instance Scale—enabling large-scale, trace-level supervision and data generation for model training.
- Benchmark for Cross-Template Misconception Prediction—introducing MRA as an operationalized “Educational Turing Test” (Sonkar et al. 2025), with controlled settings that reveal model limitations.
Applications include:
- Automated diagnosis of student misconceptions from partial work or single incorrect answers.
- Generation of feedback and targeted interventions based on inferred malrules, improving remediation strategies.
- Adaptive sequencing of instructional content, anticipating malrule triggers in varied contexts.
- Cross-context student modeling using executable malrules and parameterized problem formats.
- Extensible research infrastructure supporting curriculum design, cognitive modeling, and benchmarking for educational AI.
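As an illustration of the diagnosis application, executable malrules enable inverse simulation: run every candidate malrule on the student's problem and keep those that reproduce the observed wrong answer. The catalog entries and names below are hypothetical:

```python
from fractions import Fraction

# Hypothetical catalog: each candidate malrule is an executable procedure
# for fraction addition a/b + c/d.
CANDIDATE_MALRULES = {
    "add-across": lambda a, b, c, d: Fraction(a + c, b + d),
    "add-numerators-keep-first-denominator": lambda a, b, c, d: Fraction(a + c, b),
    "multiply-instead-of-add": lambda a, b, c, d: Fraction(a * c, b * d),
}

def diagnose(a, b, c, d, student_answer):
    """Return the malrules whose simulated output matches the student's answer."""
    return [name for name, rule in CANDIDATE_MALRULES.items()
            if rule(a, b, c, d) == student_answer]

# A student answers 1/2 + 1/3 = 2/5: consistent only with adding across.
print(diagnose(1, 2, 1, 3, Fraction(2, 5)))  # ['add-across']
```

In practice a single wrong answer may be consistent with several malrules, so a tutoring system would accumulate evidence across multiple instances before committing to a diagnosis.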
A plausible implication is that MalruleLib supports granular and scalable assessment of both student behaviors and AI model capabilities, enabling the community to advance toward AI systems that interpret, predict, and remediate mathematical misconceptions at the procedural level (Chen et al., 6 Jan 2026).