MalruleLib: Modeling Math Misconceptions
- MalruleLib is a Python framework that models 101 distinct mathematical malrules as executable procedures, capturing systematic student misconceptions.
- It generates over one million paired instances using parameterized templates and dual-path traces to support scalable evaluation and remediation of errors.
- The framework introduces the Malrule Reasoning Accuracy (MRA) benchmark, enabling precise diagnosis and prediction of misconception patterns in diverse math categories.
MalruleLib is a Python-based framework and research infrastructure for modeling systematic student misconceptions—referred to as malrules—in mathematics. Drawing on 67 learning-science and mathematics education sources, it operationalizes 101 distinct malrules as executable procedures, aligning them to parameterized problem templates and generating paired traces for both correct and misconception-driven student reasoning. MalruleLib supports controlled evaluation and scalable data generation for the diagnosis, prediction, and remediation of mathematical misconceptions using both educational AI systems and LLMs (Chen et al., 6 Jan 2026).
1. Conceptual Foundations: Malrules and Systematic Error
MalruleLib is grounded in empirical studies from mathematics education, which establish that student mistakes often follow coherent, repeatable procedures rather than being random slips. These systematic patterns—malrules, procedural bugs, or misconceptions—are documented in sources such as Brown and Burton (1978) and Siegler et al. (2012). For example, a student might repeatedly add numerators and denominators when adding fractions, exhibiting this erroneous procedure consistently across symbolic, contextualized, and word-problem formats.
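For concreteness, this add-across malrule can be expressed as a short executable procedure alongside its correct counterpart. The sketch below is purely illustrative and does not reflect MalruleLib's actual API:

```python
from fractions import Fraction

def correct_add(a, b, c, d):
    """Correct fraction addition: a/b + c/d."""
    return Fraction(a, b) + Fraction(c, d)

def malrule_add(a, b, c, d):
    """Add-across malrule: add numerators and denominators separately."""
    return Fraction(a + c, b + d)

# The malrule is systematic: it applies the same wrong procedure on every instance.
print(correct_add(1, 2, 1, 3))  # 5/6
print(malrule_add(1, 2, 1, 3))  # 2/5
```

Because the erroneous procedure is deterministic, the same pair of functions can be evaluated on any instance, which is exactly what makes malrules simulable rather than merely describable.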
This observation motivates the design of intelligent tutoring systems and educational AI models that can infer which malrule a student applies and predict the manifestation of that malrule on novel problems. Conventional error catalogs describe what mistakes occur but lack operational, procedure-driven representations that enable simulation, diagnosis, or prediction at scale. MalruleLib addresses this gap by encoding malrules as Python modules, allowing them to be “run” on arbitrary instances and traced step-by-step.
2. Formal Structure and Executable Representation
MalruleLib implements malrules as modular Python packages, each comprising four components:
- problem_generator.py: Defines a set of parameterized templates per malrule.
- correct_algorithm.py: Outputs the step-by-step correct solution and final answer for a given instance.
- malrule_algorithm.py: Outputs the step-by-step malrule-consistent solution, including the reasoning trace and the final malrule answer.
- test_malrule.py: Contains unit tests to ensure correctness under all templates.
Templates parameterize families of problems with variables (numbers, word-problem particulars, etc.), sampling instances that satisfy constraints specific to the malrule (e.g., subtraction templates requiring borrowing). This structure allows precise control over the elicitation of malrules.
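A single malrule package might look like the following condensed sketch, here for the classic “smaller-from-larger” subtraction bug on borrowing instances. All names and interfaces are assumptions for illustration; MalruleLib splits these roles across the four files listed above:

```python
import random

def problem_generator(seed=None):
    """Sample a two-digit subtraction instance that requires borrowing,
    i.e. the ones digit of the minuend is smaller than the subtrahend's."""
    rng = random.Random(seed)
    while True:
        a, b = rng.randint(10, 99), rng.randint(10, 99)
        if a > b and a % 10 < b % 10:  # borrowing needed -> malrule elicited
            return a, b

def correct_algorithm(a, b):
    """Step-by-step correct subtraction with borrowing."""
    steps = [
        f"{a} - {b}: borrow to compute the ones column",
        f"ones: {a % 10 + 10} - {b % 10} = {a % 10 + 10 - b % 10}",
        f"tens: {a // 10 - 1} - {b // 10} = {a // 10 - 1 - b // 10}",
    ]
    return steps, a - b

def malrule_algorithm(a, b):
    """'Smaller-from-larger' bug: in each column, subtract the smaller
    digit from the larger, ignoring place value (Brown & Burton, 1978)."""
    ones = abs(a % 10 - b % 10)
    tens = abs(a // 10 - b // 10)
    steps = [f"ones: |{a % 10} - {b % 10}| = {ones}",
             f"tens: |{a // 10} - {b // 10}| = {tens}"]
    return steps, 10 * tens + ones

# test_malrule-style check: the two paths must disagree on every sampled instance.
a, b = problem_generator(seed=0)
(_, y), (_, y_hat) = correct_algorithm(a, b), malrule_algorithm(a, b)
assert y != y_hat
```

The generator's constraint (`a % 10 < b % 10`) mirrors the template constraints described above: only instances that require borrowing can elicit this particular malrule.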
Notation for core entities:
- $x_s$, $x_t$: source and target instances.
- $y(x)$: correct answer on instance $x$.
- $\hat{y}_m(x)$: malrule answer under malrule $m$.
- $\tau_m(x)$: sequence of malrule-consistent reasoning steps on $x$.
3. Data Generation: Templates, Dual-Path Traces, and Scalability
MalruleLib spans 498 parameterized templates distributed across 22 mathematical categories, with coverage including:
- 54 malrules in Number Operations (whole numbers, fractions, decimals, signed numbers),
- 37 in Algebra (exponents, radicals, expressions, equations, functions),
- 8 in Geometry & Measurement,
- 4 in Data & Modeling (statistics, word problems).
Each malrule averages 4.9 templates, classified across scaffold levels (basic formula 18.5%, structural variants 50.8%, contextualized 6.2%, word problems 24.5%) and context domains (e.g., symbolic: 63%, plus domains such as measurement, money, science, sports).
The trace-generation engine iterates over malrules, templates, and admissible parameter settings, producing for each instance both the correct solution steps and the malrule-consistent steps, together with the two final answers. With thousands of parameter combinations per template, MalruleLib produces over one million distinct paired instances, each containing both the correct and the misconception-driven solution path. This dual-path data enables targeted supervision and benchmarking for both model training and evaluation.
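A minimal version of such a generation loop, using a toy malrule specification rather than MalruleLib's actual schema (all field names here are invented for illustration), could look like:

```python
from fractions import Fraction
from itertools import product

# Toy malrule spec: fraction addition with the add-across malrule.
# The structure and field names are illustrative, not MalruleLib's schema.
toy_malrule = {
    "id": "FRAC-ADD-ACROSS",
    "templates": [{
        "id": "symbolic-basic",
        "params": list(product(range(1, 5), range(2, 6), range(1, 5), range(2, 6))),
        "correct": lambda a, b, c, d: Fraction(a, b) + Fraction(c, d),
        "malrule": lambda a, b, c, d: Fraction(a + c, b + d),
    }],
}

def generate_paired_instances(malrule):
    """Iterate templates x parameter settings, yielding one dual-path
    record per instance (correct answer and malrule answer side by side)."""
    for template in malrule["templates"]:
        for params in template["params"]:
            yield {
                "malrule_id": malrule["id"],
                "template_id": template["id"],
                "instance": params,
                "correct_answer": template["correct"](*params),
                "malrule_answer": template["malrule"](*params),
            }

records = list(generate_paired_instances(toy_malrule))
print(len(records))  # 4 * 4 * 4 * 4 = 256 paired instances from one template
```

Even this toy grid yields hundreds of paired instances from a single template, which illustrates how 498 templates with richer parameter spaces scale past one million pairs.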
4. Benchmark Definition: Malrule Reasoning Accuracy (MRA) and Evaluation Tasks
MalruleLib formalizes the student-modeling challenge as Malrule Reasoning Accuracy (MRA), defined as follows:
- Given: An unknown malrule $m$, a source instance $x_s$ drawn from a template (with either its final malrule answer $\hat{y}_m(x_s)$ or its full trace $\tau_m(x_s)$), and a target instance $x_t$ (from the same or a different template).
- Task: Predict the malrule answer $\hat{y}_m(x_t)$.
- Metric: Accuracy—the fraction of predictions matching $\hat{y}_m(x_t)$ via normalized algebraic or numeric comparison.
Variants include:
- Forward MRA (FMRA): Given a natural-language description of the malrule $m$ and a target instance $x_t$, predict $\hat{y}_m(x_t)$.
- Correct Reasoning Accuracy (CRA): Baseline prediction of the correct answer $y(x_t)$.
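The normalized comparison underlying these accuracy metrics can be sketched as follows; this is a simplified, purely numeric stand-in for whatever algebraic normalization the framework actually uses:

```python
from fractions import Fraction

def normalize(ans):
    """Map an answer to a canonical value: '1/2', ' 2/4', and '0.5' all
    normalize to Fraction(1, 2); non-numeric answers fall back to a
    whitespace-stripped lowercase string."""
    s = str(ans).strip()
    try:
        return Fraction(s)  # Fraction parses both '3/7' and decimal strings
    except (ValueError, ZeroDivisionError):
        return s.lower().replace(" ", "")

def mra(predictions, references):
    """Fraction of predictions matching the malrule answers after normalization."""
    assert len(predictions) == len(references)
    hits = sum(normalize(p) == normalize(r) for p, r in zip(predictions, references))
    return hits / len(references)

print(mra(["1/2", "0.75", "x+1"], ["2/4", "3/4", "X + 1"]))  # 1.0
```

Normalizing before comparison ensures that a model is not penalized for surface-form differences (e.g., unreduced fractions or decimal equivalents) when its underlying prediction matches the malrule answer.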
Evaluation involves sampling within templates (same-template) and across templates (cross-template), including both answer-only and trace-based prompts. The scale of experimentation includes 4,991–7,700 pairs per task and ∼35,000 inference calls per model, totaling ∼320,000 across nine LLMs (ranging from 1.3B to 120B parameters).
5. Results: Model Performance, Trace Supervision, and Transfer Challenges
Empirical results establish quantitative gaps between direct problem solving, described misconception simulation, and cross-template malrule modeling:
| Metric | Accuracy (%) | Degradation vs. CRA (%) |
|---|---|---|
| CRA (correct solving) | 65.7 | 0 |
| FMRA | 32.3 | –33.5 |
| MRA cross-template, answer | 40.5 | –25.3 |
| MRA cross-template, steps | 46.5 | –19.2 |
| Same-template MRA, steps | 64.6 | –1.1 |
- Providing malrule traces in the prompt improves cross-template MRA by 6 points on average, with model-specific lifts between +3 and +15 points.
- Accuracy drop from same-template to cross-template averages 15.6 points (answer-only) and 18.1 points (with steps), indicating that current LLMs rely heavily on template cues and underperform at abstract malrule generalization.
- Domain-wise, Functions are easiest (MRA ≈ 82%) and Coordinate Geometry hardest (≈ 29%), with substantial difficulty in Signed Numbers (35%), Expressions & Equations (45%), Fractions & Ratios (44%), and Statistics (45%).
This suggests a systematic limitation in cross-context misconception prediction for current language-model architectures.
6. Contributions, Applications, and Research Impact
MalruleLib’s principal contributions are:
- Learning-Science-Grounded Misconception Library—encoding 101 malrules traceable to 67 peer-reviewed sources, with coverage across 22 mathematical categories and 498 templates.
- Dual-Path Traces at Million-Instance Scale—enabling large-scale, trace-level supervision and data generation for model training.
- Benchmark for Cross-Template Misconception Prediction—introducing MRA as an operationalized “Educational Turing Test” (Sonkar et al. 2025), with controlled settings that reveal model limitations.
Applications include:
- Automated diagnosis of student misconceptions from partial work or single incorrect answers.
- Generation of feedback and targeted interventions based on inferred malrules, improving remediation strategies.
- Adaptive sequencing of instructional content, anticipating malrule triggers in varied contexts.
- Cross-context student modeling using executable malrules and parameterized problem formats.
- Extensible research infrastructure supporting curriculum design, cognitive modeling, and benchmarking for educational AI.
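As an illustration of the diagnosis application, executable malrules enable inverse simulation: run every candidate malrule on the student's problem and keep those that reproduce the observed wrong answer. The catalog entries and names below are hypothetical:

```python
from fractions import Fraction

# Hypothetical catalog: each candidate malrule is an executable procedure
# for fraction addition a/b + c/d.
CANDIDATE_MALRULES = {
    "add-across": lambda a, b, c, d: Fraction(a + c, b + d),
    "add-numerators-keep-first-denominator": lambda a, b, c, d: Fraction(a + c, b),
    "multiply-instead-of-add": lambda a, b, c, d: Fraction(a * c, b * d),
}

def diagnose(a, b, c, d, student_answer):
    """Return the malrules whose simulated output matches the student's answer."""
    return [name for name, rule in CANDIDATE_MALRULES.items()
            if rule(a, b, c, d) == student_answer]

# A student answers 1/2 + 1/3 = 2/5: consistent only with adding across.
print(diagnose(1, 2, 1, 3, Fraction(2, 5)))  # ['add-across']
```

In practice a single wrong answer may be consistent with several malrules, so a tutoring system would accumulate evidence across multiple instances before committing to a diagnosis.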
A plausible implication is that MalruleLib supports granular and scalable assessment of both student behaviors and AI model capabilities, enabling the community to advance toward AI systems that interpret, predict, and remediate mathematical misconceptions at the procedural level (Chen et al., 6 Jan 2026).