MalruleLib: Modeling Math Misconceptions

Updated 13 January 2026
  • MalruleLib is a Python framework that models 101 distinct mathematical malrules as executable procedures, capturing systematic student misconceptions.
  • It generates over one million paired instances using parameterized templates and dual-path traces to support scalable evaluation and remediation of errors.
  • The framework introduces the Malrule Reasoning Accuracy (MRA) benchmark, enabling precise diagnosis and prediction of misconception patterns in diverse math categories.

MalruleLib is a Python-based framework and research infrastructure for modeling systematic student misconceptions—referred to as malrules—in mathematics. Drawing on 67 learning-science and mathematics education sources, it operationalizes 101 distinct malrules as executable procedures, aligning them to parameterized problem templates and generating paired traces for both correct and misconception-driven student reasoning. MalruleLib supports controlled evaluation and scalable data generation for the diagnosis, prediction, and remediation of mathematical misconceptions using both educational AI systems and LLMs (Chen et al., 6 Jan 2026).

1. Conceptual Foundations: Malrules and Systematic Error

MalruleLib is grounded in empirical studies from mathematics education, which establish that student mistakes often follow coherent, repeatable procedures rather than being random slips. These systematic patterns—malrules, procedural bugs, or misconceptions—are documented in sources such as Brown and Burton (1978) and Siegler et al. (2012). For example, a student might repeatedly apply the erroneous procedure $\frac{1}{2} + \frac{1}{3} = \frac{2}{5}$ by adding numerators and denominators, exhibiting consistency across symbolic, contextualized, and word-problem formats.
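The same misconception can be written down as an executable procedure. The sketch below is illustrative only (the function names are not MalruleLib's actual API); it shows how the "add-across" bug, once encoded, produces the same systematic wrong answer on any instance:

```python
from fractions import Fraction

def correct_fraction_add(n1, d1, n2, d2):
    """Correct procedure: add via a common denominator."""
    return Fraction(n1, d1) + Fraction(n2, d2)

def malrule_fraction_add(n1, d1, n2, d2):
    """Add-across bug: add numerators and denominators independently."""
    return Fraction(n1 + n2, d1 + d2)

# The bug is systematic, not a random slip:
assert malrule_fraction_add(1, 2, 1, 3) == Fraction(2, 5)   # instead of 5/6
assert correct_fraction_add(1, 2, 1, 3) == Fraction(5, 6)
```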

This observation motivates the design of intelligent tutoring systems and educational AI models that can infer which malrule a student applies and predict how that malrule manifests on novel problems. Conventional error catalogs describe what mistakes occur but lack operational, procedure-driven representations that enable simulation, diagnosis, or prediction at scale. MalruleLib addresses this gap by encoding malrules as Python modules, allowing them to be “run” on arbitrary instances and traced step by step.

2. Formal Structure and Executable Representation

MalruleLib implements $|M| = 101$ malrules as modular Python packages, each comprising four components:

  • problem_generator.py: Defines a set of parameterized templates $T_m = \{t_1, \ldots, t_K\}$ for the malrule.
  • correct_algorithm.py: Outputs the step-by-step correct solution and final answer $a_c(i)$ for a given instance $i$.
  • malrule_algorithm.py: Outputs the step-by-step malrule-consistent solution, including the trace $S_m(i)$ and final malrule answer $a_m(i)$.
  • test_malrule.py: Contains unit tests to ensure correctness under all templates.
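As a concrete illustration of what a malrule_algorithm-style component might compute, here is a sketch of the classic "smaller-from-larger" subtraction bug with a step trace; the function name and trace format are assumptions, not MalruleLib's real module interface:

```python
def malrule_subtract_with_trace(a, b):
    """Return a step trace S_m(i) and malrule answer a_m(i): in each
    column, subtract the smaller digit from the larger, never borrowing."""
    da = str(a)
    db = str(b).rjust(len(da), "0")   # pad the subtrahend with leading zeros
    steps, digits = [], []
    for x, y in zip(da, db):
        d = abs(int(x) - int(y))
        steps.append(f"column {x} vs {y}: |{x} - {y}| = {d}")
        digits.append(str(d))
    return steps, int("".join(digits))

steps, answer = malrule_subtract_with_trace(52, 17)
assert answer == 45    # the correct answer is 35; the bug yields 45
```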

Templates parameterize families of problems with variables $\theta$ (numbers, word-problem particulars, etc.), sampling instances $i \sim t$ that satisfy constraints specific to the malrule (e.g., subtraction templates requiring borrowing). This structure allows precise control over the elicitation of malrules.
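A constraint of this kind can be enforced with rejection sampling. The template below is hypothetical (its parameterization and constraint logic are assumptions, not MalruleLib's code); it samples $\theta = (a, b)$ for two-digit subtraction so that a borrow is always required, ensuring the borrowing-related malrule is actually elicited:

```python
import random

def sample_borrow_subtraction(rng):
    """Sample a > b where the ones digit of a is smaller than that of b,
    so column subtraction requires borrowing."""
    while True:
        a = rng.randint(20, 99)
        b = rng.randint(10, a - 1)
        if a % 10 < b % 10:        # constraint: borrow needed in ones column
            return a, b

rng = random.Random(0)
a, b = sample_borrow_subtraction(rng)
assert a > b and a % 10 < b % 10
```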

Notation for core entities:

  • $i_s$, $i_t$: source and target instances.
  • $a_c(i)$: correct answer.
  • $a_m(i)$: malrule answer.
  • $S_m(i)$: sequence of malrule-consistent reasoning steps on $i$.

3. Data Generation: Templates, Dual-Path Traces, and Scalability

MalruleLib spans 498 parameterized templates distributed across 22 mathematical categories, with coverage including:

  • 54 malrules in Number Operations (whole numbers, fractions, decimals, signed numbers),
  • 37 in Algebra (exponents, radicals, expressions, equations, functions),
  • 8 in Geometry & Measurement,
  • 4 in Data & Modeling (statistics, word problems).

Each malrule averages 4.9 templates, classified across scaffold levels (basic formula 18.5%, structural variants 50.8%, contextualized 6.2%, word problems 24.5%) and context domains (e.g., symbolic: 63%, plus domains such as measurement, money, science, sports).

The trace-generation engine iterates over malrules, templates, and admissible parameter settings, producing for each instance the record $\langle i, \text{steps}_c, a_c, \text{steps}_m, a_m, m, t, \theta \rangle$. With thousands of parameter combinations per template, MalruleLib produces over one million distinct paired instances, each containing both the correct and the misconception-driven solution path. This dual-path data enables targeted supervision and benchmarking for both model training and evaluation.
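The sweep can be sketched as a loop over a template's parameter grid that emits dual-path records. This is a minimal stand-in, not MalruleLib's engine: the add-across fraction malrule plays the role of a real malrule module, and the record layout mirrors the tuple described above:

```python
from fractions import Fraction
from itertools import product

def dual_path_records():
    """Enumerate one hypothetical template's parameter grid and emit
    <theta, steps_c, a_c, steps_m, a_m, malrule_id, template_id> records."""
    records = []
    for n1, d1, n2, d2 in product(range(1, 4), range(2, 5),
                                  range(1, 4), range(2, 5)):
        if d1 == d2:            # template constraint: unlike denominators only
            continue
        theta = {"n1": n1, "d1": d1, "n2": n2, "d2": d2}
        a_c = Fraction(n1, d1) + Fraction(n2, d2)
        steps_c = [f"rewrite over common denominator {d1 * d2}",
                   f"result {a_c}"]
        a_m = Fraction(n1 + n2, d1 + d2)
        steps_m = [f"add numerators: {n1 + n2}",
                   f"add denominators: {d1 + d2}"]
        records.append((theta, steps_c, a_c, steps_m, a_m,
                        "add_across", "two_unlike_fractions"))
    return records
```

Even this tiny grid (3 × 3 × 3 × 3 parameter choices minus the like-denominator cases) yields 54 paired records; real templates with larger ranges scale the same loop to the million-instance regime.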

4. Benchmark Definition: Malrule Reasoning Accuracy (MRA) and Evaluation Tasks

MalruleLib formalizes the student-modeling challenge as Malrule Reasoning Accuracy (MRA), defined as follows:

  • Given: An unknown malrule $m \in M$, a source instance $i_s$ from template $t_1 \in T_m$ (with either the final malrule answer $a_m(i_s)$ or the full trace $S_m(i_s)$), and a target instance $i_t$ (from the same or a different template).
  • Task: Predict the malrule answer $a_m(i_t)$.
  • Metric: Accuracy—the fraction of predictions matching $a_m(i_t)$ under normalized algebraic or numeric comparison.

Variants include:

  • Forward MRA (FMRA): Given a natural-language description $D(m)$ of the malrule and a target instance $i$, predict $a_m(i)$.
  • Correct Reasoning Accuracy (CRA): Baseline prediction of the correct answer $a_c(i)$.

Evaluation involves sampling within templates (same-template) and across templates (cross-template), including both answer-only and trace-based prompts. The scale of experimentation includes 4,991–7,700 pairs per task and ∼35,000 inference calls per model, totaling ∼320,000 across nine LLMs (ranging from 1.3B to 120B parameters).
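The metric itself reduces to accuracy under answer normalization, so that equivalent forms such as "2/5" and "0.4" count as a match. The normalization scheme below is an assumption for illustration, not the paper's exact comparison code:

```python
from fractions import Fraction

def normalize(ans):
    """Map an answer (string or number) to a canonical Fraction when
    possible; fall back to the stripped string otherwise."""
    if isinstance(ans, Fraction):
        return ans
    s = str(ans).strip()
    try:
        if "/" in s:
            num, den = s.split("/")
            return Fraction(int(num), int(den))
        return Fraction(s)          # handles "0.4", "3", "-7", etc.
    except (ValueError, ZeroDivisionError):
        return s

def mra(predictions, gold):
    """Fraction of predicted malrule answers matching a_m(i_t)."""
    hits = sum(normalize(p) == normalize(g)
               for p, g in zip(predictions, gold))
    return hits / len(gold)

assert mra(["2/5", "0.5", "7"], ["0.4", "1/2", "8"]) == 2 / 3
```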

5. Results: Model Performance, Trace Supervision, and Transfer Challenges

Empirical results establish quantitative gaps between direct problem solving, described misconception simulation, and cross-template malrule modeling:

Metric                         Accuracy (%)   Degradation vs. CRA (%)
CRA (correct solving)          65.7           0
FMRA                           32.3           –33.5
MRA, cross-template, answer    40.5           –25.3
MRA, cross-template, steps     46.5           –19.2
MRA, same-template, steps      64.6           –1.1
  • Providing malrule traces in the prompt improves cross-template MRA by 6 points on average, with model-specific lifts between +3% and +15%.
  • Accuracy drop from same-template to cross-template averages 15.6 points (answer-only) and 18.1 points (with steps), indicating that current LLMs rely heavily on template cues and underperform at abstract malrule generalization.
  • Domain-wise, Functions are easiest (MRA ≈ 82%) and Coordinate Geometry hardest (≈ 29%), with substantial difficulty in Signed Numbers (35%), Expressions & Equations (45%), Fractions & Ratios (44%), and Statistics (45%).

This suggests a systematic limitation in cross-context misconception prediction for current language-model architectures.

6. Contributions, Applications, and Research Impact

MalruleLib’s principal contributions are:

  1. Learning-Science-Grounded Misconception Library—encoding 101 malrules traceable to 67 peer-reviewed sources, with coverage across 22 mathematical categories and 498 templates.
  2. Dual-Path Traces at Million-Instance Scale—enabling large-scale, trace-level supervision and data generation for model training.
  3. Benchmark for Cross-Template Misconception Prediction—introducing MRA as an operationalized “Educational Turing Test” (Sonkar et al. 2025), with controlled settings that reveal model limitations.

Applications include:

  • Automated diagnosis of student misconceptions from partial work or single incorrect answers.
  • Generation of feedback and targeted interventions based on inferred malrules, improving remediation strategies.
  • Adaptive sequencing of instructional content, anticipating malrule triggers in varied contexts.
  • Cross-context student modeling using executable malrules and parameterized problem formats.
  • Extensible research infrastructure supporting curriculum design, cognitive modeling, and benchmarking for educational AI.

A plausible implication is that MalruleLib supports granular and scalable assessment of both student behaviors and AI model capabilities, enabling the community to advance toward AI systems that interpret, predict, and remediate mathematical misconceptions at the procedural level (Chen et al., 6 Jan 2026).
