Malrule Reasoning Accuracy (MRA)

Updated 13 January 2026
  • MRA is a metric that quantifies a system’s ability to identify and predict error patterns (malrules) from a single worked mistake in student responses.
  • It employs a controlled evaluation framework using 101 malrules and nearly 500 templates to assess cross-template generalization and the benefits of using reasoning traces.
  • Empirical results reveal domain-dependent performance variations, highlighting the importance of stepwise supervision in diagnosing student mathematical thinking.

Malrule Reasoning Accuracy (MRA) quantifies the ability of a system—typically an LLM or reasoning engine—to infer a student's underlying misconception from a single worked mistake and to predict consistent errorful responses under context, template, and surface-form variation. Unlike standard correctness metrics, MRA isolates diagnostic reasoning about systematic erroneous procedures ("malrules") and operationalizes the challenge faced by educational AI in modeling, predicting, and, ultimately, diagnosing student mathematical thinking. MRA was formalized in the context of MalruleLib, a scalable infrastructure capturing 101 executable malrules and over a million parameterized mathematical problem instances, providing tightly controlled evaluation of both direct and cross-template misconception generalization (Chen et al., 6 Jan 2026).

1. Formal Definition and Mathematical Framework

MRA is formally defined as follows. Let $\mathcal{M}$ denote the set of malrules (with $|\mathcal{M}| = 101$). Each malrule $m \in \mathcal{M}$ is instantiated by a set of templates $\mathcal{T}_m$ (averaging 4.9 per malrule, 498 templates in total). For an instance $i \sim t$ (i.e., sampled from template $t$ by a parameter assignment), let $a_c(i)$ denote the correct final answer, $a_m(i)$ the malrule-induced answer, and $S_c(i)$, $S_m(i)$ the sequences of stepwise reasoning (traces) yielding $a_c(i)$ and $a_m(i)$ respectively.

Given:

  • Source instance $i_s$ with malrule answer $a_m(i_s)$ (and optionally malrule trace $S_m(i_s)$)
  • Target instance $i_t$

The predictive task is to generate $\hat{a}$: the answer the student would produce on $i_t$ under the same malrule $m$ demonstrated in $i_s$. The metric is the fraction of prompt pairs $(i_s, i_t)$ on which the model exactly predicts $a_m(i_t)$:

$$\mathrm{MRA} = \frac{1}{|P|} \sum_{(i_s, i_t) \in P} \mathbf{1}\left[\,\mathrm{model}(i_s, a_m(i_s), \dots, i_t) = a_m(i_t)\,\right]$$

Variants include: same-template answer-only, same-template with steps, cross-template answer-only, and cross-template with steps. In cross-template evaluation, $i_s \sim t_1$ and $i_t \sim t_2$ with $t_1 \ne t_2$, challenging models to generalize the error pattern under template-level distribution shift.
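The indicator-sum definition above can be sketched directly in Python. This is a minimal illustration, not MalruleLib's implementation: `model_predict`, the pair tuples, and the normalization rule are all assumptions chosen for clarity.

```python
# Minimal sketch of MRA: exact-match accuracy of malrule-consistent
# answer prediction over source/target prompt pairs.
# `model_predict` and the pair record format are hypothetical stand-ins.

def normalize(ans: str) -> str:
    """Canonicalize an answer string before exact-match comparison."""
    return ans.strip().lower().replace(" ", "")

def mra(pairs, model_predict):
    """pairs: iterable of (i_s, a_m(i_s), i_t, a_m(i_t)) tuples."""
    hits = total = 0
    for i_s, am_s, i_t, am_t in pairs:
        a_hat = model_predict(i_s, am_s, i_t)   # answer-only variant
        hits += normalize(a_hat) == normalize(am_t)
        total += 1
    return hits / total if total else 0.0

# Toy usage: a "model" that always echoes the source's wrong answer,
# so it scores only when the target's malrule answer happens to match.
pairs = [("7-3", "5", "9-4", "6"), ("7-3", "5", "8-2", "5")]
print(mra(pairs, lambda i_s, am_s, i_t: am_s))  # 0.5
```

The with-steps variants differ only in the prompt passed to the model; the scoring rule is unchanged.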

A related metric, Forward MRA (FMRA), assesses prediction when the model is instead given a natural-language malrule description $D(m)$ and tasked to apply it to new instances.

2. Experimental Methodology and Setup

MalruleLib operationalizes MRA using a controlled evaluation pipeline:

  • Worked Mistake Inference: The system receives a source solution ($i_s$, $a_m(i_s)$, optionally $S_m(i_s)$) and must infer which malrule $m$ generated the observed error, without direct access to $D(m)$.
  • Cross-Template Rephrasing: For each malrule, instances are sampled from multiple templates covering diverse algebraic forms, word problems, contextual domains (notational, money, measurement, temporal), and difficulty variants.
  • Paired Reasoning Traces: For every instance, the platform executes correct and malrule algorithms to produce dual-path traces, ensuring stepwise alignment for granular analysis.

The evaluation pool $P$ comprises several thousand source-target pairs systematically stratified by malrule and template conditions.

3. MalruleLib Infrastructure and Data Generation

MalruleLib encapsulates each malrule with the following modular structure:

  • problem_generator.py: parameterized problem templates $\mathcal{T}_m$
  • correct_algorithm.py: algorithmically faithful solution path (trace and answer)
  • malrule_algorithm.py: systematic erroneous solution path adhering to $m$
  • test_malrule.py: unit tests confirming dual-trace validity
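To make the dual-algorithm structure concrete, here is a sketch of one correct/malrule pair. The specific malrule shown ("smaller-from-larger" column subtraction, which ignores borrowing) and all function names are illustrative examples, not MalruleLib's actual code.

```python
# Illustrative correct/malrule algorithm pair for a classic subtraction
# malrule: subtract the smaller digit from the larger in every column,
# never borrowing. Both paths return (trace, answer), mirroring the
# paired-trace structure described above.

def correct_algorithm(a: int, b: int):
    """Faithful subtraction; returns (trace, answer)."""
    ans = a - b
    trace = [f"compute {a} - {b}", f"answer = {ans}"]
    return trace, ans

def malrule_algorithm(a: int, b: int):
    """Malrule: per column, take |digit_a - digit_b| (no borrowing).
    Example: 62 - 37 -> tens 6-3=3, ones |2-7|=5 -> 35 (correct: 25)."""
    width = max(len(str(a)), len(str(b)))
    da, db = str(a).zfill(width), str(b).zfill(width)
    trace, digits = [], []
    for x, y in zip(da, db):
        d = abs(int(x) - int(y))          # smaller-from-larger per column
        trace.append(f"column {x} vs {y}: |{x}-{y}| = {d}")
        digits.append(str(d))
    ans = int("".join(digits))
    trace.append(f"answer = {ans}")
    return trace, ans

print(correct_algorithm(62, 37)[1], malrule_algorithm(62, 37)[1])  # 25 35
```

A per-malrule unit test would simply assert that the two answers diverge on instances where the malrule is designed to fire.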

Template instantiation employs grade-banded parameter ranges and constraints that guarantee malrule triggering (e.g., enforcing borrowing in subtraction, guaranteeing distribution opportunities in algebra). Templates and parameterization yield a combinatorial explosion—exceeding one million distinct (problem, malrule, trace) triplets—enabling high-volume, supervised evaluation of both correct and malrule-consistent reasoning. All instances automatically yield paired step-traces for fine-grained diagnosis and trace-based prompting.
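One simple way to realize such constraints is rejection sampling: draw parameters until the trigger condition holds. The sketch below enforces the "borrowing required" constraint for a two-digit subtraction template; the ranges, names, and the rejection-sampling strategy itself are assumptions for illustration.

```python
import random

# Constraint-aware template instantiation (sketch): resample parameters
# until the malrule is guaranteed to fire, i.e., at least one column of
# the subtraction requires a borrow. Ranges and names are illustrative.

def requires_borrowing(a: int, b: int) -> bool:
    """True if column-wise subtraction of b from a (a > b) needs a borrow."""
    da = str(a)
    db = str(b).zfill(len(da))
    return any(int(x) < int(y) for x, y in zip(da, db))

def sample_subtraction_instance(lo=10, hi=99, rng=random):
    """Rejection-sample (a, b) with a > b and at least one borrow."""
    while True:
        a, b = rng.randint(lo, hi), rng.randint(lo, hi)
        if a > b and requires_borrowing(a, b):
            return a, b

a, b = sample_subtraction_instance()
print(a, b)  # e.g. 62 37 -- any pair where the no-borrow malrule diverges
```

Rejection sampling is wasteful for tight constraints; a production generator would more likely construct satisfying parameters directly, but the contract is the same: every emitted instance triggers the malrule.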

4. Empirical Observations and Performance Analysis

A suite of nine LLMs (4B–120B parameters) was assessed via MalruleLib benchmarks under multiple experimental regimes. Key results are summarized below.

| Condition | MRA (%) | Δ vs. CRA |
|---|---|---|
| Correct Reasoning Accuracy (CRA) | 65.7 | baseline |
| Same-template, answer-only | 56.1 | –9.7 |
| Cross-template, answer-only | 40.5 | –25.3 |
| Same-template, with-steps | 64.6 | –1.1 |
| Cross-template, with-steps | 46.5 | –19.2 |
| Forward MRA (FMRA) | 32.3 | –33.5 |

  • Cross-template degradation: On average, MRA drops 15.6 points (56.1→40.5) when moving from same-template to cross-template prediction.
  • Step-trace benefit: Supplying malrule-consistent reasoning traces improves cross-template MRA by approximately 6 points (40.5→46.5), with gains varying by model (3–15 points).
  • Domain variance: MRA achieves maximal accuracy in Functions (≈82%) and minimal in Coordinate Geometry (≈29%), reflecting a >50 point spread and indicating domain-dependent diagnostic reliability. Systems should therefore calibrate their feedback mechanisms by problem domain.
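The domain-variance finding implies that a single aggregate MRA number can hide large per-domain gaps. A hypothetical stratified scorer might look like this; the record format is an assumption, and the toy numbers below are not the paper's results.

```python
from collections import defaultdict

# Domain-stratified scoring (sketch): given per-pair outcomes tagged
# with a content domain, report MRA per domain so that spreads like
# Functions vs. Coordinate Geometry become visible.

def mra_by_domain(records):
    """records: iterable of (domain, is_correct) -> {domain: accuracy}."""
    hits, totals = defaultdict(int), defaultdict(int)
    for domain, ok in records:
        hits[domain] += bool(ok)
        totals[domain] += 1
    return {d: hits[d] / totals[d] for d in totals}

# Toy records, not benchmark data:
records = [("Functions", True), ("Functions", True),
           ("Coordinate Geometry", False), ("Coordinate Geometry", True)]
print(mra_by_domain(records))
# {'Functions': 1.0, 'Coordinate Geometry': 0.5}
```

Reporting this breakdown alongside aggregate MRA is what allows the domain-specific calibration recommended above.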

5. Algorithms, Evaluation Protocols, and Sampling Procedures

Canonical pseudocode is provided to ensure reproducibility and clarity in both trace generation and metric computation.

Dual-path trace generation:

for each malrule m in 𝓜:
  for each template t in 𝓣ₘ:
    for param_set in sample_parameter_assignments(t):
      i = t.instantiate(param_set)
      (S_c, a_c) = correct_algorithm_m(i)
      (S_m, a_m) = malrule_algorithm_m(i)
      store_instance(i, S_c, a_c, S_m, a_m)

Computing MRA:

correct = 0
for (i_s, i_t) in P:
  # answer-only: prompt = [describe i_s, a_m(i_s), describe i_t]
  # with-steps: prompt = [i_s, a_m(i_s), S_m(i_s), i_t]
  a_hat = model.generate_answer(prompt)
  if normalize(a_hat) == normalize(a_m(i_t)):
    correct += 1
MRA = correct / |P|

Cross-template pair sampling:

P = ∅
for each malrule m:
  let templates = 𝓣ₘ
  if |templates| ≥ 2:
    for k in 1..100:
      t1, t2 = random.sample(templates, 2)
      i_s = sample_instance(t1)
      i_t = sample_instance(t2)
      P.add((i_s, i_t))

This controlled experimental pipeline ensures fair, replicable, and context-sensitive evaluation of reasoning under both correct and misprocedural student models.

6. Significance, Limitations, and Broader Connections

Malrule Reasoning Accuracy (MRA) provides a diagnostic “Turing Test” for educational modeling: the challenge is not merely to solve mathematical problems, but to infer the hidden, systematic procedure behind student mistakes and extrapolate errorful behavior in novel contexts. Empirical findings show substantial gaps—up to 25 percentage points on cross-template variants—between direct problem-solving and error-pattern modeling. The provision of intermediate student reasoning traces mitigates these drops, indicating the critical value of stepwise supervision for effective misconception modeling.

MalruleLib’s infrastructure—combining executable malrules, extensive template parameterization, and paired reasoning traces—enables scalable analysis of both standard and misconception-driven AI reasoning. The diversity and scale of templates, together with automated dual-path supervision, underpin robust measurement and reveal domain-dependent weaknesses crucial for tutoring system deployment.

A plausible implication is that the integration of trace-level supervision and template variety should become a standard in assessing educational models’ diagnostic capabilities. The observed template and domain sensitivities suggest future work may benefit from the formulation of adaptive feedback systems and domain-specific model calibration.

7. Future Directions and Research Implications

Potential advances include expanding malrule coverage, refining template parameterization for finer contextual variation, and integrating causal granularity inspired by multi-aspect reasoning evaluation frameworks (Do et al., 23 Oct 2025). Extension of MRA to cross-domain, multi-step reasoning outside mathematics—using parallel criteria of relevance, coherence, and stepwise causal evaluation—may yield further diagnostic metrics for AI systems in broader educational and cognitive applications.

In sum, MRA operationalizes large-scale, context-sensitive diagnosis of student misconceptions, differentiating simple correctness from the challenging task of error-pattern inference and extrapolation. The MalruleLib framework establishes both the formal groundwork and empirical evidence base for rigorous educational AI evaluation.
