AITutor-EvalKit: AI Tutor Evaluation Toolkit
- AITutor-EvalKit is a comprehensive toolkit for evaluating AI tutors through quantitative metrics and qualitative analysis, including conditional distractor generation and TEACH-AI alignment.
- The framework employs a two-phase, Turing-like evaluation method with paired statistical tests and behavioral benchmarks (e.g., action accuracy, hint quality) to compare AI tutoring performance with expert standards.
- Its modular design integrates data collection, model interfacing, expert annotation, and real-time visualization, ensuring scalable, ethical, and continuous improvement in AI tutoring evaluations.
AITutor-EvalKit is a comprehensive, modular software toolkit for evaluating the pedagogical capabilities and effectiveness of AI-driven tutoring systems. Its core mandate is to enable rigorous, multi-dimensional assessment of AI tutors against benchmarks paralleling expert human instruction and pedagogical quality. The framework encompasses statistical, behavioral, and workflow-oriented methodologies, supporting both quantitative and qualitative validation at scale. It is underpinned by protocols such as conditional distractor generation, in situ benchmarking using real intelligent tutoring systems (ITSs), ensemble model evaluation, and the ten-component TEACH-AI framework for sociotechnical and educational robustness (Sonkar et al., 21 Feb 2025, Weitekamp et al., 2 May 2025, Hikal et al., 24 May 2025, Naeem et al., 3 Dec 2025, Ding et al., 28 Nov 2025).
1. Protocols for Evaluating Student Modeling and Reasoning
AITutor-EvalKit operationalizes a two-phase Turing-like evaluation paradigm designed to probe whether a candidate AI tutor robustly models individual student cognition. In Phase 1, students generate open-ended responses to a bank of questions, surfacing authentic misconceptions otherwise elided by multiple-choice formats. Incorrect responses are logged as tuples (student, question, answer) reflecting each student’s unique error vector.
Phase 2 then tests the tutor’s ability to predict the same student’s next likely mistake. For every Phase 1 error, the model’s question-generation and distractor functions produce a new, related question and an AI-generated distractor conditioned on that specific error, while human experts craft a competing distractor for the same question. Empirical student selection rates for the two distractors are compared using paired z-tests and McNemar’s test, controlling for random guessing and within-student correlations. Sample-size planning leverages Theorem 3 (Misconception Concentration) and Theorem 4 (Sample Complexity), which bound the number of Phase 1 responses required in terms of the number of target misconceptions and the desired error probability (Sonkar et al., 21 Feb 2025).
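A minimal sketch of the paired comparison, assuming hypothetical boolean arrays recording which students selected the AI versus the human distractor (the toolkit’s actual statistical pipeline may differ):

```python
import numpy as np
from statsmodels.stats.contingency_tables import mcnemar

# Hypothetical paired outcomes: for each student, whether they selected the
# AI-generated and/or the human-crafted distractor on the matched Phase 2 item.
picked_ai = np.array([1, 0, 1, 1, 0, 1, 0, 1], dtype=bool)
picked_human = np.array([1, 1, 0, 1, 0, 0, 0, 1], dtype=bool)

# 2x2 contingency table of paired outcomes for McNemar's test.
table = np.array([
    [np.sum(picked_ai & picked_human), np.sum(picked_ai & ~picked_human)],
    [np.sum(~picked_ai & picked_human), np.sum(~picked_ai & ~picked_human)],
])
result = mcnemar(table, exact=True)  # exact binomial test on the discordant pairs
print(f"McNemar p-value: {result.pvalue:.3f}")

# Paired z-test on the per-student difference in selection indicators
# (normal approximation; the toolkit additionally controls for random guessing).
diff = picked_ai.astype(float) - picked_human.astype(float)
z = diff.mean() / (diff.std(ddof=1) / np.sqrt(len(diff)))
print(f"Paired z statistic: {z:.2f}")
```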
2. Metrics and Benchmarks for AI Tutor Behaviors
The toolkit incorporates standardized metrics adapted from ITS literature and shared tasks (a minimal scoring sketch follows this list):
- Next-step action accuracy: proportion of tutor actions matching the reference (expert/ITS) next step
- Incorrect-action detection precision/recall
- Hint Quality Score: normalized overlap between AI and gold-standard hints
- Feedback Accuracy: correspondence to expert/ITS labels
- Learning-curve alignment: root-mean-square deviation between AI and human error trajectories
- Macro-F1 on pedagogical dimensions: strict (3-way) and lenient (binary), averaged over classes
- Area Under the Learning Curve (AULC) for simulated learner performance (Weitekamp et al., 2 May 2025, Hikal et al., 24 May 2025).
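A minimal sketch of a few of these metrics on hypothetical labels, assuming a Jaccard-style normalization for hint overlap and a Yes / To some extent / No label scheme (the toolkit’s exact definitions may differ):

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Next-step action accuracy: fraction of tutor actions matching the reference next step.
ref_actions = ["hint", "correct", "hint", "example", "correct"]
ai_actions  = ["hint", "hint", "hint", "example", "correct"]
action_acc = accuracy_score(ref_actions, ai_actions)

# Hint Quality Score: normalized token overlap between AI and gold-standard hints
# (Jaccard overlap is one plausible normalization, used here for illustration).
def hint_quality(ai_hint: str, gold_hint: str) -> float:
    ai_tok, gold_tok = set(ai_hint.lower().split()), set(gold_hint.lower().split())
    return len(ai_tok & gold_tok) / max(len(ai_tok | gold_tok), 1)

# Strict (3-way) vs. lenient (binary) macro-F1 on one pedagogical dimension;
# the lenient mapping collapses "To some extent" into "Yes" (an assumption).
gold = ["Yes", "No", "To some extent", "Yes", "No"]
pred = ["Yes", "To some extent", "To some extent", "No", "No"]
strict_f1 = f1_score(gold, pred, average="macro")
collapse = lambda ys: ["Yes" if y != "No" else "No" for y in ys]
lenient_f1 = f1_score(collapse(gold), collapse(pred), average="macro")

# Area Under the Learning Curve (AULC): trapezoidal area of success rate over practice opportunities.
success_rate = np.array([0.35, 0.50, 0.62, 0.71, 0.78])
aulc = float(np.sum((success_rate[:-1] + success_rate[1:]) / 2.0))

print(action_acc, hint_quality("try isolating x first", "isolate x on one side"),
      strict_f1, lenient_f1, aulc)
```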
Empirical findings indicate current LLM-based tutors attain 52–70% next-step action accuracy and exhibit limited capacity for precise error identification (~50% recall). By contrast, simulated learners demonstrate human-like error curves via in-context learning, though true skill acquisition remains unverified (Weitekamp et al., 2 May 2025).
3. Software Architecture and Implementation
AITutor-EvalKit consists of five primary modules:
- Data Collection Service: UI for administering both open-ended and conditional questions, secure database (student/response IDs)
- Model Interface Layer: wrappers for querying LLMs (GPT-5, Llama, Prometheus2) to generate follow-up questions/distractors
- Expert Annotation Pipeline: interfaces for teacher-crafted distractors and auditing AI outputs
- Statistical Analysis Engine: implements sample-size planning, confidence intervals, effect-size estimation, hypothesis testing
- Visualization/Reporting Dashboard: real-time progress tracking, interactive plots (error bars, spider charts), exportable reports (Sonkar et al., 21 Feb 2025, Naeem et al., 3 Dec 2025)
The core evaluation loop, covering both student conditioning and statistical analysis, is documented in pseudocode (a Python sketch follows). Modules are containerized (Docker) and support cloud deployment, a pilot-study dry-run mode, and reconfiguration for new curricula and domains.
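A minimal Python sketch of the Phase 2 conditioning loop, with the generation and administration steps injected as callables since the toolkit’s actual module interfaces are not shown here:

```python
from typing import Callable, Iterable, NamedTuple

class ErrorEvent(NamedTuple):
    """One Phase 1 error tuple: (student, question, incorrect answer)."""
    student_id: str
    question_id: str
    answer: str

def run_phase2(
    phase1_errors: Iterable[ErrorEvent],
    gen_question: Callable[[ErrorEvent], str],
    gen_ai_distractor: Callable[[ErrorEvent, str], str],
    gen_human_distractor: Callable[[ErrorEvent, str], str],
    administer: Callable[[str, str, str, str], str],
) -> list[tuple[str, str, bool, bool]]:
    """For each logged error, generate a conditioned follow-up question plus AI and
    human distractors, administer the item, and record which distractor was chosen."""
    records = []
    for err in phase1_errors:
        question = gen_question(err)
        ai_d = gen_ai_distractor(err, question)
        human_d = gen_human_distractor(err, question)
        choice = administer(err.student_id, question, ai_d, human_d)
        records.append((err.student_id, err.question_id, choice == ai_d, choice == human_d))
    # The resulting records feed the Statistical Analysis Engine (paired tests above).
    return records
```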
4. Multi-Dimensional Pedagogical Assessment
Building on the BEA 2025 Shared Task and the SMR taxonomy, AITutor-EvalKit operationalizes four pedagogical axes with ternary labels (Yes / To some extent / No):
- Mistake Identification
- Mistake Location
- Providing Guidance
- Actionability
Evaluators (automated LLM judges, human teachers, ensemble judges) rate tutor responses to student dialogue episodes. Automated scoring achieves up to 0.72 accuracy and 0.60 macro-F1, surpassing several human-aligned LLM judges (e.g., Prometheus2 at 0.47/0.41, GPT-5 at 0.66/0.57), especially in the Mistake Identification and Actionability tracks (Naeem et al., 3 Dec 2025). Disagreement-aware ensemble inference improves coverage of minority labels and corrects majority-class bias by aligning output distributions to dev splits (Hikal et al., 24 May 2025). Summary tables report component-wise metrics (strict macro-F1, accuracy, per-dimension ranking).
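One plausible reading of the disagreement-aware ensemble with distribution alignment, sketched on hypothetical judge scores (not the authors’ exact algorithm):

```python
import numpy as np

LABELS = ["Yes", "To some extent", "No"]

def ensemble_predict(judge_scores: list[dict[str, float]],
                     dev_prior: dict[str, float]) -> str:
    """Average per-label scores across judges, then reweight by the dev-split
    label distribution to counter majority-class bias on minority labels."""
    avg = {lab: float(np.mean([s[lab] for s in judge_scores])) for lab in LABELS}
    adjusted = {lab: avg[lab] * dev_prior[lab] for lab in LABELS}
    return max(adjusted, key=adjusted.get)

# Two hypothetical judges disagreeing on a response for the Actionability dimension.
judges = [
    {"Yes": 0.70, "To some extent": 0.20, "No": 0.10},
    {"Yes": 0.35, "To some extent": 0.45, "No": 0.20},
]
dev_prior = {"Yes": 0.45, "To some extent": 0.35, "No": 0.20}
print(ensemble_predict(judges, dev_prior))
```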
Qualitative dashboards enable inspection of tutor ratings, side-by-side tutor comparisons, spider/bar charts, and export of aggregate annotation data for further model tuning or human-in-the-loop retraining.
5. TEACH-AI Alignment: Holistic, Sociotechnical Evaluation
AITutor-EvalKit integrates the TEACH-AI framework—a ten-component rubric encompassing both technical and sociotechnical factors:
- Explainability: user/trace/fidelity ratings
- Helpfulness
- Adaptivity
- Consistency
- Learning Exploration: CSI questionnaire
- System Usability: SUS score
- Responsibility & Ethics
- Accessibility: WCAG compliance level
- Workflow Coordination
- Refinement
For each dimension, detailed rubrics are provided (e.g., thresholds, application methodology), supporting stakeholder-aligned audit procedures for compliance, accountability, and inclusivity (Ding et al., 28 Nov 2025).
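A minimal data-structure sketch of a TEACH-AI scorecard, with a standard SUS computation for the System Usability component; the component names follow the list above, but the scoring schema and thresholds are illustrative assumptions:

```python
from dataclasses import dataclass, field

TEACH_AI_COMPONENTS = [
    "Explainability", "Helpfulness", "Adaptivity", "Consistency",
    "Learning Exploration", "System Usability", "Responsibility & Ethics",
    "Accessibility", "Workflow Coordination", "Refinement",
]

def sus_score(responses: list[int]) -> float:
    """Standard System Usability Scale: 10 Likert items (1-5); odd items scored
    as (x - 1), even items as (5 - x), summed and scaled by 2.5 to a 0-100 range."""
    assert len(responses) == 10
    contributions = [(x - 1) if i % 2 == 0 else (5 - x) for i, x in enumerate(responses)]
    return sum(contributions) * 2.5

@dataclass
class TeachAIScorecard:
    # Illustrative 0-1 scores per component; the published rubrics define their own scales.
    scores: dict[str, float] = field(default_factory=dict)

    def complete(self) -> bool:
        return all(c in self.scores for c in TEACH_AI_COMPONENTS)

card = TeachAIScorecard()
card.scores["System Usability"] = sus_score([4, 2, 5, 1, 4, 2, 5, 2, 4, 1]) / 100.0
print(card.scores, card.complete())
```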
6. Data Annotation, Visualization, and Continuous Improvement
AITutor-EvalKit embeds annotation workflows for collecting ratings of tutor helpfulness, comparative performance, and free-form feedback. These annotations can be exported for model retraining and refinement, closing the loop for ongoing educational impact. Visualization tools provide spider/radar plots, dataset summaries, and inter-judge comparisons for evaluating agreement and surfacing inconsistencies between automated and human judges.
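A minimal sketch of the inter-judge agreement check, assuming per-response labels from an automated judge and a human teacher; Cohen’s kappa is one standard agreement statistic, though the dashboard may report others:

```python
from sklearn.metrics import cohen_kappa_score, confusion_matrix

LABELS = ["Yes", "To some extent", "No"]

# Hypothetical parallel annotations of the same tutor responses.
llm_judge   = ["Yes", "Yes", "No", "To some extent", "Yes", "No"]
human_judge = ["Yes", "To some extent", "No", "To some extent", "No", "No"]

kappa = cohen_kappa_score(llm_judge, human_judge, labels=LABELS)
disagreements = confusion_matrix(human_judge, llm_judge, labels=LABELS)
print(f"Cohen's kappa: {kappa:.2f}")
print(disagreements)  # rows: human labels, columns: LLM labels
```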
Periodic re-estimation of population misconception distributions, continuous adaptation of model prompts and evaluation templates, and inclusion of new pedagogical axes (via the modular instruction pipeline) position the kit for long-term extension across domains, grade levels, languages, and instructional settings (Naeem et al., 3 Dec 2025, Hikal et al., 24 May 2025).
7. Deployment, Ethical Safeguards, and Limitations
The toolkit is released under the MIT license (source: github.com/kaushal0494/AITutor-EvalKit). Installation is managed via pip for the backend modules and npm for the frontend, with plug-and-play support for importing new datasets or swapping model backends.
Ethical protocols include informed consent, anonymized data storage, FERPA/GDPR compliance, equity audits on model fairness, privacy checklists, and accessibility reviews per WCAG standards. Current limitations include domain and grade specificity (middle-school mathematics, English-language content, mistake-remediation scenarios), single-turn dialogue context, and reliance on existing annotation taxonomies. Future expansions aim to address broader subjects, multi-turn dialogue, multilingual contexts, and integration with classroom workflows (Ding et al., 28 Nov 2025, Naeem et al., 3 Dec 2025).
Through its layered statistical design, modular APIs, multi-axis pedagogical evaluation, and robust ethical safeguards, AITutor-EvalKit constitutes a comprehensive, reproducible framework for certifying the educational validity and equity of AI tutoring systems, enabling systematic benchmarking, stakeholder engagement, and iterative model improvement.