Self-Explanations in AI & Education
- Self-explanations are on-the-fly, instance-level justifications generated during decision-making by models and human learners, enhancing clarity and accountability.
- They are integrated within model architectures like GNNs and LLMs to deliver faithful, causal explanations that improve predictions, generalization, and interpretability.
- Applications range from improving user trust and robustness in machine learning systems to boosting retention and transfer in educational contexts through structured, self-directed explanations.
Self-explanations (SEs) are model- or agent-generated explanations of individual decisions, predictions, or reasoning steps produced contemporaneously with inference or training. SEs arise across machine learning, LLMs, graph neural networks, and education, operating at both algorithmic and cognitive levels. Unlike post hoc explanations, SEs are inherently intertwined with model computation and can be directly harnessed to augment learning, generalization, interpretability, and user trust.
1. Formal Definitions, Scope, and Historical Context
Self-explanations (SEs) are instance-level explanations generated by a model (or human learner) to justify, articulate, or clarify its output, often on a per-decision basis. The defining property of SEs is their “on-the-fly” construction by the decision-maker itself, as opposed to post hoc, model-agnostic surrogate explanations. In deep learning, SEs are now pervasive in natural language (chain-of-thought rationales, token rationales, counterfactual edits), computer vision (masking, heatmaps), and relational domains (node/edge masks in GNNs) (Huang et al., 2024, Huang et al., 2023, Hosseini et al., 2020, Fragkathoulas et al., 2024, Bassan et al., 5 Feb 2025).
Representative Forms
- Feature-level rationales: Binary or graded selections of input tokens, pixels, or edges as justification for predictions (Huang et al., 2023, Brandl et al., 2024).
- Free-text/extractive rationales: Human-readable natural language chains of reasoning or extracted salient input spans (Dehghanighobadi et al., 25 Feb 2025, Randl et al., 2024, Brandl et al., 2024).
- Counterfactual explanations: Minimal input modifications designed so the model’s prediction changes to a target class (Randl et al., 2024, Dehghanighobadi et al., 25 Feb 2025).
- Minimal sufficient reasons: The smallest set of feature values that guarantees the model's output is preserved under any assignment of the remaining inputs (Bassan et al., 5 Feb 2025); an empirical check of this sufficiency property is sketched after this list.
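To make the sufficiency property concrete, the sketch below probes whether a candidate feature subset empirically behaves as a sufficient reason by holding those features fixed and sampling random completions of the rest. The `predict_fn` interface, the uniform sampling of free features, and the sample count are illustrative assumptions; the cited work defines sufficiency over all completions, not a finite sample.

```python
import numpy as np

def is_empirically_sufficient(predict_fn, x, subset, feature_ranges,
                              n_samples=1000, rng=None):
    """Probe whether the features in `subset` empirically fix the prediction.

    predict_fn     : maps a batch of inputs (n, d) to predicted class labels (n,)
    x              : the instance being explained, shape (d,)
    subset         : indices of the candidate sufficient reason
    feature_ranges : (low, high) arrays used to sample the remaining features
    """
    rng = rng or np.random.default_rng(0)
    d = x.shape[0]
    free = np.setdiff1d(np.arange(d), subset)

    base_label = predict_fn(x[None, :])[0]

    # Fix the explanation features, randomize everything else.
    samples = np.tile(x.astype(float), (n_samples, 1))
    low, high = feature_ranges
    samples[:, free] = rng.uniform(low[free], high[free], size=(n_samples, len(free)))

    preserved = predict_fn(samples) == base_label
    # 1.0 means the subset behaved as a sufficient reason on all sampled completions.
    return preserved.mean()
```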
In education, SEs refer to learners’ self-directed generation of explanations integrating new content with prior schema, widely studied in mathematics and statistics learning for their effect on retention and transfer (Gao et al., 25 Mar 2025, Gao et al., 20 Aug 2025).
The paradigm of SEs emerged from both cognitive science (self-explanation effect—Chi et al., 1989) and explainable AI, evolving from the study of introspective reasoning in humans to algorithmic mechanisms in contemporary models.
2. Self-Explanations in Machine Learning Architectures
2.1 Neural Networks with Built-in SE Mechanisms
Modern “self-explaining” architectures embed explanation mechanisms in the network itself, allowing the model to output both predictions and associated mask or rationale vectors. Notable frameworks include:
- SES (Self-Explained and Self-Supervised GNNs): Combines a backbone GNN encoder with a global mask generator to produce node-wise feature and edge masks during training, ensuring explanations are causally aligned with the model’s own message passing (Huang et al., 2024).
- Sufficient Subset Training (SST): Augments classifiers with an additional explanation head that produces continuous mask vectors at inference time. Training enforces that the input masked by the explanation preserves the prediction while using minimal information, optimizing jointly for faithfulness and conciseness (Bassan et al., 5 Feb 2025); a loss sketch in this spirit follows the list.
- Learning by Self-Explanation (LeaSE, LSX): Employs a two-network loop (explainer and critic/audience). The explainer generates explanations that are used to teach an auxiliary model (the audience or critic), and feedback from the critic iteratively refines the explainer, yielding improved generalization and interpretability (Hosseini et al., 2020, Stammer et al., 2023).
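The sketch below illustrates, in PyTorch, the general shape of a self-explaining classifier in the spirit of SST: a shared encoder feeds a prediction head and an explanation head that emits a continuous input mask, and the loss combines task accuracy, agreement between the full and masked predictions, and an l1-style sparsity penalty. Layer sizes, the masking scheme, and loss weights are assumptions for illustration, not the published formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfExplainingClassifier(nn.Module):
    """Classifier with an auxiliary explanation head that outputs a per-feature mask."""

    def __init__(self, in_dim, n_classes, hidden=128):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())
        self.cls_head = nn.Linear(hidden, n_classes)   # prediction
        self.expl_head = nn.Linear(hidden, in_dim)     # mask logits over input features

    def forward(self, x):
        h = self.encoder(x)
        logits = self.cls_head(h)
        mask = torch.sigmoid(self.expl_head(h))        # continuous mask in [0, 1]
        return logits, mask

def sst_style_loss(model, x, y, sparsity_weight=0.01):
    logits, mask = model(x)
    task_loss = F.cross_entropy(logits, y)

    # Faithfulness term: the masked input alone should support the same prediction.
    masked_logits, _ = model(x * mask)
    faithfulness_loss = F.kl_div(
        F.log_softmax(masked_logits, dim=-1),
        F.softmax(logits.detach(), dim=-1),
        reduction="batchmean",
    )

    # Conciseness term: prefer small masks (few features flagged as the "reason").
    sparsity_loss = mask.abs().mean()

    return task_loss + faithfulness_loss + sparsity_weight * sparsity_loss
```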
2.2 LLMs and Prompt-Based SEs
Instruction-tuned LLMs can be prompted to generate verbalized SEs in the form of extractive rationales, chain-of-thought (CoT) demonstrations, or counterfactual examples. These are typically elicited after the prediction but can affect user comprehension and trust (Randl et al., 2024, Brandl et al., 2024, Dehghanighobadi et al., 25 Feb 2025). SEs in LLMs require careful prompt engineering, as format, specificity, and evaluation protocol materially affect their alignment with model internals and human understanding; illustrative prompt templates are sketched below.
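To make the prompt-engineering point concrete, here is a minimal sketch of two templates, one eliciting an extractive rationale and one eliciting a counterfactual edit. The wording, the field names, and the `generate` callable are placeholders, not the prompts used in the cited studies.

```python
RATIONALE_PROMPT = """Classify the sentiment of the review as positive or negative.
Then list, verbatim, the words or phrases from the review that most influenced your decision.

Review: {text}
Answer (label, then rationale spans):"""

COUNTERFACTUAL_PROMPT = """The review below is classified as {label}.
Rewrite it with as few edits as possible so that its sentiment becomes {target_label}.
Return only the edited review.

Review: {text}
Edited review:"""

def elicit_self_explanations(generate, text, label, target_label):
    """`generate` is any callable mapping a prompt string to the model's text output."""
    rationale = generate(RATIONALE_PROMPT.format(text=text))
    counterfactual = generate(
        COUNTERFACTUAL_PROMPT.format(text=text, label=label, target_label=target_label)
    )
    return rationale, counterfactual
```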
3. Faithfulness, Plausibility, and Evaluation Metrics
A central concern is the faithfulness of SEs: whether an explanation accurately reflects the model’s true computational rationale rather than plausible but post hoc rationalization. Faithfulness in SEs is distinct from plausibility (coherence to a human judge), and these often diverge (Agarwal et al., 2024, Madsen et al., 2024).
3.1 Quantitative Proxies for SE Faithfulness
- Self-consistency checks: Modify or mask components claimed as important by the SE and measure the impact on the prediction (Madsen et al., 2024, Brandl et al., 2024, Huang et al., 2023).
- Counterfactual validity: For self-generated counterfactuals, check whether the model’s output on the revised input matches the claimed label (Randl et al., 2024, Dehghanighobadi et al., 25 Feb 2025, Wang et al., 1 Jan 2026).
- Comprehensiveness/sufficiency: Measure the confidence change or accuracy drop when SE-identified rationale components are removed or retained (Huang et al., 2023, Brandl et al., 2024); a metric sketch follows this list.
- Similarity to human rationales: Pairwise agreement (Cohen’s κ, F₁) between model-generated and human-annotated spans (Brandl et al., 2024).
- Automated faithfulness metrics: Specialized to architecture—e.g., mask application in GNNs (Huang et al., 2024), l₁ penalty in SST (Bassan et al., 5 Feb 2025).
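A minimal sketch of the comprehensiveness/sufficiency style of check, assuming access to a class-probability function over token lists and a token-level rationale; the token-removal convention (plain deletion) is one of several used in the cited work.

```python
def comprehensiveness_and_sufficiency(prob_fn, tokens, rationale_idx, label):
    """prob_fn       : maps a token list to class probabilities indexable by label
    tokens        : the full input as a token list
    rationale_idx : indices the self-explanation marks as important
    label         : the class predicted on the full input
    """
    rationale_idx = set(rationale_idx)
    full_p = prob_fn(tokens)[label]

    # Comprehensiveness: confidence drop when the rationale tokens are removed.
    without_rationale = [t for i, t in enumerate(tokens) if i not in rationale_idx]
    comprehensiveness = full_p - prob_fn(without_rationale)[label]

    # Sufficiency: confidence drop when only the rationale tokens are kept.
    only_rationale = [t for i, t in enumerate(tokens) if i in rationale_idx]
    sufficiency = full_p - prob_fn(only_rationale)[label]

    # Faithful rationales show high comprehensiveness and low sufficiency
    # (little confidence is lost by keeping only the rationale).
    return comprehensiveness, sufficiency
```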
3.2 Core Findings
- SEs are often highly plausible but can be unfaithful, especially in LLMs where explanations may follow learned patterns rather than actual computation traces (Agarwal et al., 2024, Randl et al., 2024, Madsen et al., 2024).
- Extractive and counterfactual SEs correlate moderately with human rationales on some classification tasks but do not always align with other model introspection signals (e.g., gradients, attention) (Randl et al., 2024, Brandl et al., 2024).
- Prompting LLMs for counterfactual explanations, with well-designed and validated prompts, yields more faithful and easily verifiable rationales, often rivaling local surrogate explainers such as SHAP and LIME in local faithfulness (Randl et al., 2024, Dehghanighobadi et al., 25 Feb 2025); a simple validity check is sketched below.
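Counterfactual validity lends itself to a mechanical check: re-classify the model's own counterfactual and verify that the label flips to the target class. The `classify` callable below is an assumed interface; the check is generic rather than the exact protocol of any cited paper.

```python
def counterfactual_is_valid(classify, counterfactual_text, target_label):
    """True if the model's prediction on its own counterfactual matches the claimed target."""
    return classify(counterfactual_text) == target_label

def counterfactual_validity_rate(classify, counterfactuals):
    """counterfactuals: iterable of (edited_text, target_label) pairs produced by the model."""
    checks = [counterfactual_is_valid(classify, text, label) for text, label in counterfactuals]
    return sum(checks) / len(checks)
```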
4. SEs in Learning, Generalization, and Representation Refinement
Self-explanation strategies not only provide post hoc interpretability but can also regularize and improve learning. Applying SEs inside the training loop lets models calibrate their internal representations, increase robustness, and in some cases outperform standard regularization and even knowledge-distillation baselines (Hosseini et al., 2020, Gu et al., 2020, Stammer et al., 2023, Huang et al., 2024).
- Self-distillation via SEs: Models can distill “dark knowledge” from their own explanations, constructing soft targets that encode both incorrect-class responses and inter-class similarity, paralleling the benefits of teacher-student distillation but without the need for an external teacher (Gu et al., 2020).
- Refinement by Critique (LSX/LeaSE): A learner model iteratively receives feedback from a critic trained to solve the task given only the explanation, enforcing alignment between task performance and explanation usefulness. Reported gains include improved generalization, reduced shortcut reliance, better class-separability of explanations, and higher causal faithfulness (Stammer et al., 2023, Hosseini et al., 2020); a schematic training loop follows this list.
- Graph Neural Networks: Embedding mask-based SEs in GNN training, as in SES, allows explanations to be directly leveraged for contrastive and supervised objectives, closing the gap between interpretability and predictive performance (Huang et al., 2024).
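A schematic training loop in the LSX/LeaSE spirit: the explainer predicts and explains, a critic learns the task from the explanations alone, and the critic's loss on those explanations is fed back to the explainer. The interfaces (explainer returning a prediction and an explanation tensor, critic mapping explanations to logits), the detach pattern, and the feedback weight are illustrative assumptions rather than the published algorithms.

```python
import torch
import torch.nn.functional as F

def lsx_style_epoch(explainer, critic, loader, opt_explainer, opt_critic,
                    feedback_weight=0.1):
    """One schematic learner/critic round.

    explainer(x) -> (logits, explanation); critic(explanation) -> logits.
    """
    for x, y in loader:
        # 1. Learner step: predict and explain.
        logits, explanation = explainer(x)
        task_loss = F.cross_entropy(logits, y)

        # 2. Critic step: learn the task from the explanation alone.
        critic_logits = critic(explanation.detach())
        critic_loss = F.cross_entropy(critic_logits, y)
        opt_critic.zero_grad()
        critic_loss.backward()
        opt_critic.step()

        # 3. Feedback: explanations the critic can exploit lower the explainer's loss.
        feedback_loss = F.cross_entropy(critic(explanation), y)
        opt_explainer.zero_grad()
        (task_loss + feedback_weight * feedback_loss).backward()
        opt_explainer.step()
```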
5. Applications in Education: Cognitive Impacts and Instructional Theory
In human learning, SEs are a well-established generative strategy with substantial empirical support in mathematics and statistics education (Gao et al., 25 Mar 2025, Gao et al., 20 Aug 2025). Key theoretical foundations include:
- Retrieval Practice Hypothesis: SEs demand recall and articulation, enhancing memory consolidation and long-term retention.
- Generative Learning Hypothesis: By composing and articulating explanations, learners integrate new content with prior schema, deepening understanding and facilitating transfer to novel problems.
- Best Practices: Scaffolded prompts, exemplar explanations, and immediate feedback are necessary to harness SEs’ benefits. Optimizing SE tasks requires balancing cognitive load and supporting learners with limited prior knowledge.
Quantitative outcomes in mathematics education show medium-to-large effect sizes for SE interventions on immediate learning gains (d ≈ 0.5–0.8), especially when SEs are paired with worked examples. However, sustaining these benefits over time and transferring them to new contexts requires targeted instructional design (Gao et al., 25 Mar 2025, Gao et al., 20 Aug 2025).
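For reference, the reported d values are standardized mean differences (Cohen's d) between the SE and control conditions, with values near 0.5 conventionally read as medium effects and values near 0.8 as large:

```latex
d = \frac{\bar{X}_{\mathrm{SE}} - \bar{X}_{\mathrm{control}}}{s_{\mathrm{pooled}}},
\qquad
s_{\mathrm{pooled}} = \sqrt{\frac{(n_{\mathrm{SE}}-1)\,s_{\mathrm{SE}}^{2}
                                 + (n_{\mathrm{control}}-1)\,s_{\mathrm{control}}^{2}}
                                {n_{\mathrm{SE}} + n_{\mathrm{control}} - 2}}
```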
6. Limitations, Open Challenges, and Future Directions
6.1 Architectural and Methodological Constraints
- Variable faithfulness: SE faithfulness varies with model size, architecture, prompt structure, and task. In LLMs, the explanation format (counterfactual/edit vs. extractive vs. attribution) interacts nontrivially with both performance and truthfulness (Madsen et al., 2024, Agarwal et al., 2024, Dehghanighobadi et al., 25 Feb 2025).
- Computational overhead and design trade-offs: Self-explaining models may require additional parameters (e.g., extra heads), increased memory for mask storage (e.g., SES in dense graphs), and fine-tuning of explanation-related loss weights and thresholds (Huang et al., 2024, Bassan et al., 5 Feb 2025).
- Quantization sensitivity: Compression techniques such as post-training quantization introduce moderate (~4% quality, ~2% faithfulness) declines in SE performance, impacting both user trust and explanation coherence, with greater degradation for smaller models and free-text rationales (Wang et al., 1 Jan 2026).
6.2 Evaluation and Theoretical Gaps
- Plausibility–faithfulness gap: High human plausibility does not ensure high faithfulness. SEs that are articulate and plausible might not reflect genuine model decision logic, particularly in RLHF-tuned LLMs (Agarwal et al., 2024, Randl et al., 2024).
- Task and model specificity: No single SE format dominates across all model-task pairs. For Llama 2, counterfactual explanations are the most faithful on sentiment tasks; for Mistral, attribution-based explanations perform better; redaction explanations are most effective for Falcon 40B (Madsen et al., 2024).
- Metrics for open-ended tasks: Evaluation frameworks for generative, multi-modal, or open-ended settings are underdeveloped relative to classification domains.
6.3 Research and Practice Implications
- Systematic integration of faithfulness audits into deployment pipelines, especially for high-stakes applications in law, medicine, and safety-critical domains (Agarwal et al., 2024, Wang et al., 1 Jan 2026).
- Development of architectures and training strategies explicitly targeting faithful SEs—potential approaches include domain-specific fine-tuning with gold explanations, in-context learning with true reasoning chains, and mechanistic interpretability for alignment with network computations (Agarwal et al., 2024).
- In education and model training, refinement of SE tasks to maximize learning gains, mitigate cognitive overload, and ensure explanation quality.
7. Representative Empirical Results Across Domains
| Domain/Task | Method/Model | Faithfulness/Plausibility Score (as reported) | Remarks |
|---|---|---|---|
| Sentiment (IMDB), LLM | Llama 2-70B (counterfactual) | ≈ 50% counterfactual faithfulness | Varies strongly with model family |
| Text classification | Llama3.1-8B (SE vs. human) | κ(H, SE) = 0.60 (SST dataset, English) | Far above post hoc LRP baseline |
| GNN node classification | SES (GCN/GAT) | Up to +2.6 pts accuracy; ≈4× higher fidelity | Outperforms GNNExplainer (≈4 s vs. ≈10 min) |
| Vision (MNIST) | SST (robust) | 99.3% sufficiency faithfulness | Explanations cover 1.42% of features |
| Education/Math | Multiple interventions | d ≈ 0.5–0.8 (immediate posttest, SE vs. control) | Sustained gains with quality SEs and scaffolding |
These results underscore that SEs, when systematically integrated, improve both model performance and human interpretability, but require careful design, evaluation, and task alignment to realize their full potential.
References
- (Huang et al., 2024) SES: Bridging the Gap Between Explainability and Prediction of Graph Neural Networks
- (Randl et al., 2024) Evaluating the Reliability of Self-Explanations in LLMs
- (Wang et al., 1 Jan 2026) Can LLMs Still Explain Themselves? Investigating the Impact of Quantization on Self-Explanations
- (Hosseini et al., 2020) Learning by Self-Explanation, with Application to Neural Architecture Search
- (Hong et al., 7 Jan 2026) Do LLM Self-Explanations Help Users Predict Model Behavior? Evaluating Counterfactual Simulatability with Pragmatic Perturbations
- (Fragkathoulas et al., 2024) Local Explanations and Self-Explanations for Assessing Faithfulness in black-box LLMs
- (Bassan et al., 5 Feb 2025) Explain Yourself, Briefly! Self-Explaining Neural Networks with Concise Sufficient Reasons
- (Madsen et al., 2024) Are self-explanations from LLMs faithful?
- (Brandl et al., 2024) Comparing zero-shot self-explanations with human rationales in text classification
- (Gao et al., 25 Mar 2025) Student Explanation Strategies in Postsecondary Mathematics and Statistics Education: A Scoping Review
- (Gao et al., 20 Aug 2025) Student explanation in middle and secondary mathematics and statistics: A scoping literature review
- (Huang et al., 2023) Can LLMs Explain Themselves? A Study of LLM-Generated Self-Explanations
- (Agarwal et al., 2024) Faithfulness vs. Plausibility: On the (Un)Reliability of Explanations from LLMs
- (Gu et al., 2020) Introspective Learning by Distilling Knowledge from Online Self-explanation
- (Stammer et al., 2023) Learning by Self-Explaining