
Self-Explanations in AI & Education

Updated 8 January 2026
  • Self-explanations are on-the-fly, instance-level justifications generated during decision-making by models or by human learners, enhancing clarity and accountability.
  • They are integrated within model architectures such as GNNs and LLMs to deliver faithful, causal explanations that improve prediction, generalization, and interpretability.
  • Applications range from enhancing user trust and robustness in machine learning to boosting retention and transfer in education through structured, self-directed explanations.

Self-explanations (SEs) are model- or agent-generated explanations of individual decisions, predictions, or reasoning steps produced contemporaneously with inference or training. SEs arise across machine learning, LLMs, graph neural networks, and education, operating at both algorithmic and cognitive levels. Unlike post hoc explanations, SEs are inherently intertwined with model computation and can be directly harnessed to augment learning, generalization, interpretability, and user trust.

1. Formal Definitions, Scope, and Historical Context

Self-explanations (SEs) are instance-level explanations generated by a model (or human learner) to justify, articulate, or clarify its output, often on a per-decision basis. The defining property of SEs is their “on-the-fly” construction by the decision-maker itself, as opposed to post hoc, model-agnostic surrogate explanations. In deep learning, SEs are now pervasive in natural language (chain-of-thought rationales, token rationales, counterfactual edits), computer vision (masking, heatmaps), and relational domains (node/edge masks in GNNs) (Huang et al., 2024, Huang et al., 2023, Hosseini et al., 2020, Fragkathoulas et al., 2024, Bassan et al., 5 Feb 2025).

Representative Forms

In education, SEs refer to learners’ self-directed generation of explanations integrating new content with prior schema, widely studied in mathematics and statistics learning for their effect on retention and transfer (Gao et al., 25 Mar 2025, Gao et al., 20 Aug 2025).

The paradigm of SEs emerged from both cognitive science (self-explanation effect—Chi et al., 1989) and explainable AI, evolving from the study of introspective reasoning in humans to algorithmic mechanisms in contemporary models.

2. Self-Explanations in Machine Learning Architectures

2.1 Neural Networks with Built-in SE Mechanisms

Modern “self-explaining” architectures embed explanation mechanisms in the network itself, allowing the model to output both predictions and associated mask or rationale vectors. Notable frameworks include:

  • SES (Self-Explained and Self-Supervised GNNs): Combines a backbone GNN encoder with a global mask generator to produce node-wise feature and edge masks during training, ensuring explanations are causally aligned with the model’s own message passing (Huang et al., 2024).
  • Sufficient Subset Training (SST): Augments classifiers with an additional explanation head, producing continuous mask vectors at each inference. Training enforces that the masked input induced by the explanation preserves the prediction with minimal information, optimizing for faithfulness and conciseness (Bassan et al., 5 Feb 2025); a minimal sketch of this pattern follows the list.
  • Learning by Self-Explanation (LeaSE, LSX): Employs two-network loops (explainer and critic/audience). The explainer creates explanations that are then used to teach an auxiliary model (audience or critic), and feedback from the critic iteratively refines the explainer, yielding improved generalization and interpretability (Hosseini et al., 2020, Stammer et al., 2023).
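As a minimal sketch of this shared pattern, the following PyTorch-style classifier returns a prediction and a continuous feature mask from a single forward pass, and a combined loss rewards predictions that survive masking while keeping the mask sparse. The class, function names, and loss weighting are illustrative assumptions in the spirit of SST, not the published training procedure.

```python
import torch
import torch.nn as nn

class SelfExplainingClassifier(nn.Module):
    """Backbone encoder with a prediction head and an explanation (mask) head."""
    def __init__(self, in_dim: int, hidden: int, n_classes: int):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())
        self.cls_head = nn.Linear(hidden, n_classes)   # prediction
        self.expl_head = nn.Linear(hidden, in_dim)     # per-feature mask logits

    def forward(self, x):
        h = self.encoder(x)
        logits = self.cls_head(h)
        mask = torch.sigmoid(self.expl_head(h))        # continuous mask in [0, 1]
        return logits, mask

def sst_style_loss(model, x, y, sparsity_weight=0.01):
    """Task loss + 'masked input preserves the prediction' + mask sparsity.
    The weighting and exact terms are assumptions, not the published objective."""
    ce = nn.CrossEntropyLoss()
    logits, mask = model(x)
    masked_logits, _ = model(x * mask)   # re-predict on the explained subset only
    return ce(logits, y) + ce(masked_logits, y) + sparsity_weight * mask.mean()
```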

2.2 LLMs and Prompt-Based SEs

Instruction-tuned LLMs can be prompted to generate verbalized SEs in the form of extractive rationales, chain-of-thought (CoT) demonstrations, or counterfactual examples. These are typically elicited after the prediction and can affect user comprehension and trust in the model (Randl et al., 2024, Brandl et al., 2024, Dehghanighobadi et al., 25 Feb 2025). SEs in LLMs require careful prompt engineering, as the format, specificity, and evaluation protocol materially affect their alignment with model internals and human understanding.
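As a rough illustration of how such SEs can be elicited, the sketch below builds a prompt that asks for an extractive rationale and step-by-step reasoning before the final answer. The prompt wording and the `generate` callable are placeholders, not the protocols used in the cited studies.

```python
def build_se_prompt(document: str, question: str) -> str:
    """Illustrative prompt requesting an extractive rationale and CoT before the answer."""
    return (
        "Answer the question about the document below.\n"
        "First quote the sentences that justify your answer (extractive rationale),\n"
        "then explain your reasoning step by step,\n"
        "then give the final answer on its own line prefixed with 'Answer:'.\n\n"
        f"Document:\n{document}\n\nQuestion: {question}\n"
    )

def elicit_self_explanation(generate, document: str, question: str):
    """`generate` is any callable mapping a prompt string to model text (placeholder)."""
    output = generate(build_se_prompt(document, question))
    rationale, _, answer = output.partition("Answer:")
    return rationale.strip(), answer.strip()
```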

3. Faithfulness, Plausibility, and Evaluation Metrics

A central concern is the faithfulness of SEs: whether an explanation accurately reflects the model’s true computational rationale rather than a plausible but post hoc rationalization. Faithfulness in SEs is distinct from plausibility (coherence as judged by a human), and the two often diverge (Agarwal et al., 2024, Madsen et al., 2024).

3.1 Quantitative Proxies for SE Faithfulness
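Two proxies that recur in this literature are sufficiency-style checks (does the masked input induced by the explanation preserve the prediction, as in SST?) and removal-style checks (does deleting the explained features change the prediction, as in counterfactual faithfulness?). The sketch below is a generic illustration of both, assuming a model that returns class probabilities; it is not the exact metric definition used in any one cited evaluation.

```python
import numpy as np

def sufficiency(predict, x: np.ndarray, mask: np.ndarray) -> bool:
    """Sufficiency-style check: does keeping only the explained features
    preserve the predicted class?"""
    return int(np.argmax(predict(x))) == int(np.argmax(predict(x * mask)))

def removal_effect(predict, x: np.ndarray, mask: np.ndarray) -> float:
    """Removal-style check: how much does deleting the explained features
    reduce the originally predicted class probability?"""
    full_probs = predict(x)
    cls = int(np.argmax(full_probs))
    return float(full_probs[cls] - predict(x * (1.0 - mask))[cls])
```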

3.2 Core Findings

4. SEs in Learning, Generalization, and Representation Refinement

Self-explanation strategies not only provide post hoc interpretability but can also regularize and improve learning. Applying SEs inside the training loop enables models to calibrate their internal representations, increase robustness, and sometimes outperform standard regularization and even knowledge distillation baselines (Hosseini et al., 2020, Gu et al., 2020, Stammer et al., 2023, Huang et al., 2024).

  • Self-distillation via SEs: Models can distill “dark knowledge” from their own explanations, constructing soft targets that encode both incorrect-class responses and inter-class similarity, paralleling the benefits of teacher-student distillation but without the need for an external teacher (Gu et al., 2020).
  • Refinement by Critique (LSX/LeaSE): A learner model iteratively receives feedback from a critic trained to solve the task given only the explanation, enforcing alignment between task performance and explanation usefulness. Gains include improved generalization, reduced shortcut reliance, increased explanation class-separability, and higher causal faithfulness (Stammer et al., 2023, Hosseini et al., 2020); a schematic training loop is sketched after this list.
  • Graph Neural Networks: Embedding mask-based SEs in GNN training, as in SES, allows explanations to be directly leveraged for contrastive and supervised objectives, closing the gap between interpretability and predictive performance (Huang et al., 2024).
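As a schematic of the explainer-critic loop described above (reusing the (logits, mask) interface from the earlier sketch), the code below alternates critic updates on explanation-only inputs with explainer updates that include the critic's feedback. The objectives and update schedules in LSX/LeaSE differ in detail, so treat this as structural illustration only.

```python
import torch.nn as nn

def lsx_style_step(explainer, critic, opt_explainer, opt_critic, x, y):
    """One schematic explainer-critic iteration (see caveats in the lead-in)."""
    ce = nn.CrossEntropyLoss()

    # 1) The explainer predicts and produces an explanation mask for its input.
    logits, mask = explainer(x)

    # 2) The critic is trained to solve the task from the explanation alone.
    critic_loss = ce(critic(x * mask.detach()), y)
    opt_critic.zero_grad()
    critic_loss.backward()
    opt_critic.step()

    # 3) The explainer is updated on its own task loss plus the critic's feedback,
    #    so explanations must be useful enough for the critic to succeed.
    explainer_loss = ce(logits, y) + ce(critic(x * mask), y)
    opt_explainer.zero_grad()
    explainer_loss.backward()
    opt_explainer.step()
    return float(explainer_loss)
```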

5. Applications in Education: Cognitive Impacts and Instructional Theory

In human learning, SEs are a well-established generative strategy with substantial empirical support in mathematics and statistics education (Gao et al., 25 Mar 2025, Gao et al., 20 Aug 2025). Key theoretical foundations include:

  • Retrieval Practice Hypothesis: SEs demand recall and articulation, enhancing memory consolidation and long-term retention.
  • Generative Learning Hypothesis: By composing and articulating explanations, learners integrate new content with prior schema, deepening understanding and facilitating transfer to novel problems.
  • Best Practices: Scaffolded prompts, exemplar explanations, and immediate feedback are necessary to harness SEs’ benefits. Optimizing SE tasks requires balancing cognitive load and supporting learners with limited prior knowledge.

Quantitative outcomes in mathematics education show medium-to-large effect sizes for SE interventions on immediate learning gains (d ≈ 0.5–0.8), especially when SEs are paired with worked examples. However, sustaining these benefits over time and transferring them to new contexts requires targeted instructional design (Gao et al., 25 Mar 2025, Gao et al., 20 Aug 2025).
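For reference, the reported effect size is Cohen's d, the standardized difference between group means; a small sketch of the pooled-standard-deviation form is given below (the data arrays are illustrative).

```python
import numpy as np

def cohens_d(treatment: np.ndarray, control: np.ndarray) -> float:
    """Standardized mean difference with a pooled standard deviation."""
    nt, nc = len(treatment), len(control)
    pooled_var = ((nt - 1) * treatment.var(ddof=1)
                  + (nc - 1) * control.var(ddof=1)) / (nt + nc - 2)
    return float((treatment.mean() - control.mean()) / np.sqrt(pooled_var))
```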

6. Limitations, Open Challenges, and Future Directions

6.1 Architectural and Methodological Constraints

  • Faithfulness remains variable: SEs’ faithfulness varies by model size, architecture, prompt structure, and task. In LLMs, explanation format (counterfactual/edit vs. extractive vs. attribution) interacts nontrivially with both performance and truthfulness (Madsen et al., 2024, Agarwal et al., 2024, Dehghanighobadi et al., 25 Feb 2025).
  • Computational overhead and design trade-offs: Self-explaining models may require additional parameters (e.g., extra heads), increased memory for mask storage (e.g., SES in dense graphs), and fine-tuning of explanation-related loss weights and thresholds (Huang et al., 2024, Bassan et al., 5 Feb 2025).
  • Quantization sensitivity: Compression techniques such as post-training quantization introduce moderate declines in SE quality (≈4%) and faithfulness (≈2%), impacting both user trust and explanation coherence, with greater degradation for smaller models and free-text rationales (Wang et al., 1 Jan 2026).

6.2 Evaluation and Theoretical Gaps

  • Plausibility–faithfulness gap: High human plausibility does not ensure high faithfulness. SEs that are articulate and plausible might not reflect genuine model decision logic, particularly in RLHF-tuned LLMs (Agarwal et al., 2024, Randl et al., 2024).
  • Task and model specificity: No single SE format dominates across all model-task pairs. Counterfactual explanations are most faithful on sentiment for Llama 2, attribution is superior for Mistral, and redaction explanations work best for Falcon 40B (Madsen et al., 2024).
  • Metrics for open-ended tasks: Evaluation frameworks for generative, multi-modal, or open-ended settings are underdeveloped relative to classification domains.

6.3 Research and Practice Implications

7. Representative Empirical Results Across Domains

Domain/Task | Method/Model | Faithfulness/Plausibility Score (as reported) | Remarks
Sentiment (IMDB) | Llama 2-70B, counterfactual SE | ≈ 50% counterfactual faithfulness | Varies strongly with model family
Text classification | Llama3.1-8B, SE vs. human | κ(H, SE) = 0.60 (SST, English) | Far above post hoc LRP baseline
GNN node classification | SES (GCN/GAT) | Up to +2.6 pts accuracy; ≈4× higher fidelity | Outperforms GNNExplainer (≈4 s vs. 10 min)
Vision (MNIST) | SST (robust) | 99.3% sufficiency faithfulness | Explanations cover 1.42% of features
Education (mathematics) | Multiple interventions | d ≈ 0.5–0.8 (immediate posttest, SE vs. control) | Sustained gains with quality SEs and scaffolding

These results underscore that SEs, when systematically integrated, improve both model performance and human interpretability, but require careful design, evaluation, and task alignment to realize their full potential.

