Human Evaluation of Interpretability

Updated 22 April 2026

Human evaluation of interpretability is a multifaceted process that measures how effectively users understand ML model explanations by assessing faithfulness, intelligibility, plausibility, and stability.
Evaluation methodologies such as forward simulation, verification, and counterfactual tasks enable researchers to capture both objective performance and subjective satisfaction across diverse applications.
Quantitative metrics including task accuracy, response time, and user ratings, complemented by rigorous experimental designs, drive best practices for developing trustworthy explanation systems.

Human evaluation of interpretability encompasses empirical methodologies, metrics, and theoretical frameworks for quantifying how, and to what degree, people can understand, predict, and utilize machine learning model explanations. The field covers both functional (task-based) and subjective aspects of interpretability, with applications spanning tabular models, deep neural networks, combinatorial optimization, representation learning, computer vision, and knowledge discovery.

1. Conceptual Foundations and Definitions

Interpretability is defined as the degree to which a human can consistently predict, comprehend, and reason about a model’s behavior or outputs given an explanation artifact. Modern scholarship distinguishes between multiple operative dimensions:

Faithfulness: The extent to which an explanation accurately reflects the model’s true causal responses to input (Pinto et al., 2024).
Intelligibility: The degree to which an explanation enables a user to reliably generate correct predictions, counterfactuals, or rationalizations in domain tasks (Pinto et al., 2024).
Plausibility: Whether an explanation resonates with a user’s background domain knowledge or mental model (Pinto et al., 2024).
Stability: Robustness of the explanation under small input perturbations (Pinto et al., 2024).

These axes are not necessarily aligned: a perfectly faithful but highly complex explanation may not be intelligible, and a plausible explanation may misrepresent the model’s logic. Human evaluation is essential to assess which explanations support end-user verification, trust, and effective oversight (Mohseni et al., 2018, Müller et al., 2023).
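
As a rough illustration of the stability axis, the sketch below estimates how much an explanation changes relative to a small input change. The toy linear-model attribution, the uniform perturbation scheme, and the names `attribution` and `stability_score` are assumptions made for illustration; none of them is prescribed by the cited work.

```python
import math
import random


def attribution(weights, x):
    """Toy explanation (assumed): per-feature contribution w_i * x_i of a linear model."""
    return [w * xi for w, xi in zip(weights, x)]


def l2(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def stability_score(explain, x, epsilon=0.01, n_samples=100, seed=0):
    """Largest observed ratio of explanation change to input change.

    Perturbations are drawn uniformly from [-epsilon, epsilon] per feature.
    Lower values indicate a more stable explanation.
    """
    rng = random.Random(seed)
    base = explain(x)
    worst = 0.0
    for _ in range(n_samples):
        x_pert = [xi + rng.uniform(-epsilon, epsilon) for xi in x]
        dx = l2(x, x_pert)
        if dx == 0.0:
            continue  # skip the degenerate case of an unchanged input
        worst = max(worst, l2(base, explain(x_pert)) / dx)
    return worst


weights = [0.5, -2.0, 1.0]  # hypothetical model weights
score = stability_score(lambda x: attribution(weights, x), [1.0, 2.0, 3.0])
```

For this toy attribution the ratio cannot exceed the largest absolute weight (2.0 here), which gives a simple sanity check on the estimate. Such a score says nothing about intelligibility or plausibility, which still require human studies.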

2. Human-Centered Evaluation Methodologies

The empirical study of interpretability uses task-driven and subjective protocols:

Forward simulation: Participants apply the explanation to predict the model’s output for observed inputs. Accuracy here directly operationalizes simulatability (Slack et al., 2019).
Verification: Subjects judge the consistency between a model’s decision and the provided explanation, often under time constraints (Narayanan et al., 2018).
Counterfactual simulation: Users predict how small input changes alter model output, probing local explainability (Slack et al., 2019, Pinto et al., 2024).
Distinction and agreement tasks: In vision, tasks such as selecting which of several explanations matches a correct or incorrect output quantify diagnostic value and user trust (Kim et al., 2021).
Subjective ratings: Likert-scale judgments of perceived comprehensibility, satisfaction, or explanatory “goodness” (Mohseni et al., 2018, Müller et al., 2023).
Gaze and reaction-time: Objective measures of attention allocation and cognitive workload during an interpretability task (Pegler et al., 9 Mar 2026, Müller et al., 2023).

Protocols are tailored to model type, explanation form, and intended application (e.g., image classification, combinatorial problem-solving, music theory discovery). They may include application-grounded (in-domain professionals), human-grounded (generic but controlled tasks), or functionally grounded (proxy metrics) methods (Pinto et al., 2024).
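
Scoring for the forward and counterfactual simulation tasks described above might be sketched as follows. The `Trial` record and its field names are hypothetical. The point the sketch tries to make explicit is that a participant's guess is compared with the model's output, not with the ground-truth label.

```python
from dataclasses import dataclass


@dataclass
class Trial:
    participant: str
    task: str          # "forward" or "counterfactual" (assumed labels)
    model_output: int  # what the model actually predicted
    human_guess: int   # what the participant expected the model to predict


def simulation_accuracy(trials, task):
    """Fraction of trials of one task type where the guess matched the model."""
    relevant = [t for t in trials if t.task == task]
    if not relevant:
        raise ValueError(f"no trials recorded for task {task!r}")
    hits = sum(t.human_guess == t.model_output for t in relevant)
    return hits / len(relevant)


# Made-up responses from two participants.
trials = [
    Trial("p1", "forward", 1, 1),
    Trial("p1", "forward", 0, 1),
    Trial("p2", "forward", 1, 1),
    Trial("p2", "forward", 0, 0),
    Trial("p1", "counterfactual", 1, 0),
    Trial("p2", "counterfactual", 0, 0),
]

forward_acc = simulation_accuracy(trials, "forward")                # 3 of 4 correct
counterfactual_acc = simulation_accuracy(trials, "counterfactual")  # 1 of 2 correct
```

In a real study the same records would usually also carry response time and an item identifier, so that accuracy can be analyzed per participant and per item.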

3. Quantitative Metrics and Statistical Analysis

Metrics used for human evaluation are constructed to capture performance, processing effort, and subjective response:

Metric	Formalization/Details	Utility
Task accuracy	$\frac{1}{N} \sum_{i=1}^N \mathbb{I}(\text{correct}_i)$	Objective success in simulation, verification, or counterfactual tasks (Lage et al., 2019, Slack et al., 2019)
Response time	Mean or distribution of seconds to task completion	Assesses cognitive efficiency and scale effects (Narayanan et al., 2018, Biessmann et al., 2019)
Satisfaction / Ease-of-use	Post-task Likert ratings, normalized per user	Tracks subjective usability as complexity varies (Narayanan et al., 2018)
Human-attention agreement	$1 - \text{MAE}$ between model saliency and aggregate human annotation maps	Proxy for semantic congruence in visual/text domains (Mohseni et al., 2018)
Inter-annotator reliability	Pearson/Spearman correlation, Cohen’s $\kappa$	Measures reproducibility and shared interpretation (Paulo et al., 11 Jul 2025)
Psychometric thresholds	Mask fraction at which accuracy crosses criterion (e.g., 75%)	Quantifies perceptual threshold for explanation sufficiency (Biessmann et al., 2019)
Operation count	Total arithmetic/boolean evaluations required by human to simulate model	Complexity proxy for simulatability (Slack et al., 2019)
Distinction/Agreement rate	Fraction of correct identification of outputs or trust in model recommendations	Tests distinguishability and confirmation bias (Kim et al., 2021)
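
A minimal sketch of how four of the tabulated metrics could be computed is given below. It assumes that saliency and annotation maps are flattened to equal-length lists scaled to [0, 1], that exactly two annotators are compared, and that accuracy changes monotonically with mask fraction. Linear interpolation is used for the threshold in place of a fitted psychometric curve; that substitution is a simplification, not the procedure of the cited studies.

```python
from collections import Counter


def task_accuracy(correct):
    """(1/N) * sum of the indicator that trial i was answered correctly."""
    return sum(1 for c in correct if c) / len(correct)


def attention_agreement(model_saliency, human_map):
    """1 - MAE between two flattened maps with values in [0, 1]."""
    n = len(model_saliency)
    mae = sum(abs(m - h) for m, h in zip(model_saliency, human_map)) / n
    return 1.0 - mae


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Agreement expected by chance from each annotator's label frequencies.
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    return (observed - expected) / (1.0 - expected)


def psychometric_threshold(mask_fractions, accuracies, criterion=0.75):
    """Mask fraction where accuracy crosses the criterion (linear interpolation)."""
    points = list(zip(mask_fractions, accuracies))
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if y0 != y1 and (y0 - criterion) * (y1 - criterion) <= 0:
            return x0 + (criterion - y0) * (x1 - x0) / (y1 - y0)
    return None  # the criterion is never crossed in the measured range


acc = task_accuracy([True, True, False, True])
agreement = attention_agreement([0.0, 0.5, 1.0, 0.5], [0.0, 1.0, 1.0, 0.0])
kappa = cohens_kappa(["y", "y", "n", "n"], ["y", "n", "n", "n"])
threshold = psychometric_threshold([0.0, 0.2, 0.4, 0.6], [0.95, 0.85, 0.65, 0.5])
```

With these made-up inputs, accuracy and agreement are both 0.75, kappa is 0.5, and the interpolated threshold lies at a mask fraction of approximately 0.3.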

Linear and generalized linear mixed-effects models, repeated-measures ANOVAs, Bayesian multilevel models, and correlation analyses are employed to quantify statistical significance, marginal effects, and user/item variance (Pegler et al., 9 Mar 2026, Dinu et al., 2020). Where applicable, these metrics are compared to non-human-in-the-loop (NHIL) computational proxies to assess alignment (Biessmann et al., 2019).
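
As an illustration of the mixed-effects approach, one common specification (an assumed textbook form, not taken from any single cited study) models the response $y_{ij}$ of participant $i$ on item $j$ under explanation condition $x_{ij}$ with crossed random intercepts:

$$
y_{ij} = \beta_0 + \beta_1 x_{ij} + u_i + v_j + \varepsilon_{ij}, \qquad
u_i \sim \mathcal{N}(0, \sigma_u^2), \quad
v_j \sim \mathcal{N}(0, \sigma_v^2), \quad
\varepsilon_{ij} \sim \mathcal{N}(0, \sigma^2)
$$

Here $\beta_1$ is the marginal effect of the explanation condition, while $\sigma_u^2$ and $\sigma_v^2$ capture user and item variance. For binary outcomes such as task correctness, the generalized variant applies the same linear predictor on the logit scale:

$$
\operatorname{logit} \Pr(\text{correct}_{ij} = 1) = \beta_0 + \beta_1 x_{ij} + u_i + v_j
$$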

4. Insights from Domain-Specific Studies

Tabular and Rule-based Models: Explanation complexity is quantified by the number of lines (rules), the number of newly introduced cognitive chunks, and the number of variable repetitions. Human response time increases most with new cognitive chunks (+6–8 s), especially when the chunks are defined explicitly rather than used in-line. Task accuracy remains robust except at the highest complexity levels (Narayanan et al., 2018, Lage et al., 2019).
Linear and Neural Models: Simulatability and “what if” local explainability are highest for decision trees, moderate for logistic regression, and lowest for neural networks. Human performance drops as the number of operations grows past ~100–150 (Slack et al., 2019).
Vision and Attention Models: Human-centered evaluation frameworks such as HIVE (Kim et al., 2021) and psychophysical approaches (Biessmann et al., 2019) reveal that perceived interpretability may diverge from model fidelity and vary with task, data domain, and explanation visualization. Human attention maps and ground-truth segmentations yield different evaluation standards from standard saliency or heatmap methods (Mohseni et al., 2018, Müller et al., 2023).
Generative Models & Representation Learning: Interactive reconstruction tasks, where users manipulate latent space sliders to match a target instance, robustly differentiate between disentangled and entangled representations, outperforming traditional single-dimension traversals or mutual information-based automated measures (Ross et al., 2021).
Combinatorial Optimization: Human preference for solution explanations is most strongly predicted by ordered visual representation, alignment with greedy heuristics, and compositional simplicity of sub-solutions, as validated in pairwise forced-choice experiments (Pegler et al., 9 Mar 2026).
Sparse Feature Models: Intruder detection, a forced-choice paradigm where users must spot the outlier among grouped activating contexts, is highly correlated with both human and LLM-based judgments, providing a language-model-free metric (Paulo et al., 11 Jul 2025).
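
The intruder detection paradigm mentioned above can be sketched as a trial builder plus a scorer. The function names, the choice of four inliers per trial, and the chance-corrected score are assumptions for illustration and are not claimed to match the cited protocol.

```python
import random


def make_intruder_trial(feature_contexts, other_contexts, n_inliers=4, rng=None):
    """Build one forced-choice trial.

    Returns (options, intruder_index): n_inliers contexts that activate the
    feature plus one context drawn from elsewhere, placed at a random position.
    """
    rng = rng or random.Random()
    options = rng.sample(feature_contexts, n_inliers)
    intruder_index = rng.randrange(n_inliers + 1)
    options.insert(intruder_index, rng.choice(other_contexts))
    return options, intruder_index


def intruder_detection_score(choices, answers, n_options):
    """Return raw accuracy and accuracy rescaled so that chance maps to 0."""
    accuracy = sum(c == a for c, a in zip(choices, answers)) / len(answers)
    chance = 1.0 / n_options
    return accuracy, (accuracy - chance) / (1.0 - chance)


# Hypothetical contexts for a feature that seems to respond to weather terms.
feature_contexts = ["heavy rain", "light snow", "strong wind", "dense fog", "hail storm"]
other_contexts = ["tax return", "chess opening"]

options, intruder_index = make_intruder_trial(
    feature_contexts, other_contexts, rng=random.Random(0)
)

# Made-up participant choices against the true intruder positions of four trials.
accuracy, corrected = intruder_detection_score([2, 0, 1, 3], [2, 0, 4, 3], n_options=5)
```

Because the participant only picks an item, no natural-language description of the feature has to be generated or graded, which is what keeps the score independent of a language model.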

5. Challenges, Biases, and Limitations

Multiple studies caution against universal claims of interpretability or relying solely on proxy or automated metrics. Key findings include:

Contextual confounding: Users may ignore or misinterpret explanations in favor of more obvious cues (e.g., numeric error rates), and feature-attribution bars can act as a placebo or even be harmful, especially in high-dimensional or unfamiliar domains (Dinu et al., 2020). In paired comparison tasks, no effect of top-n features on interpretability was found.
Appearance and semantic bias: Visual style, blockiness, or saliency map smoothness affects subjective ratings and perception of explanation quality, above and beyond objective alignment (Mohseni et al., 2018).
User heterogeneity: Prior knowledge, professional background, and cognitive strategies vary widely; item-level variance often exceeds person-level variance in interpretability studies (Dinu et al., 2020, Kim et al., 2021). Onboarding and detailed tutorials are necessary for consistent evaluation (Alvarez-Melis et al., 2021).
Disentangling explanation vs. evaluation: Methods that conflate explanation generation (e.g., LLM-based sentence summaries) with evaluation obscure the true interpretability of underlying features; direct, forced-choice metrics such as intruder detection help decouple these aspects (Paulo et al., 11 Jul 2025).

6. Best Practices, Design Principles, and Frameworks

Research has converged on several design and evaluation imperatives:

Select task protocols (simulation, verification, counterfactual, distinction) that match the intended practical use and user group (Pinto et al., 2024, Biessmann et al., 2019).
Regularize explanations to minimize rule length, cognitive chunk count, and explicit definition burden, keeping variable repetition secondary (Narayanan et al., 2018, Lage et al., 2019).
For model selection, use runtime operation count as a lower-bound proxy for simulatability; models requiring more than ~200 operations per decision are rarely human-verifiable (Slack et al., 2019).
Visualization techniques and human attention baselines, when anchored in multi-annotator consensus, offer scalable and domain-agnostic reference points for evaluating local explanation methods (Mohseni et al., 2018).
When designing explanation interfaces, incorporate modular, contrastive, and exhaustive presentation following cognitive science principles (e.g., weight-of-evidence, sequential logic) (Alvarez-Melis et al., 2021).
Report both quantitative and qualitative data, including confidence, subjective satisfaction, and error typology, and supply open-source evaluation code and study protocols for reproducibility (Paulo et al., 11 Jul 2025, Dinu et al., 2020).
A unified framework for explanation evaluation should explicitly address stability, faithfulness, plausibility, and intelligibility as orthogonal but co-requisite properties (Pinto et al., 2024).
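
The operation-count guideline above can be made concrete with a small sketch. The counting conventions used here (one operation per multiplication, addition, comparison, or activation) are assumptions; the cited work may count differently. The default budget of 200 simply mirrors the approximate figure quoted in the list.

```python
def logistic_regression_ops(n_features):
    """n multiplications + n additions (bias included) + 1 threshold comparison."""
    return 2 * n_features + 1


def decision_tree_ops(path_depth):
    """One comparison per internal node on the root-to-leaf path."""
    return path_depth


def mlp_ops(layer_sizes):
    """Dense network: each unit needs fan_in multiplications, fan_in additions
    (bias included), and one activation evaluation."""
    return sum(
        fan_out * (2 * fan_in + 1)
        for fan_in, fan_out in zip(layer_sizes, layer_sizes[1:])
    )


def within_budget(op_count, budget=200):
    """True if a single decision stays within the assumed operation budget."""
    return op_count <= budget


candidates = {
    "tree, depth 6": decision_tree_ops(6),
    "logistic regression, 10 features": logistic_regression_ops(10),
    "logistic regression, 120 features": logistic_regression_ops(120),
    "MLP 10-16-1": mlp_ops([10, 16, 1]),
}
verifiable = {name: within_budget(ops) for name, ops in candidates.items()}
```

Under these conventions the shallow tree and the 10-feature logistic regression stay well under the budget, while the 120-feature regression (241 operations) and even a small multilayer perceptron (369 operations) exceed it.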

7. Open Questions and Future Directions

Research continues to seek efficient, reliable human evaluation designs that scale to diverse ML modalities, complex applications, and sophisticated user needs:

Community-wide adoption of standardized interpretability tasks (intruder detection, interactive reconstruction, forward/counterfactual simulation) is needed to support benchmarking and comparative studies (Ross et al., 2021, Paulo et al., 11 Jul 2025).
Extension of human attention and example-based explanation evaluation beyond vision and text into audio, tabular, temporal, and combinatorial domains (Mohseni et al., 2018, Pegler et al., 9 Mar 2026).
Integrating LLM-based evaluations into the human evaluation loop, while maintaining separation of explanation generation and assessment (Paulo et al., 11 Jul 2025).
Operationalizing plausibility alongside intelligibility, especially in domains where semantic coherence may diverge from model logic (Pinto et al., 2024).
Developing methods for quantifying and mitigating confirmation bias, overreliance, and other sociocognitive effects in human–AI collaboration (Kim et al., 2021, Dinu et al., 2020).

The field of human evaluation for interpretability continuously evolves, integrating rigorous experimental design, cognitive science, statistics, and interactive visualization to ensure that explanation systems meet both technical and human-centric criteria for use and trust.