Non-identifiability of Explanations from Model Behavior in Deep Networks of Image Authenticity Judgments

Published 8 Apr 2026 in cs.CV and cs.LG | (2604.07254v1)

Abstract: Deep neural networks can predict human judgments, but this does not imply that they rely on human-like information or reveal the cues underlying those judgments. Prior work has addressed this issue using attribution heatmaps, but their explanatory value in itself depends on robustness. Here we tested the robustness of such explanations by evaluating whether models that predict human authenticity ratings also produce consistent explanations within and across architectures. We fit lightweight regression heads to multiple frozen pretrained vision models and generated attribution maps using Grad-CAM, LIME, and multiscale pixel masking. Several architectures predicted ratings well, reaching about 80% of the noise ceiling. VGG models achieved this by tracking image quality rather than authenticity-specific variance, limiting the relevance of their attributions. Among the remaining models, attribution maps were generally stable across random seeds within an architecture, especially for EfficientNetB3 and Barlow Twins, and consistency was higher for images judged as more authentic. Crucially, agreement in attribution across architectures was weak even when predictive performance was similar. To address this, we combined models in ensembles, which improved prediction of human authenticity judgments and enabled image-level attribution via pixel masking. We conclude that while deep networks can predict human authenticity judgments well, they do not produce identifiable explanations for those judgments. More broadly, our findings suggest that post hoc explanations from successful models of behavior should be treated as weak evidence for cognitive mechanism.



Summary

The paper demonstrates that high prediction accuracy does not guarantee psychologically valid, identifiable explanations for image authenticity.
It evaluates various attribution methods across different CNN architectures, revealing significant disparities in explanation maps.
The study finds that ensemble techniques improve predictive performance but do not resolve the inherent non-identifiability of model explanations.


Introduction

The paper addresses a central question in using deep neural networks (DNNs) to model human judgments: does high accuracy in predicting human perceptions, here image authenticity, license a mechanistic psychological interpretation of post hoc explanations? Specifically, the work systematically evaluates the stability and cross-model agreement of attribution heatmaps for DNNs trained to predict human authenticity ratings of AI-generated images. The question matters because hyper-realistic AI-generated imagery is proliferating and the features that drive human perception of authenticity remain unclear.

Methodological Framework

The authors employ a suite of pre-trained, frozen convolutional vision backbones (Barlow Twins, ResNet152, DenseNet161, EfficientNetB3, VGG16/19). Lightweight regression heads are trained atop these frozen features to predict human authenticity ratings on curated subsets of the AIGCIQA2023 dataset. Gradient-based (Grad-CAM) and perturbation-based (multiscale pixel masking, MPM; LIME) attribution techniques are used to reveal putative decision evidence. The study assesses explanations along two axes: (1) within-architecture consistency (stability of explanation maps across random initializations and training sets of the regression heads on the same backbone) and (2) cross-architecture agreement (alignment of explanation maps across model classes with similar predictive performance).
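To make the "lightweight head on frozen features" setup concrete, here is a minimal sketch. It assumes the backbone features have already been extracted into a matrix, and it uses closed-form ridge regression as a stand-in for the head; the paper's actual head architecture, regularization, and data are not specified here, so the synthetic features, the `alpha` value, and the function names are all illustrative assumptions.

```python
import numpy as np

def fit_ridge_head(features, ratings, alpha=1.0):
    """Closed-form ridge regression on precomputed (frozen) features.

    Assumption: a linear head stands in for the paper's regression head.
    Returns (weights, bias).
    """
    mu_x = features.mean(axis=0)
    mu_y = ratings.mean()
    xc = features - mu_x
    yc = ratings - mu_y
    d = features.shape[1]
    weights = np.linalg.solve(xc.T @ xc + alpha * np.eye(d), xc.T @ yc)
    bias = mu_y - mu_x @ weights
    return weights, bias

def plcc(pred, target):
    """Pearson linear correlation coefficient between predictions and ratings."""
    return float(np.corrcoef(pred, target)[0, 1])

# Synthetic stand-ins for frozen backbone features and mean human ratings.
rng = np.random.default_rng(0)
n_train, n_test, d = 200, 100, 16
true_w = rng.normal(size=d)
feats = rng.normal(size=(n_train + n_test, d))
ratings = feats @ true_w + rng.normal(scale=0.5, size=n_train + n_test)

# Only the head is fit; the "backbone" (here, the feature matrix) never changes.
w, b = fit_ridge_head(feats[:n_train], ratings[:n_train], alpha=1.0)
test_pred = feats[n_train:] @ w + b
test_plcc = plcc(test_pred, ratings[n_train:])
print(f"held-out PLCC: {test_plcc:.3f}")
```

Refitting such a head with different seeds or training subsets on the same features is what produces the multiple models per backbone whose explanation maps are then compared.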

Importantly, the paper implements sequential backward selection for channel pruning, focusing explanations on units maximally predictive for authenticity on the held-out test set, and explicitly separates authenticity-related from quality-related variance via partial correlation analysis.

Key Results

Model Performance and Specificity

Several architectures, principally Barlow Twins and DenseNet161, approach 80% of the human inter-rater noise ceiling in predicting mean authenticity ratings (PLCC ~0.62–0.63).
VGG architectures, despite moderate predictive performance, are shown (through partialing out quality) to rely almost exclusively on quality-related image cues, capturing negligible authenticity-specific residual variance.
Residual authenticity-specific predictive signals are modest even in better architectures (partial $r$ for Barlow Twins: 0.24, DenseNet161: 0.16), indicating strong confounding between perceived quality and authenticity in the dataset.
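The partialing-out step behind these numbers can be sketched as follows. This is an illustrative implementation of a standard partial correlation (residualizing both variables on the covariate, then correlating the residuals); the synthetic data and variable names are assumptions, not the paper's data or code. The toy model below predicts from quality alone, mimicking the pattern reported for the VGG models.

```python
import numpy as np

def residualize(y, covariate):
    """Remove the least-squares linear effect of the covariate (plus intercept) from y."""
    design = np.column_stack([np.ones_like(covariate), covariate])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return y - design @ coef

def partial_corr(x, y, covariate):
    """Correlation between x and y after controlling for the covariate."""
    return float(np.corrcoef(residualize(x, covariate),
                             residualize(y, covariate))[0, 1])

# Synthetic example: authenticity ratings are confounded with quality ratings.
rng = np.random.default_rng(0)
n = 500
quality = rng.normal(size=n)
authenticity = 0.8 * quality + 0.6 * rng.normal(size=n)

# Hypothetical model whose predictions track quality only.
quality_only_pred = quality + 0.3 * rng.normal(size=n)

raw_r = float(np.corrcoef(quality_only_pred, authenticity)[0, 1])
partial_r = partial_corr(quality_only_pred, authenticity, quality)
print(f"raw r = {raw_r:.2f}, partial r (quality removed) = {partial_r:.2f}")
```

A model of this kind shows a sizeable raw correlation with authenticity ratings but a partial correlation near zero, which is why raw predictive performance alone cannot establish authenticity-specific feature use.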

Stability and Agreement of Explanations

Within-architecture consistency: For Barlow Twins, ResNet152, VGG16/19, and to a lesser extent DenseNet161/EfficientNetB3, attribution maps are highly stable across random seeds (mean pixel-level $r > 0.85$ for Grad-CAM and LIME; IoU overlap at top saliency thresholds is also high).
Correlation with image authenticity: For Barlow Twins and EfficientNetB3, images rated as more authentic by humans exhibit higher explanation consistency, suggesting that certain image features drive both human and model judgments.
Cross-architecture agreement: Attribution maps across model types show weak to negligible correlation; the only consistent similarity is between the closely related VGG variants. Even among models with similar predictive performance, attributions diverge substantially. This holds for both Grad-CAM and the perturbation methods (MPM, LIME).
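The two agreement measures mentioned above, pixel-level correlation and IoU of the most salient regions, can be sketched as below. The top-10% threshold, the synthetic maps, and the function names are assumptions chosen for illustration; the paper's exact thresholds and preprocessing may differ.

```python
import numpy as np

def pixel_corr(map_a, map_b):
    """Pearson correlation between two attribution maps, computed over pixels."""
    return float(np.corrcoef(map_a.ravel(), map_b.ravel())[0, 1])

def topk_iou(map_a, map_b, top_fraction=0.1):
    """Intersection-over-union of the top `top_fraction` most salient pixels."""
    k = max(1, int(round(top_fraction * map_a.size)))
    top_a = np.zeros(map_a.size, dtype=bool)
    top_b = np.zeros(map_b.size, dtype=bool)
    top_a[np.argsort(map_a.ravel())[-k:]] = True
    top_b[np.argsort(map_b.ravel())[-k:]] = True
    intersection = np.logical_and(top_a, top_b).sum()
    union = np.logical_or(top_a, top_b).sum()
    return float(intersection / union)

# Synthetic maps: a second seed of the same architecture is modeled as a
# noisy copy, and a different architecture as an unrelated map.
rng = np.random.default_rng(0)
base = rng.random((32, 32))
same_arch_seed2 = base + 0.05 * rng.normal(size=base.shape)
other_arch = rng.random((32, 32))

within_r = pixel_corr(base, same_arch_seed2)
across_r = pixel_corr(base, other_arch)
within_iou = topk_iou(base, same_arch_seed2)
across_iou = topk_iou(base, other_arch)
print(f"within: r={within_r:.2f}, IoU={within_iou:.2f}")
print(f"across: r={across_r:.2f}, IoU={across_iou:.2f}")
```

Averaging these scores over images and over pairs of models gives one consistency value per architecture (within) or per architecture pair (across), which is the comparison summarized in the list above.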

Ensemble Methods

Ensembles (bagging/stacking across all model variants and architectures) outperform all single models in predictive accuracy (RMSE=6.04, PLCC=0.73).
Ensemble-level attribution maps obtained by MPM can spatially segregate evidence for/against authenticity, though their interpretability remains circumscribed by the above-noted lack of explanation consensus.
LIME surrogate models can moderately approximate ensemble predictions (fit $R^2 \approx 0.6$–$0.7$), but the resulting attribution maps only partly align with direct MPM-based ensemble attributions.
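A rough sketch of how occlusion-style masking at several patch sizes can yield a signed, image-level attribution map for an ensemble is given below. This is a generic occlusion scheme written for illustration, not the paper's MPM implementation: the patch sizes, the zero fill value, the averaging across scales, and the toy two-model ensemble are all assumptions.

```python
import numpy as np

def multiscale_pixel_masking(image, predict, patch_sizes=(4, 8), fill_value=0.0):
    """Occlude non-overlapping patches at several scales and record the
    drop in the prediction. Positive values mark regions that raised the
    predicted rating; negative values mark regions that lowered it.
    """
    height, width = image.shape[:2]
    baseline = predict(image)
    attribution = np.zeros((height, width), dtype=float)
    for size in patch_sizes:
        for top in range(0, height, size):
            for left in range(0, width, size):
                masked = image.copy()
                masked[top:top + size, left:left + size] = fill_value
                delta = baseline - predict(masked)
                attribution[top:top + size, left:left + size] += delta
    # Each pixel is covered once per scale, so this is a per-scale average.
    return attribution / len(patch_sizes)

def toy_ensemble(image):
    """Hypothetical ensemble: the mean of two toy 'models'."""
    model_a = image[:8, :8].mean()    # top-left brightness raises the score
    model_b = -image[8:, 8:].mean()   # bottom-right brightness lowers it
    return 0.5 * (model_a + model_b)

image = np.ones((16, 16))
attribution = multiscale_pixel_masking(image, toy_ensemble)
print("top-left mean:", attribution[:8, :8].mean())
print("bottom-right mean:", attribution[8:, 8:].mean())
```

Because `predict` is treated as a black box, the same procedure applies to a single model or to an ensemble. The cost grows with the number of patches times the cost of one forward pass through every ensemble member, which fits the scalability concern noted under the limitations.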

Implications for Psychologically-Informed Explainability

The findings decisively challenge the automatic conflation of high behavioral prediction with cognitive plausibility in DNN models of subjective judgments:

Non-identifiability of explanations: Even when a DNN provides a stable, high-fidelity prediction of human judgments and consistently highlights image regions within an architecture, these explanations are not identifiable as psychologically meaningful unless they also generalize across architectures. The lack of cross-architecture explanation consensus, as observed here, means any single model’s explanations are weak—and in practice, unreliable—evidence for underlying cues actually used by human observers.
Evidence against strong cognitive claims: The results generalize the Rashomon effect to the domain of human-like prediction tasks: multiple models with similar performance can exploit distinct and possibly non-overlapping feature sets.
Role of ensembles and future directions: Aggregating over diverse architectures via ensembles improves predictive alignment, but interpreting ensemble attributions and relating them to human cognition remains a non-trivial challenge. Establishing psychological validity will require triangulation with behavioral or neurophysiological benchmarks that can localize the critical features in question.
Separation of evaluation metrics: The paper proposes that evaluation of DNNs as models of human perception should treat predictive accuracy and explanation consistency as separable—not automatically linked—dimensions.

Limitations and Prospects

The study’s reliance on frozen backbones (due to dataset size constraints) means results are conditional on the class of features learned via large-scale supervised/self-supervised pretraining.
The partial correlation analysis reveals that much authenticity-related variance overlaps with quality; datasets with greater dissociation of these constructs are needed for more definitive conclusions about authenticity-specific feature usage.
Computational cost currently precludes full scalability of perturbation-based explainability at the ensemble level.

Future research directions include jointly modeling perceptual cue salience across humans and DNNs, new benchmarks for explanation validity, and development of integrated attribution consensus methods.

Conclusion

This study rigorously demonstrates that, in the context of image authenticity judgments, accurate prediction of aggregate human ratings by DNNs does not entail mechanistic interpretability regarding the critical perceptual evidence underlying those judgments. Explanation maps, even when internally consistent, display substantial variance across models, undermining their value as evidence for human-like cognitive mechanisms. The work sets a necessary standard for the psychological interpretation of post hoc explainability in neurocognitive modeling and highlights the imperative to dissociate prediction and explanation in evaluating AI systems as candidate models of human cognition.

