Unlearn to Explain Sonar Framework (UESF)
- The paper demonstrates that integrating targeted contrastive unlearning with an adapted LIME pipeline yields a 35% reduction in seafloor bias while maintaining 99% classification accuracy.
- UESF employs a dual-model approach, comparing baseline and unlearned classifiers to generate fine-grained, pixel-level difference maps that indicate effective bias suppression.
- The framework utilizes superpixel decomposition and a modified triplet loss to achieve transparent, interpretable attribution maps that focus on object features rather than background artifacts.
The Unlearn to Explain Sonar Framework (UESF) is a post hoc explainability framework designed to assess, quantify, and visualize the extent to which background (seafloor) bias is removed from sonar image classifiers following a targeted machine unlearning process. UESF sits atop a conventional sonar image classifier and its retrained, “unlearned” counterpart, enabling fine-grained, pixel-level analyses of which environmental features the model no longer relies on after mitigation of seafloor-induced confounding. By coupling targeted contrastive unlearning (TCU) with an adapted Local Interpretable Model-Agnostic Explanations (LIME) pipeline, UESF directly measures and visualizes the reduction in background reliance while promoting focused and interpretable attribution maps, thereby addressing generalization and transparency challenges in sonar image classification (S et al., 1 Dec 2025).
1. Conceptual Formulation and Objectives
UESF is designed to fulfill two key objectives within the bias-unlearning paradigm for sonar object detection:
- Quantification of background-forgetting: UESF computes a per-pixel “explanation difference” by juxtaposing attribution maps from the baseline classifier and the unlearned model, directly measuring the suppression of seafloor cues in the decision process.
- Enhancement of attribution faithfulness and localization: Through strategic adaptation of the LIME framework, UESF localizes saliency onto object regions rather than background artifacts, yielding more semantically meaningful and reliable attributions.
Central to UESF's workflow is its tight integration with the TCU module. The latter re-trains the model backbone using a modification of the triplet loss, explicitly treating seafloor images as negatives. This pushes the object embedding space away from background-induced bias, generating an “unlearned” classifier. UESF then applies matched LIME-based explainers to both the baseline and unlearned models, systematically isolating what has been forgotten.
2. Architectural Pipeline
UESF operates through a defined pipeline that processes each sonar image as follows:
| Step | Operation | Output |
|---|---|---|
| 1 | Input sonar image | Image $x$ |
| 2 | Inference via baseline model $f_{\text{base}}$ | Prediction $\hat{y}_{\text{base}}$, feature maps |
| 3 | Inference via unlearned model $f_{\text{unl}}$ | Prediction $\hat{y}_{\text{unl}}$ |
| 4 | LIME-based saliency map extraction | Saliency maps $S_{\text{base}}$, $S_{\text{unl}}$ |
| 5 | Background-feature selection and thresholding | Binary masks $B_{\text{base}}$, $B_{\text{unl}}$ |
| 6 | Difference-mask calculation | $D = B_{\text{base}} \land \lnot B_{\text{unl}}$ |
| 7 | Visualization and quantitative reporting | Heatmaps, metrics |
Key architectural modules include:
- Feature Selector (Superpixel Generator): Decomposes the input image $x$ into superpixels $\{s_j\}$, serving as atomic units for LIME perturbations.
- Surrogate Explainer (LIME): Fits a sparse linear model for each classifier by generating binary presence indicators over superpixels and regressing model outputs locally.
- Attribution Aggregator: Projects surrogate coefficients back onto the pixel space and thresholds attributions to generate binary background masks.
- Difference Calculator: Computes $D = B_{\text{base}} \land \lnot B_{\text{unl}}$, highlighting pixels important to the baseline model but not to the unlearned one.
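The difference calculator reduces to an element-wise Boolean operation on the two thresholded masks. A minimal numpy sketch (function name and toy arrays are illustrative, not from the paper):

```python
import numpy as np

def difference_mask(b_base: np.ndarray, b_unl: np.ndarray) -> np.ndarray:
    """Pixels salient to the baseline model but no longer to the unlearned one."""
    return np.logical_and(b_base, np.logical_not(b_unl))

# Toy 2x2 masks: the top row is salient to both models,
# the bottom-left pixel only to the baseline model.
b_base = np.array([[1, 1], [1, 0]], dtype=bool)
b_unl  = np.array([[1, 1], [0, 0]], dtype=bool)
d = difference_mask(b_base, b_unl)   # marks only the bottom-left pixel
```

Because the operation is purely element-wise, it scales unchanged from toy arrays to full-resolution sonar masks.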
3. Methodological Details
3.1 LIME Adaptation for Comparative Attribution
UESF applies LIME to both $f_{\text{base}}$ and $f_{\text{unl}}$ for a direct, image-localized explanation of unlearning effects. The adapted process involves:
- Decomposing each input into a shared superpixel map for both models.
- Generating perturbed samples by randomly masking superpixels.
- Weighting perturbed samples using an exponential kernel with width $\sigma$ set to $0.25$.
- Minimizing LIME’s objective, $\xi(x) = \arg\min_{g \in G} \, \mathcal{L}(f, g, \pi_x) + \Omega(g)$, where $\mathcal{L}$ is the locality-weighted squared loss between the classifier $f$ and the surrogate $g$ under the kernel $\pi_x$, and $\Omega(g)$ is an $\ell_1$ penalty encouraging sparse coefficients.
- Ensuring superpixel correspondences and consistent perturbation across models.
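The steps above can be sketched as a local weighted fit over binary superpixel indicators. This is a simplified stand-in, not the paper's implementation: the distance measure, ridge penalty (in place of LIME's sparse fit), and toy classifier are all assumptions; only the kernel width $\sigma = 0.25$ comes from the source.

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_lime_surrogate(predict_fn, n_superpixels, Z=None,
                       n_samples=200, sigma=0.25, ridge=1e-3):
    """Fit a local linear surrogate over binary superpixel presence vectors.

    Passing the same perturbation matrix Z for the baseline and the
    unlearned model keeps their explanations directly comparable.
    """
    if Z is None:
        # Each row randomly masks a subset of superpixels.
        Z = rng.integers(0, 2, size=(n_samples, n_superpixels)).astype(float)
    y = np.array([predict_fn(z) for z in Z])
    # Exponential locality kernel; distance here is simply the
    # fraction of masked superpixels (an illustrative choice).
    dist = 1.0 - Z.mean(axis=1)
    w = np.exp(-(dist ** 2) / sigma ** 2)
    # Weighted ridge regression as a stand-in for LIME's sparse fit.
    sw = np.sqrt(w)
    A = Z * sw[:, None]
    b = y * sw
    return np.linalg.solve(A.T @ A + ridge * np.eye(n_superpixels), A.T @ b)

# Toy classifier whose score depends almost entirely on superpixel 0:
# the surrogate should assign it by far the largest coefficient.
coef = fit_lime_surrogate(lambda z: 3.0 * z[0] + 0.1 * z[1], n_superpixels=3)
```

Reusing one `Z` across both classifiers realizes the "consistent perturbation across models" requirement: any difference in the recovered coefficients then reflects the models, not the sampling.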
3.2 Targeted Contrastive Unlearning Loss
TCU leverages a triplet loss, $\mathcal{L}_{\text{triplet}} = \max\!\big(0,\, \lVert \phi(a) - \phi(p) \rVert_2 - \lVert \phi(a) - \phi(n) \rVert_2 + \alpha\big)$, where $a$ and $p$ represent anchor and positive samples from the same object class, $n$ is always a seafloor (background) sample, $\phi$ is the embedding backbone, and $\alpha$ is the margin, enforcing a background-geometric separation in the embedding space.
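On fixed embedding vectors the loss is a one-liner; a minimal sketch, assuming a margin of 0.2 and 2-D toy embeddings (both illustrative, not values from the paper):

```python
import numpy as np

def tcu_triplet_loss(anchor, positive, negative, margin=0.2):
    """max(0, d(a, p) - d(a, n) + margin), with a seafloor sample as n."""
    d_ap = np.linalg.norm(anchor - positive)
    d_an = np.linalg.norm(anchor - negative)
    return max(0.0, d_ap - d_an + margin)

# Object embeddings close together, seafloor embedding far away:
# the margin is already satisfied, so the loss vanishes.
obj_anchor = np.array([1.0, 0.0])
obj_pos    = np.array([1.1, 0.0])
seafloor   = np.array([-1.0, 0.0])
satisfied = tcu_triplet_loss(obj_anchor, obj_pos, seafloor)

# A seafloor embedding still near the object cluster incurs a penalty,
# pushing the backbone to separate background from object features.
violated = tcu_triplet_loss(obj_anchor, obj_pos, np.array([1.0, 0.05]))
```

In training, gradients of the `violated` case are what drive the backbone to push seafloor embeddings out of the object clusters, which is exactly the disentanglement the t-SNE plots later visualize.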
3.3 End-to-End Algorithm
For each input, the process can be summarized as:
- Decompose input into superpixels.
- For each model $f \in \{f_{\text{base}}, f_{\text{unl}}\}$: generate perturbed samples, compute predictions, assign locality weights, fit the surrogate $g$, and recover its coefficients $w$.
- Project coefficients onto the pixel grid: each pixel inherits the coefficient of its superpixel, yielding saliency maps $S_{\text{base}}$ and $S_{\text{unl}}$.
- Threshold each saliency map at $\tau$ to obtain binary masks $B_{\text{base}}$ and $B_{\text{unl}}$.
- Difference-mask calculation: $D = B_{\text{base}} \land \lnot B_{\text{unl}}$.
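The projection and thresholding steps can be sketched with numpy fancy indexing; the threshold value and toy segment map below are illustrative assumptions, not values from the paper:

```python
import numpy as np

def coefficients_to_mask(coef, segments, tau=0.3):
    """Project per-superpixel coefficients to pixels, then binarize at tau.

    segments[i, j] holds the superpixel index of pixel (i, j), e.g. as
    produced by a segmentation routine such as skimage.segmentation.slic.
    """
    saliency = coef[segments]    # each pixel inherits its superpixel's weight
    return saliency > tau        # binary mask of strongly attributed pixels

# Three superpixels over a 2x3 pixel grid; superpixels 0 and 2
# exceed the threshold, superpixel 1 does not.
coef = np.array([0.9, -0.2, 0.4])
segments = np.array([[0, 0, 1],
                     [2, 2, 1]])
mask = coefficients_to_mask(coef, segments, tau=0.3)
```

Running this once per model (with the shared `segments` map) yields the two masks whose Boolean difference is the final $D$.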
4. Evaluation Metrics and Empirical Validation
- Classification metrics: Accuracy, precision, recall, and F1-score confirm no accuracy loss post-unlearning (both baseline and TCU-unlearned EfficientNet-B0 achieve $0.99$ accuracy).
- t-SNE embeddings: Visual clusters reveal clear disentanglement of seafloor features from object groups after unlearning, indicating successful bias reduction.
- Explanation-difference score: $\delta$, quantified as the proportion of seafloor-labeled pixels in the difference mask $D$, serves as a metric of bias mitigation: a 35% reduction in background attribution is observed on average.
- Visualization: Side-by-side heatmaps and difference maps (e.g., Figure 1) illustrate the shift from seafloor-focused to object-centric attribution, with forgotten background features (e.g., seabed textures, shadows) highlighted.
A plausible implication is that the difference mask $D$ captures model de-biasing effects with fine granularity, although additional metrics (deletion/insertion curves, sparsity indices) could further validate attribution faithfulness.
5. Qualitative Analyses and Interpretability
Visual comparisons demonstrate that, post-unlearning, LIME heatmaps transition from emphasizing seafloor features (ripples, textures, shadow artifacts) to highlighting object characteristics. The difference map pinpoints superpixels predominantly associated with background bias that have ceased to influence the unlearned model's outputs. This affords direct insight into the granularity of model forgetting and substantiates the claim that unlearning is occurring in intended semantic regions.
6. Advantages, Limitations, and Prospects
Advantages:
- Enables direct, pixel-level interpretability of what has been unlearned, moving beyond black-box re-training assessments.
- Model-agnostic design: any pair of pre- and post-unlearning classifiers can be analyzed identically within the LIME-based pipeline.
- Bias mitigation is achieved without loss of detection performance.
Limitations and Recommendations:
- Reliance on LIME’s perturbation sampling may limit attribution stability; integrating alternative explainability methods (e.g., SHAP, Integrated Gradients) could offer finer or more robust attributions.
- Key evaluation metrics, such as deletion/insertion curves and attribution sparsity, are not included in this initial study; their adoption is encouraged for comprehensive faithfulness assessment.
- Superpixel count, kernel width $\sigma$, and threshold $\tau$ require re-tuning when porting to new sonar datasets or transferring to other modalities (e.g., medical ultrasound).
This suggests that while UESF currently delivers transparent model unlearning in sonar image analysis, future studies should extend and systematize its attribution and validation toolkit for broader deployment.
7. Context, Impact, and Related Approaches
UESF addresses a critical challenge in sonar image classification—over-reliance on seafloor context compromising model generalization. By integrating targeted machine unlearning with interpretable explainability, UESF exemplifies a paradigm for robust, bias-aware classifier development in high-stakes imagery domains. Its modular, model-agnostic structure positions it as a reference approach for evaluating and visualizing the effect of deliberate feature forgetting, with potential applicability to other image modalities facing analogous confounding.
For further methodological and implementation details, see (S et al., 1 Dec 2025).