Provable Robust Saliency-based Explanations (2212.14106v4)
Abstract: To foster trust in machine learning models, explanations must be faithful and stable so that they provide consistent insights. Most existing work assesses stability with the $\ell_p$ distance, which diverges from human perception. Moreover, existing adversarial training (AT) defenses are computationally intensive and may lead to an arms race. To address these challenges, we introduce a novel metric that assesses the stability of the top-$k$ salient features. We then propose R2ET, which trains for stable explanations via an efficient and effective regularizer, and we analyze R2ET through multi-objective optimization to prove the numerical and statistical stability of its explanations. Furthermore, theoretical connections between R2ET and certified robustness justify R2ET's stability against all attacks. Extensive experiments across various data modalities and model architectures show that R2ET achieves superior stability against stealthy attacks and generalizes effectively across different explanation methods.
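The top-$k$ stability notion above compares which features remain among the $k$ most salient before and after a perturbation, rather than how far raw attribution scores move under an $\ell_p$ norm. As a rough, hedged illustration (not the paper's exact formulation), the sketch below scores the overlap between the top-$k$ feature sets of an original and an attacked saliency vector; the function name `topk_overlap`, the random vectors, and the perturbation scale are placeholder assumptions for illustration only.

```python
import numpy as np

def topk_overlap(saliency_orig, saliency_attacked, k):
    """Fraction of the top-k salient features that survive an attack.

    1.0 means the attack left the top-k set unchanged;
    0.0 means the two explanations share no top-k features.
    """
    top_orig = set(np.argsort(-np.abs(saliency_orig))[:k])
    top_att = set(np.argsort(-np.abs(saliency_attacked))[:k])
    return len(top_orig & top_att) / k

# Toy usage: random vectors stand in for real saliency maps of an
# input and its adversarially perturbed copy.
rng = np.random.default_rng(0)
s = rng.normal(size=100)
s_attacked = s + 0.1 * rng.normal(size=100)
print(topk_overlap(s, s_attacked, k=10))
```

Loosely speaking, an explanation is stable in this sense when the overlap stays near 1 for all admissible perturbations, which is the kind of quantity the paper's stability metric is meant to capture.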