MaNo: Exploiting Matrix Norm for Unsupervised Accuracy Estimation Under Distribution Shifts (2405.18979v3)
Abstract: Leveraging a model's outputs, specifically its logits, is a common approach to estimating the test accuracy of a pre-trained neural network on out-of-distribution (OOD) samples without access to the corresponding ground-truth labels. Despite their ease of implementation and computational efficiency, current logit-based methods are vulnerable to overconfidence, leading to prediction bias, especially under natural shifts. In this work, we first study the relationship between logits and generalization performance through the lens of the low-density separation assumption. Our findings motivate our proposed method MaNo, which (1) applies a data-dependent normalization to the logits to reduce prediction bias, and (2) takes the $L_p$ norm of the matrix of normalized logits as the estimation score. Our theoretical analysis highlights the connection between the proposed score and the model's uncertainty. We conduct an extensive empirical study on common unsupervised accuracy estimation benchmarks and demonstrate that MaNo achieves state-of-the-art performance across various architectures in the presence of synthetic, natural, or subpopulation shifts. The code is available at \url{https://github.com/Renchunzi-Xie/MaNo}.
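The two-step recipe from the abstract (normalize the logits, then take an entrywise $L_p$ norm of the resulting matrix) is compact enough to illustrate in a few lines. Below is a minimal sketch, not the paper's reference implementation: plain softmax stands in for MaNo's data-dependent normalization, and the default `p = 4` and the $(nK)^{1/p}$ rescaling are illustrative assumptions; consult the linked repository for the actual method.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def mano_style_score(logits: torch.Tensor, p: float = 4.0) -> float:
    """Hedged sketch of a MaNo-style accuracy-estimation score.

    Args:
        logits: (n, K) raw outputs of a pre-trained classifier on the
            unlabeled OOD test set (n samples, K classes).
        p: order of the entrywise matrix L_p norm (illustrative default).

    Returns:
        A scalar score that grows as predictions become more confident
        (i.e., as probability mass concentrates on fewer classes),
        intended to correlate with test accuracy under shift.
    """
    # Placeholder normalization: softmax stands in for the paper's
    # data-dependent normalization of the logits.
    q = F.softmax(logits, dim=1)  # (n, K), each row sums to 1
    n, k = q.shape
    # Entrywise L_p norm of the normalized-logit matrix, rescaled by
    # (n * K)^(1/p) so scores are comparable across test-set sizes.
    return (q.abs().pow(p).sum() / (n * k)).pow(1.0 / p).item()
```

In a typical unsupervised accuracy estimation workflow, this score would be computed on each shifted test set and its correlation with true accuracy assessed across shifts (e.g., via $R^2$ or rank correlation on a suite of corrupted datasets).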