Self-Attention through Kernel-Eigen Pair Sparse Variational Gaussian Processes (2402.01476v2)
Abstract: While the great capability of Transformers significantly boosts prediction accuracy, it can also yield overconfident predictions that call for calibrated uncertainty estimation, a problem commonly tackled with Gaussian processes (GPs). Existing works apply GPs with symmetric kernels to the attention kernel under variational inference, omitting the fact that attention kernels are in essence asymmetric. Moreover, the complexity of deriving GP posteriors remains high for large-scale data. In this work, we propose Kernel-Eigen Pair Sparse Variational Gaussian Processes (KEP-SVGP) for building uncertainty-aware self-attention, where the asymmetry of attention kernels is tackled by kernel SVD (KSVD) and a reduced complexity is attained. Through KEP-SVGP: i) the SVGP pair induced by the two sets of singular vectors from KSVD of the attention kernel fully characterizes the asymmetry; ii) using only a small set of adjoint eigenfunctions from KSVD, the SVGP posteriors can be derived from the inversion of a diagonal matrix of singular values, reducing the time complexity; iii) an evidence lower bound is derived so that variational parameters and network weights can be optimized with it. Experiments verify the excellent performance and efficiency of our method on in-distribution, distribution-shift, and out-of-distribution benchmarks.
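The abstract compresses the mechanism considerably, so the NumPy sketch below illustrates the core computational idea as we read it: take a truncated SVD of the asymmetric attention score matrix (the KSVD step), keep the two sets of singular vectors as a pair of low-rank branches, and exploit the diagonal matrix of singular values so that the only inversion needed for a posterior-style variance is O(rank). This is a hypothetical illustration, not the authors' implementation: it assumes an unnormalized dot-product score matrix in place of the paper's actual attention kernel, the function `kep_svgp_attention` and all variable names are ours, and the paper's exact posterior parameterization and ELBO training are not reproduced.

```python
# Minimal sketch of the KEP-SVGP idea described in the abstract.
# Illustrative only: names and the toy variance are assumptions,
# not the paper's actual parameterization.
import numpy as np

rng = np.random.default_rng(0)

def kep_svgp_attention(Q, K, V, rank=8):
    """Uncertainty-aware self-attention sketch.

    1) Form the asymmetric attention score matrix S.
    2) Truncated SVD of S (the KSVD step): the left/right singular
       vectors play the role of the two adjoint eigenfunction sets,
       and together they capture the kernel's asymmetry.
    3) Build a cheap posterior-style variance whose only inversion is
       of the diagonal of singular values (O(rank) work).
    """
    n, d = Q.shape
    S = Q @ K.T / np.sqrt(d)                    # asymmetric attention kernel matrix

    U, s, Vh = np.linalg.svd(S, full_matrices=False)
    U, s, Vh = U[:, :rank], s[:rank], Vh[:rank, :]   # top-`rank` singular pairs

    # Low-rank attention output: S @ V approximated by U diag(s) Vh @ V.
    mean = (U * s) @ (Vh @ V)

    # Toy posterior variance: inverting diag(s) replaces the O(n^3)
    # kernel-matrix inversion of an exact GP posterior.
    noise = 1e-2
    inv_s = 1.0 / (s + noise)                   # diagonal inversion, O(rank)
    var_left = ((U ** 2) * inv_s).sum(axis=1)   # branch from left singular vectors
    var_right = ((Vh.T ** 2) * inv_s).sum(axis=1)  # branch from right singular vectors

    return mean, var_left + var_right

# Usage: one attention head over 32 tokens of width 16.
Q, K, V = (rng.standard_normal((32, 16)) for _ in range(3))
out, var = kep_svgp_attention(Q, K, V, rank=8)
print(out.shape, var.shape)   # (32, 16) (32,)
```

The design point the sketch isolates: an exact GP posterior over n tokens requires inverting an n-by-n kernel matrix at O(n^3) cost, whereas with the top-rank kernel-eigen pairs the only inversion is of diag(s), and the two branches (left and right singular vectors) jointly retain the asymmetric structure that a single symmetric-kernel GP would discard.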