Calibrating Transformers via Sparse Gaussian Processes (2303.02444v3)
Abstract: Transformer models have achieved profound success on prediction tasks across a wide range of applications in natural language processing, speech recognition, and computer vision. Extending the Transformer's success to safety-critical domains requires calibrated uncertainty estimation, which remains under-explored. To address this, we propose Sparse Gaussian Process attention (SGPA), which performs Bayesian inference directly in the output space of multi-head attention blocks (MHAs) in Transformers to calibrate their uncertainty. It replaces the scaled dot-product operation with a valid symmetric kernel and uses sparse Gaussian process (SGP) techniques to approximate the posterior processes of MHA outputs. Empirically, on a suite of prediction tasks on text, images and graphs, SGPA-based Transformers achieve competitive predictive accuracy, while noticeably improving both in-distribution calibration and out-of-distribution robustness and detection.
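To make the abstract's core idea concrete, below is a minimal sketch of a single attention head computed as a sparse-GP posterior: keys are treated as inducing inputs, values as inducing outputs, and queries as test inputs, with an exponentiated scaled dot-product standing in for the "valid symmetric kernel". This is an illustrative assumption-laden simplification (the kernel choice, the function name `sgpa_head`, and the jitter term are ours), not the paper's exact parameterization or training procedure.

```python
import torch

def symmetric_kernel(a, b, scale):
    # Exponentiated scaled dot-product: a positive-definite, symmetric kernel
    # used here as a stand-in for softmax attention scores (an assumption,
    # not necessarily the paper's exact kernel choice).
    return torch.exp(a @ b.transpose(-2, -1) / scale)

def sgpa_head(q, k, v, jitter=1e-4):
    """One attention head as a sparse-GP posterior (illustrative sketch).

    Keys play the role of inducing inputs, values of inducing outputs,
    queries of test inputs. Returns the posterior mean (used as the
    attention output) and a per-query predictive variance that can feed
    uncertainty estimates downstream.
    """
    d = q.shape[-1]
    scale = d ** 0.5
    k_qk = symmetric_kernel(q, k, scale)             # (..., n_q, n_k)
    k_kk = symmetric_kernel(k, k, scale)             # (..., n_k, n_k)
    k_qq_diag = torch.exp((q * q).sum(-1) / scale)   # diagonal of k(q, q)

    eye = torch.eye(k.shape[-2], device=k.device)
    k_kk_inv = torch.linalg.inv(k_kk + jitter * eye) # jittered for stability

    mean = k_qk @ k_kk_inv @ v                       # GP posterior mean -> attention output
    var = k_qq_diag - ((k_qk @ k_kk_inv) * k_qk).sum(-1)  # predictive variance per query
    return mean, var.clamp_min(0.0)

# Toy usage: q, k, v of shape (batch, seq_len, head_dim)
q, k, v = (torch.randn(2, 5, 16) for _ in range(3))
out, unc = sgpa_head(q, k, v)
print(out.shape, unc.shape)  # torch.Size([2, 5, 16]) torch.Size([2, 5])
```

The key difference from standard scaled dot-product attention is the extra predictive-variance term: because the kernel is symmetric and positive definite, the posterior over MHA outputs is a valid (sparse) Gaussian process, which is what allows calibrated uncertainty to be propagated through the attention blocks.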