Calibrating Transformers via Sparse Gaussian Processes (2303.02444v3)

Published 4 Mar 2023 in cs.LG and stat.ML

Abstract: Transformer models have achieved profound success in prediction tasks across a wide range of applications in natural language processing, speech recognition and computer vision. Extending the Transformer's success to safety-critical domains requires calibrated uncertainty estimation, which remains under-explored. To address this, we propose Sparse Gaussian Process attention (SGPA), which performs Bayesian inference directly in the output space of multi-head attention blocks (MHAs) in Transformers to calibrate their uncertainty. It replaces the scaled dot-product operation with a valid symmetric kernel and uses sparse Gaussian process (SGP) techniques to approximate the posterior processes of MHA outputs. Empirically, on a suite of prediction tasks on text, images and graphs, SGPA-based Transformers achieve competitive predictive accuracy, while noticeably improving both in-distribution calibration and out-of-distribution robustness and detection.
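The abstract compresses the mechanism into two sentences, so a rough sketch may help make it concrete. Below is a minimal NumPy illustration of the general idea of kernel attention with a sparse GP posterior: keys are treated as inducing inputs and values as the corresponding inducing outputs under an assumed RBF kernel, and the attention head returns a posterior mean together with a per-query predictive variance rather than only a point output. This is not the paper's exact kernel or parameterization (SGPA builds its kernel from the query/key projections and is trained variationally); the function names and hyperparameters here are hypothetical.

    import numpy as np

    def rbf_kernel(A, B, lengthscale=1.0, variance=1.0):
        # Squared-exponential kernel between the rows of A and B.
        sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return variance * np.exp(-0.5 * sq_dists / lengthscale ** 2)

    def sgp_attention(Q, K, V, lengthscale=1.0, jitter=1e-6):
        # Illustrative sparse-GP-style attention head: keys act as inducing
        # inputs and values as the matching inducing outputs, so the head
        # returns a posterior mean (the attention output) plus a per-query
        # predictive variance instead of softmax(Q K^T / sqrt(d)) V.
        Kzz = rbf_kernel(K, K, lengthscale) + jitter * np.eye(K.shape[0])
        Kqz = rbf_kernel(Q, K, lengthscale)
        A = np.linalg.solve(Kzz, Kqz.T).T                 # K_qz K_zz^{-1}
        mean = A @ V                                      # kernel-weighted combination of values
        kqq_diag = np.diag(rbf_kernel(Q, Q, lengthscale))
        var = kqq_diag - np.einsum('ij,ij->i', A, Kqz)    # Nystrom-style predictive variance
        return mean, np.maximum(var, 0.0)

    # Toy usage: 5 queries, 7 key/value tokens, 16-d features, 32-d values.
    rng = np.random.default_rng(0)
    Q = rng.normal(size=(5, 16))
    K = rng.normal(size=(7, 16))
    V = rng.normal(size=(7, 32))
    mean, var = sgp_attention(Q, K, V)
    print(mean.shape, var.shape)   # (5, 32) (5,)

The per-query variance is what gives a Transformer built from such heads a handle on its own uncertainty, which is the calibration benefit the abstract reports.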

Citations (9)
