Variance-Reducing Couplings for Random Features (2405.16541v2)

Published 26 May 2024 in stat.ML and cs.LG

Abstract: Random features (RFs) are a popular technique to scale up kernel methods in machine learning, replacing exact kernel evaluations with stochastic Monte Carlo estimates. They underpin models ranging from efficient transformers (by approximating attention) to sparse spectrum Gaussian processes (by approximating the covariance function). Efficiency can be further improved by speeding up the convergence of these estimates: a variance reduction problem. We tackle this through the unifying lens of optimal transport, finding couplings to improve RFs defined on both Euclidean and discrete input spaces. They enjoy theoretical guarantees and sometimes provide strong downstream gains, including for scalable approximate inference on graphs. We reach surprising conclusions about the benefits and limitations of variance reduction as a paradigm, showing that other properties of the coupling should be optimised for attention estimation in efficient transformers.
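To make the setup concrete: random Fourier features estimate a shift-invariant kernel as a Monte Carlo average over random frequency vectors, and coupling those frequencies (rather than drawing them i.i.d.) can reduce the estimator's variance. The sketch below illustrates this with one classical coupling, orthogonal random features, for the Gaussian kernel. It is a minimal illustration of the variance-reduction idea only, not the optimal-transport couplings developed in the paper; all function names are our own.

```python
import numpy as np

def rff_gaussian(x, y, W):
    """Random Fourier feature estimate of the Gaussian kernel
    k(x, y) = exp(-||x - y||^2 / 2), unbiased when rows of W ~ N(0, I)."""
    m = W.shape[0]
    zx = np.concatenate([np.cos(W @ x), np.sin(W @ x)]) / np.sqrt(m)
    zy = np.concatenate([np.cos(W @ y), np.sin(W @ y)]) / np.sqrt(m)
    return zx @ zy  # equals (1/m) * sum_i cos(w_i^T (x - y))

def iid_frequencies(m, d, rng):
    """Baseline: m i.i.d. Gaussian frequency vectors."""
    return rng.standard_normal((m, d))

def orthogonal_frequencies(m, d, rng):
    """A simple coupling: m <= d mutually orthogonal directions (Haar-random),
    with norms resampled from chi(d) so each row is still marginally N(0, I)."""
    assert m <= d
    Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
    norms = np.sqrt(rng.chisquare(d, size=m))
    return Q[:m] * norms[:, None]

rng = np.random.default_rng(0)
d = m = 16
x = 0.3 * rng.standard_normal(d)
y = 0.3 * rng.standard_normal(d)
exact = np.exp(-np.linalg.norm(x - y) ** 2 / 2)

iid_est = [rff_gaussian(x, y, iid_frequencies(m, d, rng)) for _ in range(2000)]
ort_est = [rff_gaussian(x, y, orthogonal_frequencies(m, d, rng)) for _ in range(2000)]

# Both estimators are unbiased; the coupled one typically has lower variance.
print(f"exact {exact:.4f}  iid mean {np.mean(iid_est):.4f} var {np.var(iid_est):.5f}"
      f"  orth mean {np.mean(ort_est):.4f} var {np.var(ort_est):.5f}")
```

The only change between the two estimators is the joint distribution of the frequency vectors; each marginal stays N(0, I), so unbiasedness is preserved while the coupling shapes the variance. This is the degree of freedom the paper optimises via optimal transport.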

Authors (5)
  1. Isaac Reid (11 papers)
  2. Stratis Markou (15 papers)
  3. Krzysztof Choromanski (96 papers)
  4. Richard E. Turner (112 papers)
  5. Adrian Weller (150 papers)
Citations (1)