Gradient Descent is Pareto-Optimal in the Oracle Complexity and Memory Tradeoff for Feasibility Problems (2404.06720v1)

Published 10 Apr 2024 in math.OC, cs.CC, cs.DS, cs.LG, and stat.ML

Abstract: In this paper we provide oracle complexity lower bounds for finding a point in a given set using a memory-constrained algorithm that has access to a separation oracle. We assume that the set is contained within the unit $d$-dimensional ball and contains a ball of known radius $\epsilon>0$. This setup is commonly referred to as the feasibility problem. We show that to solve feasibility problems with accuracy $\epsilon \geq e^{-d^{o(1)}}$, any deterministic algorithm either uses $d^{1+\delta}$ bits of memory or must make at least $1/(d^{0.01\delta}\epsilon^{2\frac{1-\delta}{1+1.01\delta}-o(1)})$ oracle queries, for any $\delta\in[0,1]$. Additionally, we show that randomized algorithms either use $d^{1+\delta}$ memory or make at least $1/(d^{2\delta} \epsilon^{2(1-4\delta)-o(1)})$ queries for any $\delta\in[0,\frac{1}{4}]$. Because gradient descent only uses linear memory $\mathcal O(d\ln 1/\epsilon)$ but makes $\Omega(1/\epsilon^2)$ queries, our results imply that it is Pareto-optimal in the oracle complexity/memory tradeoff. Further, our results show that the oracle complexity for deterministic algorithms is always polynomial in $1/\epsilon$ if the algorithm has less than quadratic memory in $d$. This reveals a sharp phase transition since with quadratic $\mathcal O(d^2 \ln 1/\epsilon)$ memory, cutting plane methods only require $\mathcal O(d\ln 1/\epsilon)$ queries.
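
To make the upper-bound side of this tradeoff concrete, below is a minimal Python sketch of a gradient-descent-style method for the feasibility problem: query the separation oracle, step against the returned separating direction, and project back onto the unit ball. It stores only the current iterate ($\mathcal O(d)$ numbers) and, by the standard analysis, needs $\mathcal O(1/\epsilon^2)$ queries. This is an illustration under stated assumptions, not the paper's construction: the feasible set is simulated as a hidden $\epsilon$-ball, and the names `separation_oracle` and `feasibility_gd` are hypothetical.

```python
# Minimal sketch (not the paper's lower-bound construction) of the
# gradient-descent upper bound for the feasibility problem.
# Assumption for the demo: the feasible set is a hidden eps-ball inside
# the unit ball, and the oracle returns a unit normal of a halfspace
# containing the feasible set.
import numpy as np

d, eps = 50, 0.05
rng = np.random.default_rng(0)
center = rng.normal(size=d)
center *= (1 - eps) / np.linalg.norm(center)   # keep the eps-ball inside the unit ball

def separation_oracle(x):
    """Return None if x is feasible, else a unit vector g with <g, y - x> <= 0 for all feasible y."""
    if np.linalg.norm(x - center) <= eps:
        return None                              # x lies in the hidden eps-ball: feasible
    return (x - center) / np.linalg.norm(x - center)

def feasibility_gd(oracle, d, eps, max_queries=None):
    """Projected subgradient descent: O(d) memory, O(1/eps^2) oracle queries."""
    if max_queries is None:
        max_queries = int(4 / eps**2) + 1        # safe cap; ~1/eps^2 steps suffice
    x = np.zeros(d)                              # only the current iterate is stored
    for t in range(max_queries):
        g = oracle(x)
        if g is None:
            return x, t + 1                      # feasible point found after t+1 queries
        x = x - eps * g                          # step of size eps against the separating direction
        nrm = np.linalg.norm(x)
        if nrm > 1.0:
            x = x / nrm                          # project back onto the unit ball
    return None, max_queries

x_star, n_queries = feasibility_gd(separation_oracle, d, eps)
print("feasible point found:", x_star is not None, "after", n_queries, "oracle queries")
```

For contrast, a cutting-plane method maintains a much richer state (e.g., an ellipsoid or a collection of cuts, on the order of $d^2$ numbers), which is precisely the quadratic-memory regime in which the abstract's $\mathcal O(d\ln 1/\epsilon)$ query bound applies.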
