Distributionally Robust Optimization and Robust Statistics (2401.14655v1)
Abstract: We review distributionally robust optimization (DRO), a principled approach for constructing statistical estimators that hedge against the impact of deviations in the expected loss between the training and deployment environments. Many well-known estimators in statistics and machine learning (e.g., AdaBoost, LASSO, ridge regression, and dropout training) are distributionally robust in a precise sense. We hope that discussing the DRO interpretation of these well-known estimators gives statisticians who are less familiar with DRO a way into its literature, using the equivalence between classical results and their DRO formulations as a bridge. On the other hand, the topic of robustness in statistics has a rich tradition associated with removing the impact of contamination. Thus, another objective of this paper is to clarify the difference between DRO and classical statistical robustness. As we will see, these are two fundamentally different philosophies leading to completely different types of estimators. In DRO, the statistician hedges against an environment shift that occurs after the decision is made; thus DRO estimators tend to be pessimistic in an adversarial setting, leading to a min-max type formulation. In classical robust statistics, the statistician seeks to correct contamination that occurred before a decision is made; thus robust statistical estimators tend to be optimistic, leading to a min-min type formulation.
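The min-max versus min-min contrast can be sketched in symbols. The notation below is an assumption made for illustration and does not come from the abstract: $P_n$ is the empirical distribution of the data, $D$ is some discrepancy between distributions (for example a Wasserstein distance or a divergence), $\delta \ge 0$ is an uncertainty radius, $\ell(\theta; Z)$ is the loss of decision $\theta$ on observation $Z$, and $\Theta$ is the set of admissible decisions.

```latex
\begin{align*}
  % DRO (pessimistic): an adversary picks the worst nearby distribution
  \hat{\theta}_{\mathrm{DRO}}
    &\in \arg\min_{\theta \in \Theta} \;
         \sup_{Q \,:\, D(Q, P_n) \le \delta} \;
         \mathbb{E}_{Q}\bigl[\ell(\theta; Z)\bigr], \\[4pt]
  % Robust statistics (optimistic): the statistician picks the most
  % favorable nearby distribution, i.e. a "cleaned" version of the data
  \hat{\theta}_{\mathrm{RS}}
    &\in \arg\min_{\theta \in \Theta} \;
         \inf_{Q \,:\, D(Q, P_n) \le \delta} \;
         \mathbb{E}_{Q}\bigl[\ell(\theta; Z)\bigr].
\end{align*}
```

In the first problem the inner supremum models a shift that may occur after $\theta$ is chosen. In the second, the inner infimum models the removal of contamination already present in $P_n$. Setting $\delta = 0$ reduces both to ordinary empirical risk minimization.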