The Implicit Bias of Heterogeneity towards Invariance: A Study of Multi-Environment Matrix Sensing (2403.01420v3)
Abstract: Models are expected to engage in invariance learning, which means identifying the core relations that remain consistent across varying environments so that predictions are safe, robust, and fair. While existing works design specific algorithms to realize invariance learning, we show that models have the potential to learn invariance through standard training procedures. In other words, this paper studies the implicit bias of Stochastic Gradient Descent (SGD) over heterogeneous data and shows that this implicit bias drives model learning towards an invariant solution; we call the phenomenon implicit invariance learning. Specifically, we theoretically investigate the multi-environment low-rank matrix sensing problem, where in each environment the signal comprises (i) a lower-rank invariant part shared across all environments and (ii) a significantly varying, environment-dependent spurious component. The key insight is that, by simply running large-step-size, large-batch SGD sequentially on each environment without any explicit regularization, the oscillation caused by heterogeneity provably prevents the model from learning the spurious signals, and the model reaches the invariant solution after a certain number of iterations. In contrast, a model trained with pooled SGD over all data simultaneously learns both the invariant and the spurious signals. Overall, we unveil an implicit bias that arises from the symbiosis between data heterogeneity and modern algorithms, which is, to the best of our knowledge, the first result of this kind in the literature.
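To make the setup concrete, below is a minimal, self-contained Python sketch of one way the multi-environment matrix sensing problem and the two training procedures could be instantiated. The symmetric factorization M = U Uᵀ, the Gaussian sensing model, the cyclic environment schedule, and every dimension, scale, and step size below are our own illustrative assumptions rather than the paper's construction; the toy scales are chosen so the run stays numerically stable, not to match the large-step-size regime analyzed in the theory.

```python
# Hypothetical sketch of multi-environment matrix sensing with sequential vs. pooled SGD.
import numpy as np

rng = np.random.default_rng(0)

d, r_inv, n_env, n_meas = 20, 2, 3, 500   # dimension, invariant rank, environments, measurements per env

# Invariant low-rank signal shared by all environments.
B = rng.standard_normal((d, r_inv)) / np.sqrt(d)
M_inv = B @ B.T

# Environment-dependent rank-one spurious components that vary strongly across environments.
spurious = []
for _ in range(n_env):
    c = rng.standard_normal(d)
    c /= np.linalg.norm(c)
    spurious.append(2.0 * np.outer(c, c))

# Gaussian sensing matrices and linear measurements y_i = <A_i, M_inv + S_e> in each environment.
def make_env(S_e):
    A = rng.standard_normal((n_meas, d, d))
    y = np.einsum('kij,ij->k', A, M_inv + S_e)
    return A, y

envs = [make_env(S_e) for S_e in spurious]

def grad(U, A, y):
    """Full-batch gradient of (1/2n) * sum_i (<A_i, U U^T> - y_i)^2 with respect to U."""
    residual = np.einsum('kij,ij->k', A, U @ U.T) - y
    G = np.einsum('k,kij->ij', residual, A) / len(y)
    return (G + G.T) @ U

def sequential_sgd(eta, T, U0):
    """Large-batch gradient steps taken on one environment at a time, cycling through them."""
    U = U0.copy()
    for t in range(T):
        A, y = envs[t % n_env]
        U -= eta * grad(U, A, y)
    return U

def pooled_sgd(eta, T, U0):
    """Gradient steps on the loss pooled (averaged) over all environments at once."""
    U = U0.copy()
    for _ in range(T):
        U -= eta * sum(grad(U, A, y) for A, y in envs) / n_env
    return U

U0 = 1e-3 * rng.standard_normal((d, d))   # small random initialization of the factor U
U_seq = sequential_sgd(eta=0.02, T=600, U0=U0)
U_pool = pooled_sgd(eta=0.02, T=600, U0=U0)

def rel_err_to_invariant(U):
    return np.linalg.norm(U @ U.T - M_inv) / np.linalg.norm(M_inv)

print("sequential SGD, relative distance to invariant signal:", rel_err_to_invariant(U_seq))
print("pooled SGD,     relative distance to invariant signal:", rel_err_to_invariant(U_pool))
```

Under the abstract's narrative, the sequential schedule should leave the environment-varying spurious directions unfit while the pooled run fits them as well; since this sketch does not operate in the exact step-size and scale regime the theory requires, its printed numbers are meant to illustrate the two procedures rather than reproduce the theorem.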