Transfer Learning Beyond Bounded Density Ratios (2403.11963v1)
Abstract: We study the fundamental problem of transfer learning where a learning algorithm collects data from some source distribution $P$ but needs to perform well with respect to a different target distribution $Q$. A standard change of measure argument implies that transfer learning happens when the density ratio $dQ/dP$ is bounded. Yet, prior thought-provoking works by Kpotufe and Martinet (COLT, 2018) and Hanneke and Kpotufe (NeurIPS, 2019) demonstrate cases where the ratio $dQ/dP$ is unbounded, but transfer learning is possible. In this work, we focus on transfer learning over the class of low-degree polynomial estimators. Our main result is a general transfer inequality over the domain $\mathbb{R}^n$, proving that non-trivial transfer learning for low-degree polynomials is possible under very mild assumptions, going well beyond the classical assumption that $dQ/dP$ is bounded. For instance, it always applies if $Q$ is a log-concave measure and the inverse ratio $dP/dQ$ is bounded. To demonstrate the applicability of our inequality, we obtain new results in the settings of: (1) the classical truncated regression setting, where $dQ/dP$ equals infinity, and (2) the more recent out-of-distribution generalization setting for in-context learning linear functions with transformers. We also provide a discrete analogue of our transfer inequality on the Boolean Hypercube $\{-1,1\}^n$, and study its connections with the recent problem of Generalization on the Unseen of Abbe, Bengio, Lotfi and Rizk (ICML, 2023). Our main conceptual contribution is that the maximum influence of the error of the estimator $\widehat{f}-f^*$ under $Q$, $\mathrm{I}_{\max}(\widehat{f}-f^*)$, acts as a sufficient condition for transferability; when $\mathrm{I}_{\max}(\widehat{f}-f^*)$ is appropriately bounded, transfer is possible over the Boolean domain.
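For readers less familiar with the two quantities the abstract leans on, the following sketch records the standard change-of-measure bound that explains why a bounded $dQ/dP$ suffices for transfer, together with the usual (uniform-measure) Fourier formula for the maximum influence of a function $g:\{-1,1\}^n\to\mathbb{R}$. These are textbook definitions, not statements lifted from the paper itself:

$$
\mathbb{E}_{x\sim Q}\big[(\widehat{f}(x)-f^*(x))^2\big]
\;=\; \mathbb{E}_{x\sim P}\Big[\tfrac{dQ}{dP}(x)\,(\widehat{f}(x)-f^*(x))^2\Big]
\;\le\; \Big\|\tfrac{dQ}{dP}\Big\|_{\infty}\,\mathbb{E}_{x\sim P}\big[(\widehat{f}(x)-f^*(x))^2\big],
\qquad
\mathrm{I}_{\max}(g) \;=\; \max_{i\in[n]}\;\sum_{S\subseteq[n]:\,i\in S}\widehat{g}(S)^2 .
$$

When $\|dQ/dP\|_{\infty}=\infty$, as in truncated regression, the left-hand bound is vacuous; the paper's contribution is a transfer inequality that remains informative in this regime, with $\mathrm{I}_{\max}(\widehat{f}-f^*)$ playing the role of the sufficient condition on the Boolean domain.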
- Learning to reason with neural networks: Generalization, unseen data and Boolean measures. Advances in Neural Information Processing Systems, 35:2709–2722, 2022.
- Generalization on the unseen, logic reasoning and degree curriculum. arXiv preprint arXiv:2301.13105, 2023.
- Transformers learn to implement preconditioned gradient descent for in-context learning. arXiv preprint arXiv:2306.00297, 2023.
- Linear attention is (maybe) all you need (to understand transformer optimization). arXiv preprint arXiv:2310.01082, 2023.
- Optimal learners for realizable regression: PAC learning and online learning. Advances in Neural Information Processing Systems, 36, 2024.
- What learning algorithm is in-context learning? Investigations with linear models. arXiv preprint arXiv:2211.15661, 2022.
- Transformers as statisticians: Provable in-context learning with in-context algorithm selection. arXiv preprint arXiv:2306.04637, 2023.
- A theory of learning from different domains. Machine Learning, 79:151–175, 2010.
- Analysis of representations for domain adaptation. Advances in Neural Information Processing Systems, 19, 2006.
- Implicit bias of gradient descent for wide two-layer neural networks trained with the logistic loss. In Conference on Learning Theory, pages 1305–1338. PMLR, 2020.
- Learning bounds for importance weighting. Advances in Neural Information Processing Systems, 23, 2010.
- A. Clifford Cohen. Truncated and censored samples: theory and applications. CRC Press, 1991.
- Distributional and $L^q$ norm inequalities for polynomials over convex bodies in $\mathbb{R}^n$. Mathematical Research Letters, 8(3):233–248, 2001.
- Transfer learning for nonparametric classification: Minimax rate and adaptive classifier. The Annals of Statistics, 49(1), 2021.
- Efficient statistics, in high dimensions, from truncated samples. In 2018 IEEE 59th Annual Symposium on Foundations of Computer Science (FOCS), pages 639–649. IEEE, 2018.
- Computationally and statistically efficient truncated regression. In Conference on Learning Theory, pages 955–960. PMLR, 2019.
- Bounding the average sensitivity and noise sensitivity of polynomial threshold functions. In Proceedings of the forty-second ACM symposium on Theory of computing, pages 533–542, 2010.
- A statistical Taylor theorem and extrapolation of truncated densities. In Conference on Learning Theory, pages 1395–1398. PMLR, 2021.
- Impossibility theorems for domain adaptation. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pages 129–136. JMLR Workshop and Conference Proceedings, 2010.
- Detecting low-degree truncation. arXiv preprint arXiv:2402.08133, 2024.
- Testing convex truncation. In Proceedings of the 2023 Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 4050–4082. SIAM, 2023.
- Sparse covers for sums of indicators. Probability Theory and Related Fields, 162:679–705, 2015.
- Truncated linear regression in high dimensions. Advances in Neural Information Processing Systems, 33:10338–10347, 2020.
- Improved approximation of linear threshold functions. Computational Complexity, 22(3):623–677, 2013.
- Efficient truncated linear regression with unknown noise variance. Advances in Neural Information Processing Systems, 34:1952–1963, 2021.
- (S)GD over diagonal linear networks: Implicit regularisation, large stepsizes and edge of stability. arXiv preprint arXiv:2302.08982, 2023.
- P. Erdős. On a lemma of Littlewood and Offord. Bulletin of the American Mathematical Society, 51(12):898–902, 1945.
- Efficient algorithms for learning from coarse labels. In Conference on Learning Theory, pages 2060–2079. PMLR, 2021.
- Combinatorial anti-concentration inequalities, with applications. In Mathematical Proceedings of the Cambridge Philosophical Society, volume 171, pages 227–248. Cambridge University Press, 2021.
- Efficient parameter estimation of truncated boolean product distributions. In Conference on Learning Theory, pages 1586–1600. PMLR, 2020.
- Perfect sampling from pairwise comparisons. arXiv preprint arXiv:2211.12868, 2022.
- Francis Galton. An examination into the registered speeds of American trotting horses, with remarks on their value as hereditary data. Proceedings of the Royal Society of London, 62(379-387):310–315, 1898.
- Learning hard-constrained models with one sample. arXiv preprint arXiv:2311.03332, 2023.
- Beyond perturbations: Learning guarantees with arbitrary adversarial test examples. Advances in Neural Information Processing Systems, 33:15859–15870, 2020.
- In search of lost domain generalization. arXiv preprint arXiv:2007.01434, 2020.
- Anti-concentration of polynomials: Dimension-free covariance bounds and decay of fourier coefficients. Journal of Functional Analysis, 283(9):109639, 2022.
- On calibration of modern neural networks. In International Conference on Machine Learning, pages 1321–1330. PMLR, 2017.
- What can transformers learn in-context? A case study of simple function classes. Advances in Neural Information Processing Systems, 35:30583–30598, 2022.
- On the value of target data in transfer learning. Advances in Neural Information Processing Systems, 32, 2019.
- Bounding the sensitivity of polynomial threshold functions. arXiv preprint arXiv:0909.5175, 2009.
- Limits of model selection under transfer learning. arXiv preprint arXiv:2305.00152, 2023.
- Early-stopped neural networks are consistent. Advances in Neural Information Processing Systems, 34:1805–1817, 2021.
- A data-based perspective on transfer learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3613–3622, 2023.
- The implicit bias of gradient descent on nonseparable data. In Conference on Learning Theory, pages 1772–1798. PMLR, 2019.
- Daniel M Kane. A structure theorem for poorly anticoncentrated Gaussian chaoses and applications to the study of polynomial threshold functions. In 2012 IEEE 53rd Annual Symposium on Foundations of Computer Science, pages 91–100. IEEE, 2012.
- Efficient learning with arbitrary covariate shift. In Algorithmic Learning Theory, pages 850–864. PMLR, 2021.
- Learning halfspaces with malicious noise. Journal of Machine Learning Research, 10(12), 2009.
- Marginal singularity, and the benefits of labels in covariate-shift. In Conference On Learning Theory, pages 1882–1886. PMLR, 2018.
- Samory Kpotufe. Lipschitz density-ratios, structured data, and data-driven tuning. In Artificial Intelligence and Statistics, pages 1320–1328. PMLR, 2017.
- Deep neural networks tend to extrapolate predictably. arXiv preprint arXiv:2310.00873, 2023.
- Testable learning with distribution shift. arXiv preprint arXiv:2311.15142, 2023.
- Learning and covering sums of independent random variables with unbounded support. Advances in Neural Information Processing Systems, 35:25185–25197, 2022.
- Efficient truncated statistics with unknown truncation. In 2019 IEEE 60th Annual Symposium on Foundations of Computer Science (FOCS), pages 1578–1595. IEEE, 2019.
- Transfer learning in large-scale gaussian graphical models with false discovery rate control. Journal of the American Statistical Association, pages 1–13, 2022.
- Transformers as algorithms: Generalization and implicit model selection in in-context learning. arXiv preprint arXiv:2301.07067, 2023.
- Gradient descent maximizes the margin of homogeneous neural networks. arXiv preprint arXiv:1906.05890, 2019.
- On the number of real roots of a random algebraic equation. II. In Mathematical Proceedings of the Cambridge Philosophical Society, volume 35, pages 133–148. Cambridge University Press, 1939.
- Gradient descent with early stopping is provably robust to label noise for overparameterized neural networks. In International Conference on Artificial Intelligence and Statistics, pages 4313–4324. PMLR, 2020.
- Minimax lower bounds for transfer learning with linear and one-hidden layer neural networks. Advances in Neural Information Processing Systems, 33:1959–1969, 2020.
- Domain adaptation: Learning bounds and algorithms. arXiv preprint arXiv:0902.3430, 2009.
- Multiple source adaptation and the rényi divergence. arXiv preprint arXiv:1205.2628, 2012.
- Anti-concentration for polynomials of independent random variables. arXiv preprint arXiv:1507.00829, 2015.
- Noise stability of functions with low influences: invariance and optimality. In 46th Annual IEEE Symposium on Foundations of Computer Science (FOCS’05), pages 21–30. IEEE, 2005.
- Optimally tackling covariate shift in RKHS-based nonparametric regression. The Annals of Statistics, 51(2):738–761, 2023.
- On the analysis of EM for truncated mixtures of two Gaussians. In Algorithmic Learning Theory, pages 634–659. PMLR, 2020.
- Small ball probability, inverse theorems, and applications. Erdős centennial, pages 409–463, 2013.
- Ryan O’Donnell. Analysis of Boolean Functions. Cambridge University Press, 2014.
- Karl Pearson. On the systematic fitting of frequency curves. Biometrika, 2:2–7, 1902.
- A new similarity measure for covariate shift with applications to nonparametric regression. In International Conference on Machine Learning, pages 17517–17530. PMLR, 2022.
- Implicit bias of sgd for diagonal linear networks: a provable benefit of stochasticity. Advances in Neural Information Processing Systems, 34:29218–29230, 2021.
- Inverse density as an inverse problem: The Fredholm equation approach. Advances in Neural Information Processing Systems, 26, 2013.
- Adaptive transfer learning. The Annals of Statistics, 49(6):3618–3649, 2021.
- Alfréd Rényi. On measures of entropy and information. In Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Contributions to the Theory of Statistics, volume 4, pages 547–562. University of California Press, 1961.
- A survey on domain adaptation theory: learning bounds and theoretical guarantees. arXiv preprint arXiv:2004.11829, 2020.
- Real advantage. ACM Transactions on Computation Theory (TOCT), 5(4):1–8, 2013.
- Tackling combinatorial distribution shift: A matrix completion perspective. In The Thirty Sixth Annual Conference on Learning Theory, pages 3356–3468. PMLR, 2023.
- Linear transformers are secretly fast weight programmers. In International Conference on Machine Learning, pages 9355–9366. PMLR, 2021.
- Machine learning in non-stationary environments: Introduction to covariate shift adaptation. MIT Press, 2012.
- Covariate shift adaptation by importance weighted cross validation. Journal of Machine Learning Research, 8(5), 2007.
- Density ratio estimation in machine learning. Cambridge University Press, 2012.
- The pitfalls of simplicity bias in neural networks. Advances in Neural Information Processing Systems, 33:9573–9585, 2020.
- On the theory of transfer learning: The importance of task diversity. Advances in Neural Information Processing Systems, 33:7852–7862, 2020.
- Transformers learn in-context by gradient descent. In International Conference on Machine Learning, pages 35151–35174. PMLR, 2023.
- Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017.
- A survey of transfer learning. Journal of Big Data, 3(1):1–40, 2016.
- How neural networks extrapolate: From feedforward to graph neural networks. arXiv preprint arXiv:2009.11848, 2020.
- A theory of transfer learning with applications to active learning. Machine Learning, 90:161–189, 2013.
- Bounds on the minimax rate for estimating a prior over a VC class from independent learning tasks. Theoretical Computer Science, 716:124–140, 2018.
- On early stopping in gradient descent learning. Constructive Approximation, 26:289–315, 2007.
- Trained transformers learn linear models in-context. arXiv preprint arXiv:2306.09927, 2023.
- A comprehensive survey on transfer learning. Proceedings of the IEEE, 109(1):43–76, 2020.