Scaling laws for learning with real and surrogate data (2402.04376v3)

Published 6 Feb 2024 in cs.LG, cs.AI, and stat.ML

Abstract: Collecting large quantities of high-quality data can be prohibitively expensive or impractical, and a bottleneck in machine learning. One may instead augment a small set of $n$ data points from the target distribution with data from more accessible sources, e.g. data collected under different circumstances or synthesized by generative models. We refer to such data as "surrogate data". We study a weighted empirical risk minimization (ERM) approach for integrating surrogate data into training. We mathematically analyze this method under several classical statistical models, and validate our findings empirically on datasets from different domains. Our main findings are: $(i)$ Integrating surrogate data can significantly reduce the test error on the original distribution. Surprisingly, this can happen even when the surrogate data is unrelated to the original data. We trace this behavior back to the classical Stein's paradox. $(ii)$ In order to reap the benefit of surrogate data, it is crucial to use optimally weighted ERM. $(iii)$ The test error of models trained on mixtures of real and surrogate data is approximately described by a scaling law. This scaling law can be used to predict the optimal weighting scheme, and to choose the amount of surrogate data to add.

Authors (3)
  1. Ayush Jain (49 papers)
  2. Andrea Montanari (165 papers)
  3. Eren Sasoglu (15 papers)
Citations (10)

Summary

  • The paper introduces a scaling law that predicts test error reductions when integrating surrogate data with real data.
  • The methodology employs optimally weighted empirical risk minimization to balance the influence of surrogate and real data.
  • Empirical experiments across domains like image classification and sentiment analysis validate the performance gains from this approach.

Overview of "Scaling Laws for Learning with Real and Surrogate Data"

The paper investigates the potential of using surrogate data to enhance machine learning models, particularly when collecting large amounts of high-quality data is expensive or impractical. The authors propose a framework for integrating surrogate data—data that comes from easily accessible sources such as public datasets, data collected under different circumstances, or data generated by models—into the training process. They employ both theoretical models and empirical studies to explore the effects of this integration.
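
To make the setup concrete, a weighted ERM objective has the following general shape (the notation here is ours, and the exact weighting used in the paper may differ): given $n$ real samples $z_1, \dots, z_n$ and $m$ surrogate samples $\tilde z_1, \dots, \tilde z_m$, one solves

$$\hat{\theta}(\alpha) \,=\, \arg\min_{\theta} \; \frac{1-\alpha}{n} \sum_{i=1}^{n} \ell(\theta; z_i) \,+\, \frac{\alpha}{m} \sum_{j=1}^{m} \ell(\theta; \tilde z_j), \qquad \alpha \in [0, 1],$$

where $\ell$ is the loss and the mixing weight $\alpha$ controls the influence of the surrogate data; "optimally weighted" ERM chooses $\alpha$ to minimize test error on the target distribution.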

Key Findings

  1. Reduction in Test Error: Integrating surrogate data can significantly decrease test error on the target distribution. This is examined through weighted empirical risk minimization over both real and surrogate data.
  2. Optimal Weighting is Crucial: Surrogate data pays off only when the ERM objective weights it appropriately relative to the real data; naively pooling the two can forfeit the gains.
  3. Scaling Law for Prediction: The test error of models trained on mixtures of real and surrogate data is well described by a scaling law. This law can be used to predict the optimal weighting and to decide how much surrogate data is worth adding (a rough illustration of fitting such a law follows this list).
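
As a rough illustration of how a fitted scaling law can guide data-collection decisions, the sketch below fits a simple additive power law, $\mathrm{err}(n) \approx a\, n^{-b} + c$, to test-error measurements and extrapolates. The functional form, the data, and all names are illustrative assumptions, not the paper's exact law.

```python
# Illustrative only: fit a simple power-law scaling curve to observed
# test errors and extrapolate. The form err(n) ~ a * n^{-b} + c is an
# assumption for illustration, not the paper's exact scaling law.
import numpy as np
from scipy.optimize import curve_fit

def power_law(n, a, b, c):
    # a * n^{-b} decays with sample size; c is an irreducible error floor.
    return a * np.power(n, -b) + c

# Hypothetical measurements: (sample count, test error) pairs.
n_obs = np.array([100, 200, 400, 800, 1600], dtype=float)
err_obs = np.array([0.40, 0.31, 0.25, 0.21, 0.185])

params, _ = curve_fit(power_law, n_obs, err_obs, p0=[1.0, 0.5, 0.1])
a, b, c = params
print(f"fitted: a={a:.3f}, b={b:.3f}, floor c={c:.3f}")
print(f"predicted error at n=10000: {power_law(10000, *params):.3f}")
```

The same idea extends to fitting the error surface over real-sample count, surrogate-sample count, and the weight, which is what makes the law useful for choosing a weighting scheme in advance.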

Methodological Approach

  • The authors propose a weighted empirical risk minimization scheme that integrates surrogate data into model training, developed into a framework that covers both simulated and real-world datasets (a minimal implementation sketch follows this list).
  • Empirical experiments across domains such as sentiment analysis and image classification validate the theoretical scaling laws and the effectiveness of surrogate data integration.
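
A minimal sketch of one way such a weighted scheme can be implemented, assuming pooled training with per-sample weights; the datasets, model, and weight value below are placeholders, not the paper's experimental setup.

```python
# Minimal weighted-ERM sketch: train one model on pooled real + surrogate
# data, giving surrogate points a smaller per-sample weight. All data and
# the weight alpha below are placeholders for illustration.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Placeholder "real" data: small, drawn from the target distribution.
X_real = rng.normal(size=(100, 20))
y_real = (X_real[:, 0] > 0).astype(int)

# Placeholder "surrogate" data: larger but label-noisier stand-in.
X_surr = rng.normal(size=(1000, 20))
y_surr = (X_surr[:, 0] + 0.5 * rng.normal(size=1000) > 0).astype(int)

alpha = 0.3  # per-sample weight on surrogate points; would be tuned in practice

X = np.vstack([X_real, X_surr])
y = np.concatenate([y_real, y_surr])
w = np.concatenate([np.ones(len(y_real)), alpha * np.ones(len(y_surr))])

model = LogisticRegression(max_iter=1000).fit(X, y, sample_weight=w)
print(f"accuracy on real data: {model.score(X_real, y_real):.3f}")
```

In practice the weight would be tuned, e.g. against held-out real data, which is where the scaling-law prediction above becomes useful.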

Theoretical Contributions

  • Low-Dimensional Asymptotics: The authors explore the behavior of the proposed method under low-dimensional asymptotics, ensuring the theoretical underpinning of their scaling laws.
  • Non-Parametric Models: The paper also extends to non-parametric settings, allowing for function estimation when the underlying models are not confined to specific parametric forms.
  • High-Dimensional Linear Regression: The authors analyze their method in high-dimensional linear regression, showing how surrogate data can be beneficial in the proportional regime where the sample size and the dimension grow together.
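
The Stein's-paradox connection flagged in the abstract can be made concrete with a classical fact (stated here in its textbook form, not as the paper's own derivation): for $\hat\theta \sim \mathcal{N}(\theta, I_d)$ with $d \ge 3$, the James-Stein estimator that shrinks toward an arbitrary fixed point $\mu$,

$$\hat\theta_{\mathrm{JS}} \,=\, \mu + \left(1 - \frac{d-2}{\lVert \hat\theta - \mu \rVert^2}\right)(\hat\theta - \mu),$$

has strictly smaller squared-error risk than $\hat\theta$ for every $\theta$, no matter how unrelated $\mu$ is to $\theta$. This mirrors the paper's finding that even surrogate data unrelated to the target distribution can reduce test error.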

Implications and Future Directions

The implications of this research are substantial for practical applications in AI, particularly in scenarios where data from the target distribution is scarce or expensive to obtain. The proposed framework offers a systematic way to leverage readily available surrogate data, thereby potentially reducing the cost and effort associated with data collection.

Theoretical insights from the paper suggest pathways for further research, including:

  • Exploration of more sophisticated surrogate data generation techniques.
  • Expansion of the scaling laws to more complex models beyond the linear and parametric frameworks.
  • Investigation into the impact of different types of distribution shifts between real and surrogate data.

The paper contributes fundamentally to the understanding of how integrating surrogate data can help optimize the training of machine learning models and offers a robust framework for predicting the benefits and limitations of such approaches.