- The paper rigorously characterizes how synthetic data contamination challenges ERM optimality, identifying variance phase transitions when α exceeds 1/2.
- The paper demonstrates that in PAC learning, repeated ERM under synthetic contamination leads to a nonzero asymptotic error whenever the contamination rate exceeds 1/2.
- The paper introduces universal non-ERM algorithms that achieve vanishing error rates despite pervasive synthetic data, though practical deployment challenges remain.
Learning from Synthetic Data: Fundamental Limitations of Empirical Risk Minimization
Problem Setting and Motivation
This paper provides a rigorous learning-theoretic examination of the effects of synthetic (LLM-generated) data contamination on two canonical learning paradigms: mean estimation and PAC learning. The rapid proliferation of generative models has resulted in widespread contamination of real-world datasets by LLM outputs, producing profound implications for iterative model training, transfer learning, and the reuse of large shared corpora. The authors formalize a setting where, in each iteration, fresh “natural” data is recursively contaminated with synthetic data produced by the previous model, at contamination rate α∈[0,1], and the learner is oblivious to data provenance.
The core learning goal is to construct algorithms with continuously improving generalization error as training progresses, in the presence of arbitrary and potentially high fractions of synthetic data. This work departs from prior studies which focus largely on empirical performance in LLM settings or adversarial/label-noise models, and instead grounds the investigation in learning theory, characterizing the statistical and algorithmic consequences for foundational estimation and classification problems.
Mean Estimation Under Synthetic Data Contamination
The analysis begins with the mean estimation problem for d-dimensional distributions. Let D_0 (natural) have mean μ and let D_1 (synthetic) have mean equal to the previous round's estimate Y_{t−1}. At each step, a (1−α) proportion of samples comes from D_0 and an α proportion from D_1, with the aggregate empirical mean Y_t forming the typical ERM-style estimate.
The authors derive an exact characterization for the variance of the empirical mean estimator under this contamination regime for all α. They show that for α>0, uniform weighting across rounds—typical of ERM or augmentation workflows—does not yield the minimum-variance unbiased estimator (MVUE) except in the i.i.d. setting (α=0). Specifically, for α≤1/2, the variance decays as O(1/t); for α>1/2, uniform weighting exhibits a phase transition with significantly slower decay, approaching constant error as α→1.
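The phase transition can be probed with a minimal Monte Carlo sketch. The instantiation below is an illustrative assumption, not the paper's exact setup: 1-D Gaussian natural data with true mean 0, synthetic samples centered at the previous uniform estimate, round 1 fully natural, and uniform weighting of all batch means; the per-round sample size `n` and other constants are arbitrary choices.

```python
import numpy as np

def uniform_erm_mse(alpha, n=50, rounds=40, trials=1000, sigma=1.0, seed=0):
    """Monte Carlo MSE of the uniformly weighted mean estimator after
    `rounds` generations of recursive contamination (illustrative model:
    natural data ~ N(0, sigma^2); each later round draws an alpha
    fraction of its n samples from N(Y_{t-1}, sigma^2), where Y_{t-1}
    is the previous uniform estimate, and the rest from N(0, sigma^2))."""
    rng = np.random.default_rng(seed)
    n_syn = int(round(alpha * n))
    n_nat = n - n_syn
    finals = np.empty(trials)
    for k in range(trials):
        batch_means = [rng.normal(0.0, sigma, size=n).mean()]  # round 1: natural only
        y = batch_means[0]
        for _ in range(rounds - 1):
            nat = rng.normal(0.0, sigma, size=n_nat)
            syn = rng.normal(y, sigma, size=n_syn)  # synthetic data mimics last estimate
            batch_means.append(np.concatenate([nat, syn]).mean())
            y = float(np.mean(batch_means))         # uniform weighting across rounds
        finals[k] = y
    return float(np.mean(finals**2))
```

With these defaults, the MSE at small α keeps shrinking as rounds accumulate, while at α close to 1 it stays pinned near the single-round level, mirroring the slowdown above the α = 1/2 threshold described in the text.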
Crucially, the authors prove existence of alternative weighting schemes achieving strictly lower variance than uniform weighting for α∈(α∗,1] (Theorem~\ref{thm:not_mvue}). Further, in the recursive limit (α=1), uniform weighting’s variance lower bound matches the constants observed in earlier empirical collapse studies, solidifying the suboptimality of simple ERM for contaminated data. These findings sever the equivalence between ERM and optimal estimation in the presence of even moderate synthetic contamination, including for canonical families such as Gaussians.
PAC Learning and Failure of ERM
The second major contribution establishes stark failures of repeated ERM in the classical PAC learning model under synthetic contamination. The authors model recursive learning where, in each round, examples are labeled either by the true (unknown) concept f* (with probability 1−α) or by the prior-round model f_{t−1} (with probability α).
They construct a simple threshold example where, above a critical contamination rate (α > 1/2), the generalization error of ERM classifiers stalls at a nonzero asymptotic lower bound even as the number of generations grows. This behavior is quantified via a biased random walk: the probability that the learner ever recovers the correct label for specific points is bounded away from 1, reflecting a regime in which synthetic-label feedback drowns out the true signal. The lower bound extends to all hypothesis classes of finite VC dimension.
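The random-walk intuition can be checked numerically with a toy model (an assumption for illustration, not the paper's exact construction): each round a fixed point receives the correct label with probability 1 − α and a stale synthetic label otherwise, and the learner recovers the point once correct labels ever outnumber incorrect ones, i.e. once a ±1 walk with upward probability 1 − α hits +1. For α > 1/2 the classical gambler's-ruin formula gives hitting probability (1 − α)/α < 1.

```python
import numpy as np

def recovery_probability(alpha, max_steps=1000, trials=5000, seed=0):
    """Monte Carlo probability that a +/-1 random walk with upward
    probability (1 - alpha) ever reaches +1 within `max_steps` steps,
    i.e. that correct labels ever outnumber synthetic ones for a point."""
    rng = np.random.default_rng(seed)
    steps = np.where(rng.random((trials, max_steps)) < 1 - alpha, 1, -1)
    return float((np.cumsum(steps, axis=1) >= 1).any(axis=1).mean())
```

At α = 0.75 the estimate lands near (1 − 0.75)/0.75 = 1/3, so roughly two thirds of such points are never recovered; for any α < 1/2 the drift is favorable and recovery is essentially certain.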
Universal Algorithms with Non-ERM Strategies
Despite the negative results for ERM, the paper demonstrates the possibility of vanishing error with alternative (non-ERM) strategies:
- The first universal algorithm (Theorem~\ref{thm:vc}) mixes ERM with a random classifier to collect sufficient unbiased labels; it achieves O(t^{−1/4}) convergence, relying on PU (positive-unlabeled) learning theory. However, it requires explicitly deploying a high-error classifier some fraction of the time, which is impractical for real deployments.
- The second approach (Theorem~\ref{thm:vc_known_alpha}) leverages knowledge of α and reformulates the problem as one of learning disagreements between the previous model and the true concept. By reducing contaminated PAC learning to PU learning in the disagreement class, it achieves the more desirable O((nt)^{−1/2}) error decay for arbitrary VC classes and any α < 1. This construction hinges on careful epoch scheduling and access to labeled disagreement information, possibly limiting its immediate operational applicability.
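The debiasing step behind the known-α reduction can be sketched on the 1-D threshold class (all concrete numbers, the uniform marginal, and the grid search below are assumptions for illustration, not the paper's algorithm). Since an observed label agrees with f* with probability 1 − α and with f_{t−1} with probability α, we have E[1{h ≠ y}] = (1 − α)·1{h ≠ f*} + α·1{h ≠ f_{t−1}}; subtracting the computable disagreement with the previous model and rescaling by 1/(1 − α) therefore yields an unbiased estimate of every hypothesis's true risk.

```python
import numpy as np

rng = np.random.default_rng(1)
alpha, theta_star, theta_prev, m = 0.6, 0.6, 0.3, 20_000

x = rng.random(m)                      # uniform features on [0, 1]
true_lab = x >= theta_star             # labels from the true threshold f*
prev_lab = x >= theta_prev             # labels from the previous model f_{t-1}
from_model = rng.random(m) < alpha     # which labels are synthetic
y = np.where(from_model, prev_lab, true_lab)

grid = np.linspace(0.0, 1.0, 401)
pred = x[None, :] >= grid[:, None]                 # h_t(x_i) for every candidate t
observed = (pred != y).mean(axis=1)               # plain contaminated empirical risk
disagree = (pred != prev_lab).mean(axis=1)        # computable: needs only f_{t-1}
debiased = (observed - alpha * disagree) / (1 - alpha)

naive_hat = grid[observed.argmin()]    # plain ERM: pulled toward theta_prev
debias_hat = grid[debiased.argmin()]   # debiased ERM: recovers theta_star
```

With α = 0.6 > 1/2, plain ERM's population minimizer is the previous model's threshold 0.3, while the debiased objective is minimized near the true threshold 0.6. This only illustrates the unbiasedness trick; the paper's actual construction additionally handles the recursive setting via PU learning over the disagreement class and epoch scheduling.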
Numerical and Analytical Highlights
- Variance of the empirical mean estimator is explicitly characterized for all α, with tight bounds showing phase transitions at α=1/2.
- ERM is strictly suboptimal in estimation for any α where synthetic and natural data are mixed.
- For PAC learning, a nonzero asymptotic lower bound is proven for repeated ERM when α > 1/2, for hypothesis classes of any finite VC dimension.
- Universal algorithms achieving vanishing error for finite VC classes and arbitrary contamination rates are explicitly constructed, albeit with computational or practical caveats.
Implications and Future Directions
Practically, these results emphasize that naïve ERM on contaminated training sets—ubiquitous in current LLM retraining and data augmentation practices—can entrench estimation error and impede progress across generations, especially as α increases. The findings caution against undifferentiated use of synthetic data, regardless of the ability of synthetic generators to closely mimic natural data distributions.
Theoretically, this work raises open questions about the explicit construction of MVUEs in the contaminated regime, the extension of the PAC results to the agnostic learning setting, and the optimization of universal learning rates without knowledge of the contamination parameter α. Moreover, the connection to PU learning and disagreement-based algorithms opens avenues for practical semi-supervised and active learning protocols in data environments compromised by pervasive synthesis.
Conclusion
This paper provides a systematic learning-theoretic dissection of the statistical risks associated with recursive learning from mixed natural and synthetic data. It decisively separates the performance of ERM from optimality in both estimation and classification, offering both lower bounds and universal learners to map the landscape of what is achievable in the presence of synthetic data contamination. The implications are significant for iterative AI model development, self-improving systems, and the scalable deployment of generative data augmentation.