Noise–Scale Trade-off in Synthetic Data Generation
Characterize the quantitative trade-off between noise in synthetic data (errors and biases introduced by machine translation and by teacher large language models) and dataset scale in synthetic training pipelines for large language models. Determine how this trade-off affects downstream performance, particularly for low-resource languages.
References
Our method depends on the performance of the translation model and teacher model. It is not well understood where the trade-off between noise and scale lies for synthetic data generation.
— The Art of Asking: Multilingual Prompt Optimization for Synthetic Data
(2510.19806 - Mora et al., 22 Oct 2025) in Section “Analysis”, Subsection “Performance on Lowest-Resource Languages”