Noise–Scale Trade-off in Synthetic Data Generation

Characterize the quantitative trade-off between synthetic data noise (errors and biases introduced by machine translation and teacher large language models) and dataset scale in synthetic training pipelines for large language models, and determine how this trade-off impacts downstream performance, particularly for low-resource languages.

Background

Synthetic data is central to scaling multilingual LLMs in this work, where translated prompts are further optimized via Naturalness, Cultural Adaptation, and Difficulty transformations. The authors note that model performance depends on the quality of the translation and teacher models supplying the synthetic data.
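To make the shape of that pipeline concrete, the following is a minimal, hypothetical sketch: a machine-translated prompt is rewritten by a teacher LLM through the three named transformations in sequence. The instruction templates and the `teacher` stub are illustrative assumptions, not the paper's actual prompts or API.

```python
# Hypothetical sketch of a translate-then-optimize prompt pipeline.
# Templates and the teacher stub are assumptions for illustration only.

TRANSFORMS = {
    "naturalness": (
        "Rewrite this prompt so it reads like a fluent native speaker wrote it:\n{p}"
    ),
    "cultural_adaptation": (
        "Adapt any references in this prompt to suit the target locale:\n{p}"
    ),
    "difficulty": (
        "Rewrite this prompt to be more challenging while preserving its intent:\n{p}"
    ),
}

def teacher(instruction: str) -> str:
    """Placeholder for a teacher-LLM call; swap in a real client here.
    The stub simply echoes back the prompt embedded in the instruction."""
    return instruction.splitlines()[-1]

def optimize_prompt(translated_prompt: str) -> str:
    """Chain the three transformations over a machine-translated prompt."""
    p = translated_prompt
    for template in TRANSFORMS.values():
        p = teacher(template.format(p=p))
    return p

# Usage: the input stands in for a prompt produced by machine translation.
print(optimize_prompt("Explain why the sky is blue."))
```

Each stage in the chain is another pass through the teacher model, which is exactly why the quality of that model bounds the quality of the resulting data.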

In examining the lowest-resource languages, the paper highlights that the balance between increasing the amount of synthetic data and the level of noise inherent in that data is not well understood. Clarifying this relationship is crucial for designing effective multilingual synthetic data pipelines, and for deciding whether to prioritize sheer data quantity or to invest in improving data quality.
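One way to probe where that frontier lies is a grid sweep over dataset size and noise rate. The sketch below is a minimal, hypothetical proxy experiment; the toy classification task, corpus sizes, and noise rates are all illustrative assumptions, not values from the paper. A clean benchmark is held out, training corpora of varying scale are drawn with a controllable fraction of corrupted labels, and downstream accuracy is recorded for each (scale, noise) cell.

```python
# Hypothetical sketch: mapping the noise-scale trade-off with a toy proxy.
# Sizes, noise rates, and the task are assumptions, not from the paper.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)

# One fixed data-generating process; a clean slice serves as the
# held-out benchmark, the rest as a pool of "synthetic" examples.
X_all, y_all = make_classification(
    n_samples=40_000, n_features=20, n_informative=10, random_state=0
)
X_eval, y_eval = X_all[:2_000], y_all[:2_000]
X_pool, y_pool = X_all[2_000:], y_all[2_000:]

def synthetic_corpus(n: int, noise_rate: float):
    """Draw n training examples and flip a noise_rate fraction of labels,
    mimicking errors introduced by translation and teacher models."""
    idx = rng.choice(len(X_pool), size=n, replace=False)
    X, y = X_pool[idx], y_pool[idx].copy()
    flip = rng.random(n) < noise_rate
    y[flip] = 1 - y[flip]
    return X, y

# Sweep dataset scale against noise rate; record downstream accuracy.
results = {}
for n in (500, 2_000, 8_000, 32_000):
    for noise in (0.0, 0.1, 0.2, 0.4):
        X, y = synthetic_corpus(n, noise)
        model = LogisticRegression(max_iter=1_000).fit(X, y)
        results[(n, noise)] = accuracy_score(y_eval, model.predict(X_eval))

for (n, noise), acc in sorted(results.items()):
    print(f"n={n:>6}  noise={noise:.1f}  eval_acc={acc:.3f}")
```

In a real multilingual setting the proxy model would be an LLM fine-tuning run and the noise knob would be the choice of translation and teacher models, but the sweep structure, and the question of which cells trade off against which, is the same.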

References

Our method depends on the performance of the translation model and teacher model. It is not well understood where the trade-off between noise and scale lies for synthetic data generation.

The Art of Asking: Multilingual Prompt Optimization for Synthetic Data (arXiv:2510.19806, Mora et al., 22 Oct 2025), Section “Analysis”, Subsection “Performance on Lowest-Resource Languages”.