Balancing desiderata in large‑scale synthetic data generation

Determine how best to balance the desiderata of synthetic data generation (specifically quality, diversity, and complexity) so that generation meets practical requirements at scale.

Background

The paper frames three core challenges for synthetic data: defining what constitutes good data, designing mechanisms that meet real-world requirements, and conducting generalizable evaluations. Prior work often optimizes only subsets of desired properties, typically along the axes of quality, diversity, and complexity. Despite progress, the authors explicitly acknowledge that deciding how to trade off these desiderata at scale is unresolved.

References

Nevertheless, how to best balance the various desiderata of synthetic data generation at scale remains an open question.

Reasoning-Driven Synthetic Data Generation and Evaluation (2603.29791 - Davidson et al., 31 Mar 2026) in Section 1 (Introduction)