Optimal number of synthetic samples for tabular augmentation

Determine the optimal number of synthetic samples, N_syn, to generate when augmenting tabular datasets for classification tasks.

Background

Choosing how many synthetic samples to generate when augmenting a tabular dataset is nontrivial, and the choice affects both the performance and the stability of the downstream classifier. Prior work often sets the number of synthetic samples equal to the number of real samples. The TabEBM authors observe that this default can be unstable, especially on small datasets, so they fix N_syn to 500 to keep comparisons consistent.

The paper notes that identifying an optimal choice of N_syn is not settled in the literature, motivating the need for a principled determination of this quantity.
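
One common empirical heuristic, which is not the paper's method and not a principled answer to the open question, is to sweep N_syn over a small candidate grid and compare validation scores across several seeds. The sketch below assumes a user-supplied `evaluate(n_syn, seed)` callable; that callable, along with the names `sweep_n_syn` and `pick_n_syn`, is hypothetical and introduced here only for illustration.

```python
import statistics
from typing import Callable, Dict, Iterable, Tuple


def sweep_n_syn(
    candidates: Iterable[int],
    evaluate: Callable[[int, int], float],
    seeds: Iterable[int] = range(5),
) -> Dict[int, Tuple[float, float]]:
    """Return {n_syn: (mean_score, std_score)} computed over the given seeds.

    `evaluate(n_syn, seed)` is an assumed, user-supplied callable. It should
    generate n_syn synthetic rows, train a classifier on real + synthetic
    data, and return a validation score where higher is better.
    """
    seeds = list(seeds)
    results = {}
    for n_syn in candidates:
        scores = [evaluate(n_syn, seed) for seed in seeds]
        # The standard deviation across seeds is a rough proxy for stability.
        results[n_syn] = (statistics.mean(scores), statistics.pstdev(scores))
    return results


def pick_n_syn(results: Dict[int, Tuple[float, float]]) -> int:
    """Pick the candidate with the highest mean score; ties go to the smaller N_syn."""
    return min(results, key=lambda n: (-results[n][0], n))
```

A candidate grid might include both conventions mentioned above, for example `sorted({n_real, 500})` plus a few neighbouring values. The per-candidate standard deviation is worth inspecting alongside the mean, since instability on small datasets is the concern raised in the text.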

References

The optimal $N_{\text{syn}}$ remains an open problem for tabular data.

TabEBM: A Tabular Data Augmentation Method with Distinct Class-Specific Energy-Based Models (2409.16118 - Margeloiu et al., 2024) in Data augmentation setup, Section 3 (Experiments)