Generating Synthetic Data with The Nearest Neighbors Algorithm (2210.00884v1)
Abstract: The $k$ nearest neighbor algorithm ($k$NN) is one of the most popular nonparametric methods, used for purposes such as treatment effect estimation, missing value imputation, classification, and clustering. The main advantage of $k$NN is the simplicity of its hyperparameter optimization: it often produces favorable results with minimal effort. This paper proposes a generic semiparametric (or, if required, nonparametric) approach named Local Resampler (LR). LR uses $k$NN to create subsamples from the original sample and then generates synthetic values drawn from locally estimated distributions. LR can accurately create synthetic samples even if the original sample has a non-convex distribution. Moreover, LR performs as well as or better than other popular synthetic data methods while requiring only minimal model optimization under parametric distributional assumptions.
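The idea described in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the function name `local_resample`, the choice of a Gaussian as the locally estimated distribution, the Euclidean metric, and the small covariance jitter are all assumptions made here for illustration. For each observation, the sketch takes its $k$ nearest neighbors as a subsample, estimates a local mean and covariance, and draws one synthetic value from that local distribution.

```python
import numpy as np


def local_resample(X, k=10, seed=0):
    """Hypothetical sketch: draw one synthetic row per original row
    from a Gaussian fitted to that row's k nearest neighbors."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    n, d = X.shape
    if not 2 <= k <= n:
        raise ValueError("k must satisfy 2 <= k <= number of rows")

    # Pairwise squared Euclidean distances (assumed metric).
    diffs = X[:, None, :] - X[None, :, :]
    dist2 = np.einsum("ijk,ijk->ij", diffs, diffs)

    # Indices of the k nearest rows; each row counts as its own neighbor.
    nn_idx = np.argsort(dist2, axis=1)[:, :k]

    synthetic = np.empty_like(X)
    for i in range(n):
        local = X[nn_idx[i]]  # kNN subsample around row i
        mu = local.mean(axis=0)
        # Local covariance, with a tiny jitter (an assumption) for stability.
        cov = np.atleast_2d(np.cov(local, rowvar=False)) + 1e-9 * np.eye(d)
        synthetic[i] = rng.multivariate_normal(mu, cov)
    return synthetic


# Usage on a non-convex (ring-shaped) sample.
rng = np.random.default_rng(42)
theta = rng.uniform(0.0, 2.0 * np.pi, size=300)
radius = 1.0 + 0.05 * rng.standard_normal(300)
ring = np.column_stack([radius * np.cos(theta), radius * np.sin(theta)])
synth = local_resample(ring, k=10, seed=1)
synth_radius = np.linalg.norm(synth, axis=1)
```

Because each draw depends only on a small neighborhood, the synthetic points in this example should stay close to the ring rather than filling its empty interior, which is what a single global Gaussian fit would do. The computation of all pairwise distances is quadratic in the sample size, so a tree-based neighbor search would be preferable for large samples.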