Diffusion-Driven High-Dimensional Variable Selection

Published 19 Aug 2025 in stat.ME and stat.ML | (2508.13890v1)

Abstract: Variable selection for high-dimensional, highly correlated data has long been a challenging problem, often yielding unstable and unreliable models. We propose a resample-aggregate framework that exploits diffusion models' ability to generate high-fidelity synthetic data. Specifically, we draw multiple pseudo-data sets from a diffusion model fitted to the original data, apply any off-the-shelf selector (e.g., lasso or SCAD), and store the resulting inclusion indicators and coefficients. Aggregating across replicas produces a stable subset of predictors with calibrated stability scores for variable selection. Theoretically, we show that the proposed method is selection consistent under mild assumptions. Because the generative model imports knowledge from large pre-trained weights, the procedure naturally benefits from transfer learning, boosting power when the observed sample is small or noisy. We also extend the framework of aggregating synthetic data to other model selection problems, including graphical model selection, and statistical inference that supports valid confidence intervals and hypothesis tests. Extensive simulations show consistent gains over the lasso, stability selection, and knockoff baselines, especially when predictors are strongly correlated, achieving higher true-positive rates and lower false-discovery proportions. By coupling diffusion-based data augmentation with principled aggregation, our method advances variable selection methodology and broadens the toolkit for interpretable, statistically rigorous analysis in complex scientific applications.