Surrogate Conditional Data Extraction (SIDE)

Updated 9 June 2026

Surrogate Conditional Data Extraction (SIDE) is a framework that converts unconditional modeling tasks into conditional ones using learned surrogate conditions.
It employs strategies such as classifier guidance, conditional GANs, and analytic filtering to extract hidden data features from complex systems.
SIDE methods have demonstrated enhanced performance in generative diffusion models, physical simulation, and invariant feature learning, improving data attribution and uncertainty quantification.

Surrogate Conditional Data Extraction (SIDE) encompasses a class of principled methodologies for transforming unconditional modeling or prediction tasks into conditional ones through the construction or learning of surrogate conditional structures. SIDE methodologies have been developed in a diverse range of domains, including data attribution and extraction in generative modeling, surrogate modeling for high-dimensional dynamical systems, invariant representation learning, and uncertainty quantification. Core to all SIDE variants is the extraction, estimation, or prediction of data or features from complex systems by leveraging surrogate conditional representations—typically when direct observation or explicit conditioning is infeasible.

1. Foundations and Motivation

SIDE is motivated by the problem of extracting, reconstructing, or attributing data or features in complex probabilistic systems where direct conditionalization is unavailable. Classic applications include:

Data Memorization and Attribution: Recovering individual training instances from an unconditional generative model (e.g., diffusion models) by introducing synthetic conditions through surrogate classifiers (Chen et al., 2024, Chen et al., 2024).
Physical Surrogate Modeling: Approximating solutions to physics-constrained systems (e.g., PDE- or SDE-based models) by learning fast, data-driven mappings that emulate the conditional output distribution, typically using GANs or flows (Marcus et al., 2021, Yang et al., 2024).
Hidden State Estimation: Extracting unobserved ("hidden") states from partially observed nonlinear dynamical systems by exploiting analytically tractable conditional distributions (Chen et al., 2021).
Invariant Feature Learning: Extracting representations invariant to nuisance or confounding variables by enforcing surrogate conditional independence (Bounos et al., 24 Dec 2025).

SIDE approaches are typically invoked when explicit conditional mechanisms (labels, conditioning variables, or full observability) are unavailable or unreliable, necessitating the construction of surrogate conditional structures.

2. SIDE in Generative Diffusion Models

In the context of diffusion probabilistic models (DPMs), SIDE has been formulated to recover training data points from unconditional DPMs by converting unconditional sampling into a guided conditional process via surrogate classifiers (Chen et al., 2024, Chen et al., 2024). Existing data extraction attacks on diffusion models are mainly effective for conditional DPMs; SIDE enables targeted extraction even from unconditional models.

Mathematical Framework:

Sampling from an unconditional DPM follows the SDE: $dx = [f(x, t) - g(t)^2 \nabla_x \log p_\theta^t(x)]\, dt + g(t) dw$

SIDE modifies this with a classifier-guided term: $dx = [f(x, t) - g(t)^2 (\nabla_x \log p_\theta^t(x) + \lambda \nabla_x \log p_\theta^t(y_I | x))]\, dt + g(t) dw$ where $y_I$ is a surrogate label from a time-dependent classifier, and $\lambda$ tunes guidance strength.
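The guided reverse SDE above can be illustrated on a toy problem. The following is a minimal sketch, not the authors' implementation: it assumes a one-dimensional two-cluster Gaussian mixture as the data distribution and the forward SDE $dx = dw$ (so $f = 0$, $g = 1$), which makes both the score and the time-dependent classifier available in closed form. All function and variable names are hypothetical, and the analytic classifier stands in for a learned one.

```python
import numpy as np

# Toy stand-in for a trained unconditional DPM: the data distribution is a
# two-component 1-D Gaussian mixture, so the score and the time-dependent
# classifier are available in closed form. All names here are hypothetical.
MEANS = np.array([-2.0, 2.0])    # cluster centres (surrogate classes 0 and 1)
WEIGHTS = np.array([0.5, 0.5])   # mixture weights
VAR0 = 0.25                      # within-cluster variance at t = 0


def responsibilities(x, t):
    """p^t(y | x) for each class under the forward SDE dx = dw (f = 0, g = 1)."""
    var_t = VAR0 + t  # each component is N(mean, VAR0 + t) at time t
    logits = np.log(WEIGHTS) - 0.5 * (x[:, None] - MEANS) ** 2 / var_t
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum(axis=1, keepdims=True)


def score(x, t):
    """Unconditional score: grad_x log p^t(x)."""
    var_t = VAR0 + t
    return (responsibilities(x, t) * (MEANS - x[:, None])).sum(axis=1) / var_t


def classifier_grad(x, t, y):
    """grad_x log p^t(y | x) = grad_x log p^t(x | y) - grad_x log p^t(x)."""
    var_t = VAR0 + t
    return (MEANS[y] - x) / var_t - score(x, t)


def guided_reverse_sde(y, lam, n=2000, t_max=10.0, steps=1000, seed=0):
    """Euler-Maruyama integration of the guided reverse SDE from t_max to 0."""
    rng = np.random.default_rng(seed)
    dt = t_max / steps
    x = rng.normal(0.0, np.sqrt(t_max), size=n)  # approximate prior at t_max
    for i in range(steps):
        t = t_max - i * dt
        guided = score(x, t) + lam * classifier_grad(x, t, y)
        # Stepping backwards in time flips the sign of the drift term
        # [f - g^2 * guided], here with f = 0 and g = 1.
        x = x + guided * dt + np.sqrt(dt) * rng.normal(size=n)
    return x
```

With `lam=0.0` the sampler draws from both clusters in roughly equal proportion, whereas `lam=1.0` with `y=1` concentrates samples around the cluster at $+2$. In this toy setting, $\lambda = 1$ reduces the guided score to the class-conditional score by Bayes' rule.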

Surrogate Label Construction and Algorithmic Steps:

Synthesize images via the unconditional DPM.
Label generated samples with a pre-trained base classifier, obtaining pseudo-labels.
Distill a time-dependent classifier $p_\theta^t(y_I|x_t)$ to match these labels at all diffusion steps.
During extraction, apply the classifier’s gradient guidance to drive sampling toward memorized clusters.
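Steps 1 to 3 above (synthesis, pseudo-labeling, and distillation) can be sketched on toy data. This is an illustrative simplification under stated assumptions: the "unconditional model" is a hypothetical two-cluster sampler, the "base classifier" is a sign threshold, and the time-dependent classifier is approximated by one logistic regression per noise level rather than a single time-conditioned network.

```python
import numpy as np


def sample_unconditional(n, rng):
    """Hypothetical stand-in for step 1: draw samples from an unconditional model."""
    centres = rng.choice([-2.0, 2.0], size=n)
    return centres + 0.5 * rng.normal(size=n)


def base_classifier(x0):
    """Hypothetical stand-in for step 2: a pre-trained classifier on clean samples."""
    return (x0 > 0).astype(float)


def fit_logistic(x, y, lr=0.5, iters=500):
    """Fit p(y = 1 | x) = sigmoid(w * x + b) by gradient descent on the log-loss."""
    w, b = 0.0, 0.0
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-(w * x + b)))
        w -= lr * np.mean((p - y) * x)
        b -= lr * np.mean(p - y)
    return w, b


def distill_time_dependent_classifier(noise_levels, n=4000, seed=0):
    """Step 3: fit one classifier per noise level t to the clean pseudo-labels."""
    rng = np.random.default_rng(seed)
    x0 = sample_unconditional(n, rng)
    pseudo_labels = base_classifier(x0)
    params, accuracy = {}, {}
    for t in noise_levels:
        x_t = x0 + np.sqrt(t) * rng.normal(size=n)  # forward noising, dx = dw
        w, b = fit_logistic(x_t, pseudo_labels)
        params[t] = (w, b)
        accuracy[t] = np.mean(((w * x_t + b) > 0) == (pseudo_labels > 0.5))
    return params, accuracy
```

The fitted slope shrinks as the noise level grows, reflecting that the pseudo-label becomes harder to infer from a noisier $x_t$. The gradient of the resulting log-probability with respect to $x_t$ is what the guidance term in step 4 would consume.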

Performance:

On CelebA datasets, SIDE achieves 50–87% higher average memorization scores (AMS) and unique memorization scores (UMS) than random sampling baselines, with extraction matches verified empirically via similarity measures (Chen et al., 2024, Chen et al., 2024).

3. SIDE in Surrogate Physical and Stochastic Modeling

SIDE has been utilized to replace computationally intensive physical-model-based simulation with fast, conditionally generative surrogates:

Conditional GAN Surrogates for Subsurface Modeling: A conditional GAN (cGAN) is trained to map seismic, tomography, and travel-time inputs to high-fidelity velocity fields matching full-waveform inversion (FWI) outputs. The generator is conditioned on multiple inputs and adversarially trained against a discriminator, with an additional $L_1$ loss to enforce physical fidelity (Marcus et al., 2021).
Conditional Pseudo-Reversible Normalizing Flows (PR-NF): For noisy forward models, a conditional flow is trained to minimize the KL divergence between the empirical and model-conditioned distributions. It maps inputs $x$ and random noise to outputs $y$, efficiently approximating $p(y|x)$ and enabling sampling and uncertainty quantification (Yang et al., 2024).
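The combination of an adversarial term with an $L_1$ term in the cGAN generator objective can be written compactly. The sketch below assumes the common non-saturating adversarial loss and a hypothetical weight `l1_weight`; neither the exact loss form nor the weight value is taken from Marcus et al. (2021).

```python
import numpy as np


def generator_loss(d_fake, y_fake, y_true, l1_weight=100.0, eps=1e-12):
    """Adversarial term plus weighted L1 reconstruction term.

    d_fake    : discriminator probabilities D(x, G(x)) in (0, 1]
    y_fake    : generator outputs G(x)
    y_true    : reference outputs (e.g. FWI velocity fields)
    l1_weight : hypothetical trade-off weight, not taken from the source
    """
    adversarial = -np.mean(np.log(d_fake + eps))  # non-saturating GAN loss
    l1 = np.mean(np.abs(y_fake - y_true))         # pixel-wise fidelity term
    return adversarial + l1_weight * l1
```

A larger `l1_weight` pulls the generator toward the reference fields, while the adversarial term rewards outputs that the discriminator cannot distinguish from real ones.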

Evaluation Metrics and Results:

Representative results from these settings are reported as percent error ($\lesssim 1.5\%$) and SSIM when matching geophysical field data using a cGAN surrogate trained as an FWI extractor (Marcus et al., 2021), and as normalized mean, covariance, and KL errors for conditional PR-NF surrogates on synthetic and field tasks (Yang et al., 2024).
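For concreteness, the two image-level metrics named above can be computed as follows. This is a sketch under assumptions: percent error is taken here to be the relative $L_2$ error in percent, and SSIM is evaluated with a single global window rather than the local sliding window used by standard implementations, so values will differ from those of library routines.

```python
import numpy as np


def percent_error(pred, ref):
    """Relative L2 error in percent (one possible definition of percent error)."""
    return 100.0 * np.linalg.norm(pred - ref) / np.linalg.norm(ref)


def global_ssim(a, b, data_range=1.0, k1=0.01, k2=0.03):
    """Single-window SSIM over the whole image (no local sliding window)."""
    c1, c2 = (k1 * data_range) ** 2, (k2 * data_range) ** 2
    mu_a, mu_b = a.mean(), b.mean()
    var_a, var_b = a.var(), b.var()
    cov_ab = ((a - mu_a) * (b - mu_b)).mean()
    return ((2 * mu_a * mu_b + c1) * (2 * cov_ab + c2)) / (
        (mu_a ** 2 + mu_b ** 2 + c1) * (var_a + var_b + c2)
    )
```

Identical fields give an SSIM of 1, and a uniform 1% scaling of the reference gives a percent error of 1 under this definition.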

4. Analytic SIDE for Conditional Gaussian Nonlinear Systems

In nonlinear stochastic systems with partial observations (Conditional Gaussian Nonlinear Systems, CGNS), SIDE provides analytic extraction of unobserved states using closed-form filtering and smoothing (Chen et al., 2021):

The CGNS formulation splits the full system into observed and hidden components, leveraging exact conditional Gaussianity for posterior inference.
The algorithm consists of filtering to obtain real-time (minimum-MSE) estimates of hidden states, smoothing for fixed-interval posterior means/covariances, and backward sampling for full trajectory extraction.
Each pass is evaluated in closed form, which avoids ensemble or particle approximations; analytic SIDE outperforms ensemble- or particle-based methods in both accuracy and stability for systems with strong nonlinearity.
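The filtering step can be illustrated with a scalar toy system. The sketch below assumes the standard conditional Gaussian structure, with an observed variable $u_1$ driven by $[A_0(u_1) + A_1(u_1) u_2]\,dt + \sigma_1 dW_1$ and a hidden variable $u_2$ driven by $[a_0 + a_1 u_2]\,dt + \sigma_2 dW_2$, and applies an Euler discretization of the closed-form mean and variance equations. The coefficient choices and all names are hypothetical; smoothing and backward sampling are omitted.

```python
import numpy as np


def simulate_and_filter(t_end=50.0, dt=0.005, sigma1=0.1, sigma2=0.5,
                        a0=0.0, a1=-0.5, seed=0,
                        A0=lambda u1: -u1, A1=lambda u1: 1.0):
    """Scalar toy CGNS:
        du1 = [A0(u1) + A1(u1) * u2] dt + sigma1 dW1   (observed)
        du2 = [a0 + a1 * u2] dt + sigma2 dW2           (hidden)
    Returns the hidden path, the filter mean and the filter variance.
    """
    rng = np.random.default_rng(seed)
    n = int(round(t_end / dt))
    u1, u2 = 0.0, 0.0
    mu, R = 0.0, sigma2 ** 2 / (2 * abs(a1))  # start from the stationary prior
    hidden, mean, var = np.empty(n), np.empty(n), np.empty(n)
    for k in range(n):
        # Euler-Maruyama step of the true system.
        du1 = (A0(u1) + A1(u1) * u2) * dt + sigma1 * np.sqrt(dt) * rng.normal()
        du2 = (a0 + a1 * u2) * dt + sigma2 * np.sqrt(dt) * rng.normal()
        # Closed-form conditional Gaussian filter update, using only u1 and du1.
        gain = R * A1(u1) / sigma1 ** 2
        innovation = du1 - (A0(u1) + A1(u1) * mu) * dt
        mu = mu + (a0 + a1 * mu) * dt + gain * innovation
        R = R + (2 * a1 * R + sigma2 ** 2 - gain * A1(u1) * R) * dt
        u1, u2 = u1 + du1, u2 + du2
        hidden[k], mean[k], var[k] = u2, mu, R
    return hidden, mean, var
```

Passing state-dependent callables for `A0` and `A1` makes the observed dynamics nonlinear while the update for the hidden state keeps the same closed form, which is the property the CGNS formulation exploits.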

5. Invariant Feature Extraction by Surrogate Conditioning

SIDE also describes frameworks to extract features invariant to confounding variables by employing surrogate conditional independence constraints, often implemented via optimal transport barycenter relaxations in the linear-Gaussian setting (Bounos et al., 24 Dec 2025):

The SIDE objective seeks features that maximize predictive correlation with the target variable while minimizing dependence on confounders (or their surrogates), formalized as a constrained optimization problem.
The solution is given by the leading eigenvectors of the matrix associated with this objective.
This approach generalizes to non-Gaussian and nonlinear settings via kernelized independence penalties or barycenter estimation.
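The eigenvector-based extraction can be illustrated with a simple linear stand-in. The trade-off matrix used below, $S_{xy} S_{xy}^\top - \gamma\, S_{xc} S_{xc}^\top$, is an assumption chosen for illustration and is not the objective of Bounos et al.; the penalty $\gamma$ (`penalty`), the toy data, and all names are hypothetical.

```python
import numpy as np


def invariant_directions(X, y, c, penalty=5.0, k=1):
    """Leading eigenvectors of  S_xy S_xy^T - penalty * S_xc S_xc^T.

    This trade-off matrix is an illustrative assumption: it rewards covariance
    with the target y and penalises covariance with the confounder c.
    """
    Xc = X - X.mean(axis=0)
    s_xy = Xc.T @ (y - y.mean()) / len(y)  # cross-covariance with the target
    s_xc = Xc.T @ (c - c.mean()) / len(c)  # cross-covariance with the confounder
    M = np.outer(s_xy, s_xy) - penalty * np.outer(s_xc, s_xc)
    eigvals, eigvecs = np.linalg.eigh(M)   # eigenvalues in ascending order
    return eigvecs[:, ::-1][:, :k]         # top-k directions as columns


def make_toy_data(n=5000, seed=0):
    """Hypothetical data: feature 0 tracks y, feature 1 mixes y with c."""
    rng = np.random.default_rng(seed)
    y, c = rng.normal(size=n), rng.normal(size=n)
    X = np.column_stack([y, y + 2.0 * c]) + 0.1 * rng.normal(size=(n, 2))
    return X, y, c
```

Projecting `X` onto the returned direction with `penalty=5.0` yields a feature that tracks `y` while remaining only weakly correlated with `c`; with `penalty=0.0` the leading direction also picks up the confounded feature.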

6. Limitations, Assumptions, and Extensions

SIDE methods are constrained by the quality of surrogate condition construction and model expressivity:

Extracting from unconditional DPMs depends on surrogate classifier calibration and the richness of class partitions. Poor calibration or limited cluster structure in data reduces SIDE’s effectiveness (Chen et al., 2024).
For surrogate models (e.g., GANs or flows), an inability to extrapolate beyond the training distribution, or unmodeled noise or physics, introduces artifacts or inaccuracies (Marcus et al., 2021, Yang et al., 2024).
Analytic extraction (e.g., for CGNS or invariant features) imposes structural (e.g., Gaussianity, linearity, full-rank covariance) requirements whose violation may weaken guarantees or necessitate nonlinear, nonparametric extensions (Bounos et al., 24 Dec 2025, Chen et al., 2021).

Proposed extensions include the incorporation of self-supervised or clustering-based surrogates, physics-informed or PDE-constrained loss functions, and extensions to multimodal, high-dimensional, or sequential data.

7. Representative Algorithms and Metrics

The following table summarizes the principal SIDE usage patterns across domains:

Domain	Surrogate Construction	Extraction Principle
Diffusion Models	Time-dependent classifier	Guided reverse SDE sampling
Physical Surrogates	cGAN/PR-NF conditioned on input	Direct distributional inference
CGNS (Hidden Estimation)	System-theoretic conditionality	Closed-form filtering/smoothing
Invariant Feature Learning	OT barycenter relaxation	Eigenvector extraction

Quantitative evaluation commonly employs domain-appropriate metrics (AMS, UMS for memorization, SSIM/PE for surrogate quality, RMSE/correlation for estimation) with empirical SIDE variants consistently outperforming unconditional or non-surrogate baselines under matched conditions.

SIDE represents a unifying abstraction for extracting data or structure from systems lacking explicit or accessible conditional mechanisms. By synthesizing surrogate conditions—whether via classifiers, model architectures, analytic forms, or optimal transport—SIDE enables tractable and often highly efficient estimation, attribution, and invariant representation in settings where direct conditionalization is unavailable or ill-posed (Chen et al., 2024, Chen et al., 2024, Marcus et al., 2021, Chen et al., 2021, Bounos et al., 24 Dec 2025, Yang et al., 2024).