
HS³F: Heterogeneous Sequential Feature Forest Flow

  • HS³F is a synthetic data generation technique that sequentially models individual features using conditional flows and XGBoost regressors.
  • Its methodology employs RK4 ODE integration for continuous features and multinomial sampling for categorical ones, significantly enhancing speed and fidelity.
  • Empirical benchmarks demonstrate that HS³F reduces Wasserstein distance and achieves 20.8–26.6× speedup over Forest Flow on datasets with high categorical proportions.

Heterogeneous Sequential Feature Forest Flow (HS³F) is an algorithmic method for synthetic tabular data generation that integrates sequential feature-wise modeling with heterogeneous treatment of continuous and categorical variables. Developed as an enhancement over the Forest Flow (FF) approach, HS³F addresses both speed and fidelity limitations in FF by utilizing conditional flows matched with XGBoost regressors for continuous features and direct multinomial sampling for categorical ones. This technique offers improved robustness to ODE initialization and significant computational acceleration, particularly in datasets that contain a high proportion of categorical features (Akazan et al., 2024).

1. Mathematical Framework and Methodological Innovations

HS³F originates from the Independent-Coupling Flow-Matching principle employed in FF, where a vector field

$$v_t(x) = x_1 - x_0$$

pushes a standard Gaussian noise distribution $p_0(x) = \mathcal N(0, I)$ to the empirical data distribution $q_1(x)$ along the ODE

$$\frac{d x_t}{dt} = v_t(x_t) = f(x_t, t), \quad x_{t=0} = x_0 \sim p_0,$$

such that $x_{t=1} \approx x_1$. In FF, the velocity field is estimated with an XGBoost regressor trained on the linear "flow" path $\phi_t(x_0, x_1) = (1-t)x_0 + t x_1$ according to the ICFM loss

$$\mathcal{L}_{\mathrm{ICFM}}(\theta) = \mathbb{E}_{t \sim [0,1],\, (x_0, x_1) \sim p_0 \times q_1} \left\| \hat{f}_\theta(\phi_t) - (x_1 - x_0) \right\|^2.$$
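Concretely, for each time point on a discretized grid the FF-style training set pairs flow-path inputs with velocity targets. Below is a minimal sketch of that construction for a single $t$; the array names, the per-dimension model loop, and all hyperparameters are illustrative assumptions rather than the reference implementation.

```python
# Minimal sketch of ICFM training data at one time point t.
import numpy as np
from xgboost import XGBRegressor

rng = np.random.default_rng(0)
X1 = rng.normal(size=(500, 3))        # stand-in for real data drawn from q_1
X0 = rng.standard_normal(X1.shape)    # noise x_0 ~ N(0, I)

t = 0.5
phi_t = (1.0 - t) * X0 + t * X1       # linear flow path phi_t(x_0, x_1)
target = X1 - X0                      # velocity target v_t(x) = x_1 - x_0

# One single-output regressor per data dimension (a common workaround,
# since multi-output support varies across XGBoost versions).
models = [XGBRegressor(n_estimators=50).fit(phi_t, target[:, j])
          for j in range(X1.shape[1])]
```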

HS³F modifies this by decomposing sample generation into a sequential process in which individual features are generated conditioned on those previously sampled. For each feature $k$, a distinct XGBoost regressor $\hat{f}_t^k$ estimates

$$v_t^k(x_t^k \mid x^1, \dots, x^{k-1}) = x^k - x_0^k.$$

During sampling, given previously generated features $\{\tilde{x}_1^1, \dots, \tilde{x}_1^{k-1}\}$, fresh noise $z^k \sim \mathcal N(0, I)$ is used to solve

$$\frac{d \tilde{x}_t^k}{dt} = \hat{f}_t^k\!\left(\tilde{x}_t^k,\, \tilde{x}_1^1, \dots, \tilde{x}_1^{k-1}\right), \quad \tilde{x}_t^k \big|_{t=0} = z^k, \quad t \in [0, 1].$$

The first feature ($k=1$) relies solely on noise, while subsequent features exploit conditioning, which reduces the distortion resulting from initial-noise mismatches.

HS³F employs a classical 4th-order Runge–Kutta (RK4) ODE solver,

$$\begin{aligned} k_1 &= f(x_i, t_i),\\ k_2 &= f\!\left(x_i + \tfrac{h}{2} k_1,\; t_i + \tfrac{h}{2}\right),\\ k_3 &= f\!\left(x_i + \tfrac{h}{2} k_2,\; t_i + \tfrac{h}{2}\right),\\ k_4 &= f(x_i + h k_3,\; t_i + h),\\ x_{i+1} &= x_i + \tfrac{h}{6}\left(k_1 + 2k_2 + 2k_3 + k_4\right), \end{aligned}$$

whose global error is $O(h^4)$, markedly improving sample fidelity over the $O(h)$ global error of the Euler solver.
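A minimal RK4 integrator over $t \in [0,1]$ might look as follows; `field` is an assumed callable wrapping the learned velocity (for example, the per-time XGBoost regressors), not an interface from the paper.

```python
import numpy as np

def rk4_integrate(field, x0, n_steps=10):
    """Integrate dx/dt = field(x, t) from t=0 to t=1 with classical RK4."""
    x, t = x0.copy(), 0.0
    h = 1.0 / n_steps
    for _ in range(n_steps):
        k1 = field(x, t)
        k2 = field(x + 0.5 * h * k1, t + 0.5 * h)
        k3 = field(x + 0.5 * h * k2, t + 0.5 * h)
        k4 = field(x + h * k3, t + h)
        x = x + (h / 6.0) * (k1 + 2 * k2 + 2 * k3 + k4)
        t += h
    return x  # approximation of x at t=1
```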

2. Sequential Feature-wise Training and Sampling Procedure

The sample-generation algorithm comprises two phases, training and synthesis. During training, for each feature $k$ in a $K$-feature dataset (a code sketch follows the list):

  • Continuous feature: For a grid of time points $\{t_s\}$, noise $x^0 \sim \mathcal N(0, I)$ is sampled and the flow path $x_t = (1-t)x^0 + t x^k$ is constructed; $\hat{f}_t^k$ is trained to predict the increment $x^k - x^0$.
  • Categorical feature: An XGBoost classifier $f^k$ is trained to map $(x^1, \dots, x^{k-1})$ to the class labels $x^k$.
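A hedged sketch of the training phase under these assumptions; `train_hs3f`, `is_categorical`, and the marginal fallback for a leading categorical feature are illustrative choices, not the paper's code.

```python
import numpy as np
from xgboost import XGBClassifier, XGBRegressor

def train_hs3f(X, is_categorical, t_grid, seed=0):
    """X: (n, K) data matrix; returns one entry per feature, in order.
    Categorical labels are assumed already encoded as 0..J_k-1."""
    rng = np.random.default_rng(seed)
    models = []
    for k in range(X.shape[1]):
        cond = X[:, :k]  # features that precede feature k in the ordering
        if is_categorical[k]:
            if k == 0:
                # No conditioning inputs yet: store the empirical marginal.
                vals, counts = np.unique(X[:, 0], return_counts=True)
                models.append(("marginal", (vals, counts / counts.sum())))
            else:
                clf = XGBClassifier(n_estimators=50).fit(cond, X[:, k].astype(int))
                models.append(("cat", clf))
        else:
            per_t = {}
            for t in t_grid:  # one regressor per time point on the grid
                x0 = rng.standard_normal(X.shape[0])   # noise x^0
                xt = (1 - t) * x0 + t * X[:, k]        # flow path at time t
                feats = np.column_stack([xt[:, None], cond])
                per_t[t] = XGBRegressor(n_estimators=50).fit(feats, X[:, k] - x0)
            models.append(("cont", per_t))
    return models
```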

At synthesis time, features are sequentially generated:

  • Continuous: For each feature, draw Gaussian noise, integrate the learned ODE (using either Euler or RK4), and output $\tilde{x}_1^k$.
  • Categorical: Compute class probabilities using the classifier, then sample from the multinomial distribution.

At every step, continuous feature generation is conditioned on already-generated feature values, and categorical features are obtained by multinomial sampling from classifier outputs, as in the synthesis sketch below.
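A companion synthesis sketch, reusing `rk4_integrate` and the model list from the sketches above (all names remain illustrative assumptions):

```python
def sample_hs3f(models, n, t_grid, seed=1):
    """Generate n rows feature by feature, left to right."""
    rng = np.random.default_rng(seed)
    out = np.empty((n, len(models)))
    for k, (kind, model) in enumerate(models):
        cond = out[:, :k]  # columns already generated
        if kind == "marginal":
            vals, p = model
            out[:, k] = rng.choice(vals, size=n, p=p)
        elif kind == "cat":
            proba = model.predict_proba(cond)             # (n, J_k) simplex rows
            u = rng.random((n, 1))
            idx = (u > proba.cumsum(axis=1)).sum(axis=1)  # row-wise multinomial draw
            out[:, k] = model.classes_[idx]
        else:  # continuous: integrate the learned conditional ODE
            def field(x, t, per_t=model, cond=cond):
                s = min(per_t, key=lambda v: abs(v - t))  # nearest trained time
                return per_t[s].predict(np.column_stack([x[:, None], cond]))
            out[:, k] = rk4_integrate(field, rng.standard_normal(n),
                                      n_steps=len(t_grid))
    return out
```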

3. Heterogeneous Treatment of Categorical Variables

HS³F introduces direct modeling for categorical features. For each categorical feature $k$ with $J_k$ categories, a classifier is trained:

$$f^{(k)}:\mathbb R^{k-1} \to \Delta^{J_k-1}, \quad f^{(k)}(x^1,\dots,x^{k-1}) = (p_{k,1}, \dots, p_{k,J_k}), \quad \sum_j p_{k,j}=1.$$

Sampling is performed as

$$\tilde{x}^k \sim \mathrm{Categorical}(p_{k,1},\dots,p_{k,J_k}).$$
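Isolating this step from the sequential loop, the classifier-to-simplex mapping and the categorical draw can be exercised directly; the sketch below uses hypothetical toy data, and the inverse-CDF trick draws one category per row.

```python
import numpy as np
from xgboost import XGBClassifier

rng = np.random.default_rng(0)
X_cond = rng.normal(size=(400, 2))   # stand-ins for x^1, ..., x^{k-1}
y = (X_cond[:, 0] + rng.normal(scale=0.5, size=400) > 0).astype(int)

clf = XGBClassifier(n_estimators=50).fit(X_cond, y)
proba = clf.predict_proba(X_cond)    # each row is (p_{k,1}, ..., p_{k,J_k})
u = rng.random((len(proba), 1))
draws = clf.classes_[(u > proba.cumsum(axis=1)).sum(axis=1)]
```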

By contrast, FF embeds categorical variables as one-hot continuous vectors and regresses over these expanded targets, which increases dimensionality and reduces efficiency on discrete variables. On datasets with $\geq 20\%$ categorical inputs, such as blood_transfusion, congress_voting, car, tic_tac_toe, and glass, HS³F achieves a 20.8–26.6× speedup relative to FF and a substantial reduction in Wasserstein distance ($W_{tr}$ decreases from 1.064 to 0.596 for HS³F-Euler).

4. Robustness and Theoretical Properties

Sensitivity analysis indicates that joint FF approaches are highly dependent on the initial noise distribution used for ODE integration. Perturbations such as shifting or scaling $p_0$ from $\mathcal N(0, I)$ to $\mathcal N(\mu, b^2 I)$ can yield error accumulation and increased $W_1$.

HS³F’s sequential, conditional design mitigates this issue. Only the first feature is fully susceptible to initial noise; subsequent features integrate information from previously generated features, introducing a stabilizing feedback mechanism.

| Init $p_0$ | $\Delta W_{tr}$ (HS³F-Rg4) | $\Delta W_{tr}$ (CS3F-Rg4) | $\Delta W_{tr}$ (FF) |
|---|---|---|---|
| $\mathcal N(0.1,\,1.1^2 I)$ | 0.0085 | 0.0001 | 0.1462 |
| $\mathcal N(0,\,0.9^2 I)$ | 0.0018 | 0.0006 | 0.0300 |
| $\mathcal N(0,\,1.1^2 I)$ | 0.0028 | 0.0007 | 0.0510 |

HS³F-Rg4 exhibits an order-of-magnitude reduction in $\Delta W_{tr}$ versus FF, confirming strong stability to affine transformations of the ODE initial condition.
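This check can be approximated as below, using a per-feature average of exact 1-D Wasserstein distances as a simplified proxy for the paper's $W_{tr}$; `sample_fn`, mapping initial noise to generated samples, is an assumed interface.

```python
import numpy as np
from scipy.stats import wasserstein_distance

def w1_proxy(A, B):
    # mean of exact 1-D W_1 distances, feature by feature (simplified proxy)
    return np.mean([wasserstein_distance(A[:, j], B[:, j])
                    for j in range(A.shape[1])])

def delta_w(sample_fn, X_train, mu=0.1, b=1.1, n=1000, seed=0):
    rng = np.random.default_rng(seed)
    z = rng.standard_normal((n, X_train.shape[1]))
    base = sample_fn(z)              # init from N(0, I)
    pert = sample_fn(mu + b * z)     # init from N(mu, b^2 I)
    return abs(w1_proxy(pert, X_train) - w1_proxy(base, X_train))
```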

5. Empirical Benchmarks and Quantitative Analysis

Evaluation spans 25 standard tabular datasets from the UCI and scikit-learn repositories, comprising both regression and classification tasks with varied proportions of categorical features. Five datasets contain $\geq 20\%$ categorical inputs.

Key evaluation metrics include the train and test Wasserstein-1 distances ($W_{tr}$, $W_{te}$), F1 scores ($F1_{fake}$, $F1_{comb}$), regression $R^2$ scores ($R^2_{fake}$, $R^2_{comb}$), coverage ($\mathrm{cov}_{tr}$, $\mathrm{cov}_{te}$), and synthesis time.
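As one concrete reading of the utility metrics, $F1_{fake}$ is typically obtained by training a classifier on synthetic data and scoring it on the real test split; the model choice and averaging below are illustrative assumptions, not the benchmark's exact protocol.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score

def f1_fake(X_syn, y_syn, X_test, y_test):
    """Fit on synthetic data only, evaluate on the held-out real test set."""
    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    clf.fit(X_syn, y_syn)
    return f1_score(y_test, clf.predict(X_test), average="macro")
```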

| Model | $W_{tr}\downarrow$ | $F1_{fake}\uparrow$ | $\mathrm{cov}_{tr}\uparrow$ | $\mathrm{time}\downarrow$ |
|---|---|---|---|---|
| HS³F-Euler | 1.283 | 0.738 | 0.787 | 278.4 |
| HS³F-Rg4 | 1.233 | 0.741 | 0.819 | 331.1 |
| Forest Flow | 1.356 | 0.723 | 0.839 | 1073.8 |

For the high-categorical subset, HS³F variants achieve $W_{tr}=0.584$–$0.596$ and are 20.8–26.6× faster than FF. On the full benchmark, HS³F-Rg4 attains the lowest $W_{tr}$ and the highest $F1_{fake}$, while generating samples 3–4× faster than FF.

6. Ablation Experiments and Identified Limitations

Ablation results isolate the contribution of each of HS³F's components:

  • Sequential (autoregressive) generation: CS3F-Euler (sequential per-feature flows without the heterogeneous categorical treatment) improves $W_{tr}$ from 1.349 to 1.283 relative to FF, indicating a moderate benefit from autoregression alone.
  • RK4 integration: Switching CS3F-Euler to CS3F-Rg4 slightly degrades $W_{tr}$, whereas for HS³F the Euler-to-Rg4 switch improves $W_{tr}$ from 1.283 to 1.233, demonstrating that solver choice interacts strongly with heterogeneity.
  • Multinomial sampling for categories: The transition from CS3F-Euler ($W_{tr}=0.926$) to HS³F-Euler ($W_{tr}=0.596$) on high-categorical datasets underscores the substantial improvement from direct categorical modeling.

Limitations include sequential dependence: spurious early-feature correlations may propagate errors downstream. The implementation does not automatically order features causally; improvements may be achievable by incorporating feature ordering or grouping heuristics. Additionally, extending sequential flows to handle mixed continuous/categorical variables or fully differentiable discrete flows constitutes a plausible direction for future work.

7. Summary and Implications

HS³F combines per-feature conditional flow matching, RK4 ODE integration, and direct multinomial sampling for categorical features. These innovations result in synthetic tabular data that exhibit enhanced quality, robustness to ODE initialization shifts, and significant speed advantages, particularly in categorical-rich settings. A plausible implication is the accelerated generation of high-fidelity synthetic tabular data for privacy-preserving machine learning and regulatory compliance applications (Akazan et al., 2024).
