
HS³F: Heterogeneous Sequential Feature Forest Flow

  • HS³F is a synthetic data generation technique that sequentially models individual features using conditional flows and XGBoost regressors.
  • Its methodology employs RK4 ODE integration for continuous features and multinomial sampling for categorical ones, significantly enhancing speed and fidelity.
  • Empirical benchmarks demonstrate that HS³F reduces Wasserstein distance and achieves 20.8–26.6× speedup over Forest Flow on datasets with high categorical proportions.

Heterogeneous Sequential Feature Forest Flow (HS³F) is an algorithmic method for synthetic tabular data generation that integrates sequential feature-wise modeling with heterogeneous treatment of continuous and categorical variables. Developed as an enhancement over the Forest Flow (FF) approach, HS³F addresses both speed and fidelity limitations in FF by utilizing conditional flows matched with XGBoost regressors for continuous features and direct multinomial sampling for categorical ones. This technique offers improved robustness to ODE initialization and significant computational acceleration, particularly in datasets that contain a high proportion of categorical features (Akazan et al., 2024).

1. Mathematical Framework and Methodological Innovations

HS³F originates from the Independent-Coupling Flow-Matching principle employed in FF, where a vector field

$$v_t(x) = x_1 - x_0$$

pushes a standard Gaussian noise distribution $p_0(x) = \mathcal N(0, I)$ to the empirical data distribution $q_1(x)$ along the ODE

$$\frac{d x_t}{dt} = v_t(x_t) = f(x_t, t), \quad x_{t=0} = x_0 \sim p_0,$$

such that $x_{t=1} \approx x_1$. In FF, the velocity field is estimated with an XGBoost regressor trained on the linear "flow" path $\phi_t(x_0, x_1) = (1-t)x_0 + t x_1$ according to the ICFM loss

$$\mathcal{L}_{\mathrm{ICFM}}(\theta) = \mathbb{E}_{t \sim [0,1],\, (x_0, x_1) \sim p_0 \times q_1} \left\| \hat{f}_\theta(\phi_t) - (x_1 - x_0) \right\|^2.$$
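Concretely, for each time point on a discretized grid the FF-style training set pairs flow-path inputs with velocity targets. Below is a minimal sketch of that construction for a single $t$; the array names, the per-dimension model loop, and all hyperparameters are illustrative assumptions rather than the reference implementation.

```python
# Minimal sketch of ICFM training data at one time point t.
import numpy as np
from xgboost import XGBRegressor

rng = np.random.default_rng(0)
X1 = rng.normal(size=(500, 3))        # stand-in for real data drawn from q_1
X0 = rng.standard_normal(X1.shape)    # noise x_0 ~ N(0, I)

t = 0.5
phi_t = (1.0 - t) * X0 + t * X1       # linear flow path phi_t(x_0, x_1)
target = X1 - X0                      # velocity target v_t(x) = x_1 - x_0

# One single-output regressor per data dimension (a common workaround,
# since multi-output support varies across XGBoost versions).
models = [XGBRegressor(n_estimators=50).fit(phi_t, target[:, j])
          for j in range(X1.shape[1])]
```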

HS³F modifies this by decomposing sample generation into a sequential process in which individual features are generated conditioned on those previously sampled. For each feature $k$, a distinct XGBoost regressor $\hat{f}_t^k$ estimates

$$v_t^k(x_t^k \mid x^1, \dots, x^{k-1}) = x^k - x_0^k.$$

During sampling, given previously generated features $\{\tilde{x}_1^1, \dots, \tilde{x}_1^{k-1}\}$, fresh noise $z^k \sim \mathcal N(0, I)$ is used to solve

$$\frac{d \tilde{x}_t^k}{dt} = \hat{f}_t^k\!\left(\tilde{x}_t^k,\, \tilde{x}_1^1, \dots, \tilde{x}_1^{k-1}\right), \quad \tilde{x}_t^k \big|_{t=0} = z^k, \quad t \in [0, 1].$$

The first feature ($k=1$) relies solely on noise, while subsequent features exploit conditioning, which reduces the distortion resulting from initial-noise mismatches.

HS³F employs a classical 4th-order Runge–Kutta (RK4) ODE solver,

$$\begin{aligned} k_1 &= f(x_i, t_i),\\ k_2 &= f\!\left(x_i + \tfrac{h}{2} k_1,\; t_i + \tfrac{h}{2}\right),\\ k_3 &= f\!\left(x_i + \tfrac{h}{2} k_2,\; t_i + \tfrac{h}{2}\right),\\ k_4 &= f(x_i + h k_3,\; t_i + h),\\ x_{i+1} &= x_i + \tfrac{h}{6}\left(k_1 + 2k_2 + 2k_3 + k_4\right), \end{aligned}$$

whose global error is $O(h^4)$, markedly improving sample fidelity over the $O(h)$ global error of the Euler solver.
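A minimal RK4 integrator over $t \in [0,1]$ might look as follows; `field` is an assumed callable wrapping the learned velocity (for example, the per-time XGBoost regressors), not an interface from the paper.

```python
import numpy as np

def rk4_integrate(field, x0, n_steps=10):
    """Integrate dx/dt = field(x, t) from t=0 to t=1 with classical RK4."""
    x, t = x0.copy(), 0.0
    h = 1.0 / n_steps
    for _ in range(n_steps):
        k1 = field(x, t)
        k2 = field(x + 0.5 * h * k1, t + 0.5 * h)
        k3 = field(x + 0.5 * h * k2, t + 0.5 * h)
        k4 = field(x + h * k3, t + h)
        x = x + (h / 6.0) * (k1 + 2 * k2 + 2 * k3 + k4)
        t += h
    return x  # approximation of x at t=1
```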

2. Sequential Feature-wise Training and Sampling Procedure

The sample-generation algorithm comprises two phases, training and synthesis. During training, for each feature $k$ in a $K$-feature dataset (a code sketch follows the list):

  • Continuous feature: For a grid of time points $\{t_s\}$, noise $x^0 \sim \mathcal N(0, I)$ is sampled and the flow path $x_t = (1-t)x^0 + t x^k$ is constructed; $\hat{f}_t^k$ is trained to predict the increment $x^k - x^0$.
  • Categorical feature: An XGBoost classifier $f^k$ is trained to map $(x^1, \dots, x^{k-1})$ to the class labels $x^k$.
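A hedged sketch of the training phase under these assumptions; `train_hs3f`, `is_categorical`, and the marginal fallback for a leading categorical feature are illustrative choices, not the paper's code.

```python
import numpy as np
from xgboost import XGBClassifier, XGBRegressor

def train_hs3f(X, is_categorical, t_grid, seed=0):
    """X: (n, K) data matrix; returns one entry per feature, in order.
    Categorical labels are assumed already encoded as 0..J_k-1."""
    rng = np.random.default_rng(seed)
    models = []
    for k in range(X.shape[1]):
        cond = X[:, :k]  # features that precede feature k in the ordering
        if is_categorical[k]:
            if k == 0:
                # No conditioning inputs yet: store the empirical marginal.
                vals, counts = np.unique(X[:, 0], return_counts=True)
                models.append(("marginal", (vals, counts / counts.sum())))
            else:
                clf = XGBClassifier(n_estimators=50).fit(cond, X[:, k].astype(int))
                models.append(("cat", clf))
        else:
            per_t = {}
            for t in t_grid:  # one regressor per time point on the grid
                x0 = rng.standard_normal(X.shape[0])   # noise x^0
                xt = (1 - t) * x0 + t * X[:, k]        # flow path at time t
                feats = np.column_stack([xt[:, None], cond])
                per_t[t] = XGBRegressor(n_estimators=50).fit(feats, X[:, k] - x0)
            models.append(("cont", per_t))
    return models
```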

At synthesis time, features are sequentially generated:

  • Continuous: For each feature, draw Gaussian noise, integrate the learned ODE (using either Euler or RK4), and output $\tilde{x}_1^k$.
  • Categorical: Compute class probabilities using the classifier, then sample from the multinomial distribution.

At every step, continuous feature generation is conditioned on already-generated feature values, and categorical features are obtained by multinomial sampling from classifier outputs, as in the synthesis sketch below.
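A companion synthesis sketch, reusing `rk4_integrate` and the model list from the sketches above (all names remain illustrative assumptions):

```python
def sample_hs3f(models, n, t_grid, seed=1):
    """Generate n rows feature by feature, left to right."""
    rng = np.random.default_rng(seed)
    out = np.empty((n, len(models)))
    for k, (kind, model) in enumerate(models):
        cond = out[:, :k]  # columns already generated
        if kind == "marginal":
            vals, p = model
            out[:, k] = rng.choice(vals, size=n, p=p)
        elif kind == "cat":
            proba = model.predict_proba(cond)             # (n, J_k) simplex rows
            u = rng.random((n, 1))
            idx = (u > proba.cumsum(axis=1)).sum(axis=1)  # row-wise multinomial draw
            out[:, k] = model.classes_[idx]
        else:  # continuous: integrate the learned conditional ODE
            def field(x, t, per_t=model, cond=cond):
                s = min(per_t, key=lambda v: abs(v - t))  # nearest trained time
                return per_t[s].predict(np.column_stack([x[:, None], cond]))
            out[:, k] = rk4_integrate(field, rng.standard_normal(n),
                                      n_steps=len(t_grid))
    return out
```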

3. Heterogeneous Treatment of Categorical Variables

HS³F introduces direct modeling for categorical features. For each categorical feature $k$ with $J_k$ categories, a classifier is trained:

$$f^{(k)}:\mathbb R^{k-1} \to \Delta^{J_k-1}, \quad f^{(k)}(x^1,\dots,x^{k-1}) = (p_{k,1}, \dots, p_{k,J_k}), \quad \sum_j p_{k,j}=1.$$

Sampling is performed as

$$\tilde{x}^k \sim \mathrm{Categorical}(p_{k,1},\dots,p_{k,J_k}).$$
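Isolating this step from the sequential loop, the classifier-to-simplex mapping and the categorical draw can be exercised directly; the sketch below uses hypothetical toy data, and the inverse-CDF trick draws one category per row.

```python
import numpy as np
from xgboost import XGBClassifier

rng = np.random.default_rng(0)
X_cond = rng.normal(size=(400, 2))   # stand-ins for x^1, ..., x^{k-1}
y = (X_cond[:, 0] + rng.normal(scale=0.5, size=400) > 0).astype(int)

clf = XGBClassifier(n_estimators=50).fit(X_cond, y)
proba = clf.predict_proba(X_cond)    # each row is (p_{k,1}, ..., p_{k,J_k})
u = rng.random((len(proba), 1))
draws = clf.classes_[(u > proba.cumsum(axis=1)).sum(axis=1)]
```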

By contrast, FF embeds categorical variables as one-hot continuous vectors and regresses over these expanded targets, which increases dimensionality and reduces efficiency on discrete variables. On datasets with $\geq 20\%$ categorical inputs, such as blood_transfusion, congress_voting, car, tic_tac_toe, and glass, HS³F achieves a 20.8–26.6× speedup relative to FF and a substantial reduction in Wasserstein distance ($W_{tr}$ decreases from 1.064 to 0.596 for HS³F-Euler).

4. Robustness and Theoretical Properties

Sensitivity analysis indicates that joint FF approaches are highly dependent on the initial noise distribution used for ODE integration. Perturbations such as shifting or scaling $p_0$ from $\mathcal N(0, I)$ to $\mathcal N(\mu, b^2 I)$ can yield error accumulation and increased $W_1$.

HS³F’s sequential, conditional design mitigates this issue. Only the first feature is fully susceptible to initial noise; subsequent features integrate information from previously generated features, introducing a stabilizing feedback mechanism.

| Init $p_0$ | $\Delta W_{tr}$ (HS³F-Rg4) | $\Delta W_{tr}$ (CS3F-Rg4) | $\Delta W_{tr}$ (FF) |
|---|---|---|---|
| $\mathcal N(0.1,\,1.1^2 I)$ | 0.0085 | 0.0001 | 0.1462 |
| $\mathcal N(0,\,0.9^2 I)$ | 0.0018 | 0.0006 | 0.0300 |
| $\mathcal N(0,\,1.1^2 I)$ | 0.0028 | 0.0007 | 0.0510 |

HS³F-Rg4 exhibits an order-of-magnitude reduction in $\Delta W_{tr}$ versus FF, confirming strong stability to affine transformations of the ODE initial condition.
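This check can be approximated as below, using a per-feature average of exact 1-D Wasserstein distances as a simplified proxy for the paper's $W_{tr}$; `sample_fn`, mapping initial noise to generated samples, is an assumed interface.

```python
import numpy as np
from scipy.stats import wasserstein_distance

def w1_proxy(A, B):
    # mean of exact 1-D W_1 distances, feature by feature (simplified proxy)
    return np.mean([wasserstein_distance(A[:, j], B[:, j])
                    for j in range(A.shape[1])])

def delta_w(sample_fn, X_train, mu=0.1, b=1.1, n=1000, seed=0):
    rng = np.random.default_rng(seed)
    z = rng.standard_normal((n, X_train.shape[1]))
    base = sample_fn(z)              # init from N(0, I)
    pert = sample_fn(mu + b * z)     # init from N(mu, b^2 I)
    return abs(w1_proxy(pert, X_train) - w1_proxy(base, X_train))
```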

5. Empirical Benchmarks and Quantitative Analysis

Evaluation spans 25 standard tabular datasets from the UCI and scikit-learn repositories, comprising both regression and classification tasks with varied proportions of categorical features. Five datasets contain $\geq 20\%$ categorical inputs.

Key evaluation metrics include the train and test Wasserstein-1 distances ($W_{tr}$, $W_{te}$), F1 scores ($F1_{fake}$, $F1_{comb}$), regression $R^2$ scores ($R^2_{fake}$, $R^2_{comb}$), coverage ($\mathrm{cov}_{tr}$, $\mathrm{cov}_{te}$), and synthesis time.
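As one concrete reading of the utility metrics, $F1_{fake}$ is typically obtained by training a classifier on synthetic data and scoring it on the real test split; the model choice and averaging below are illustrative assumptions, not the benchmark's exact protocol.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score

def f1_fake(X_syn, y_syn, X_test, y_test):
    """Fit on synthetic data only, evaluate on the held-out real test set."""
    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    clf.fit(X_syn, y_syn)
    return f1_score(y_test, clf.predict(X_test), average="macro")
```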

| Model | $W_{tr}\downarrow$ | $F1_{fake}\uparrow$ | $\mathrm{cov}_{tr}\uparrow$ | $\mathrm{time}\downarrow$ |
|---|---|---|---|---|
| HS³F-Euler | 1.283 | 0.738 | 0.787 | 278.4 |
| HS³F-Rg4 | 1.233 | 0.741 | 0.819 | 331.1 |
| Forest Flow | 1.356 | 0.723 | 0.839 | 1073.8 |

For the high-categorical subset, HS³F variants achieve $W_{tr}=0.584$–$0.596$ and are 20.8–26.6× faster than FF. On the full benchmark, HS³F-Rg4 attains the lowest $W_{tr}$ and the highest $F1_{fake}$, while generating samples 3–4× faster than FF.

6. Ablation Experiments and Identified Limitations

Ablation results isolate the contribution of each of HS³F's components:

  • Sequential (autoregressive) generation: CS3F-Euler (sequential per-feature flows without the heterogeneous categorical treatment) improves $W_{tr}$ from 1.349 to 1.283 relative to FF, indicating a moderate benefit from autoregression alone.
  • RK4 integration: Switching CS3F-Euler to CS3F-Rg4 slightly degrades $W_{tr}$, whereas for HS³F the Euler-to-Rg4 switch improves $W_{tr}$ from 1.283 to 1.233, demonstrating that solver choice interacts strongly with heterogeneity.
  • Multinomial sampling for categories: The transition from CS3F-Euler ($W_{tr}=0.926$) to HS³F-Euler ($W_{tr}=0.596$) on high-categorical datasets underscores the substantial improvement from direct categorical modeling.

Limitations include sequential dependence: spurious early-feature correlations may propagate errors downstream. The implementation does not automatically order features causally; improvements may be achievable by incorporating feature ordering or grouping heuristics. Additionally, extending sequential flows to handle mixed continuous/categorical variables or fully differentiable discrete flows constitutes a plausible direction for future work.

7. Summary and Implications

HS³F combines per-feature conditional flow matching, RK4 ODE integration, and direct multinomial sampling for categorical features. These innovations result in synthetic tabular data that exhibit enhanced quality, robustness to ODE initialization shifts, and significant speed advantages, particularly in categorical-rich settings. A plausible implication is the accelerated generation of high-fidelity synthetic tabular data for privacy-preserving machine learning and regulatory compliance applications (Akazan et al., 2024).
