
AttentionForest: Transformer Tabular Oversampling

Updated 20 December 2025
  • AttentionForest is a transformer-based autoencoder architecture that employs multi-head self-attention to capture high-order feature interactions in tabular data.
  • It integrates a latent-space tree-driven diffusion framework using gradient-boosted trees and conditional flow matching to generate realistic and privacy-aware minority-class samples.
  • Empirical evaluations demonstrate enhanced minority recall, low Wasserstein distance for sample realism, and competitive privacy metrics across healthcare, finance, and manufacturing datasets.

AttentionForest is a transformer-based autoencoder architecture integrated into a latent-space, tree-driven diffusion framework for minority-class oversampling in tabular data. Distinguished from related approaches such as PCAForest and EmbedForest by its attention-augmented embedding, AttentionForest leverages multi-head self-attention to encode high-order feature interactions in a compact latent space. This architecture is combined with a continuous-time diffusion process modeled via gradient-boosted trees (GBTs) and conditional flow matching (CFM), providing a mechanism to synthesize realistic, privacy-aware samples under severe class imbalance. Across multiple benchmark datasets in healthcare, finance, and manufacturing, AttentionForest achieves superior minority recall, robust sample realism (low Wasserstein distance), and competitive privacy metrics, marking it as a high-fidelity tabular data augmentation method (Ihsan et al., 20 Nov 2025).

1. Problem Formulation and Motivation

Class imbalance is pervasive in domains such as defect detection, fraud detection, and rare disease prediction, where the minority class critically drives predictive utility. Conventional resampling methods, including random undersampling and SMOTE, often introduce bias or artifacts by discarding majority samples or interpolating between minority instances, leading to overfitting and information loss. Generative models (GANs, VAEs, diffusion) have somewhat alleviated this, but typically struggle with heterogeneous tabular feature types, are computationally intensive, and may inadvertently compromise privacy due to high-fidelity sample synthesis. AttentionForest addresses these limitations by synthesizing minority-class samples in a latent space designed to preserve tabular structure, enhance computational efficiency, and limit privacy risk (Ihsan et al., 20 Nov 2025).

2. Latent-Space Tree-Driven Diffusion Framework

AttentionForest is one of three variants in the latent-space tree-driven diffusion family. Samples are first embedded into a low-dimensional latent space; AttentionForest utilizes a transformer-based autoencoder, in contrast to the linear (PCAForest) or shallow nonlinear (EmbedForest) alternatives. Within this latent space, synthetic generation occurs through a reverse diffusion process, with GBTs learning the continuous vector field $v_\theta(t,x)$ under CFM. Generation begins from noise $x(T)\sim\mathcal N(0,I)$ and integrates the learned ordinary differential equation backward to $x(0)$, followed by decoding to the original feature space. This architecture enables compact per-sample computation while retaining fidelity in feature interaction modeling (Ihsan et al., 20 Nov 2025).

| Model Variant   | Encoder Type   | Downstream Utility   |
|-----------------|----------------|----------------------|
| PCAForest       | Linear PCA     | Fast, lower recall   |
| EmbedForest     | Nonlinear AE   | Intermediate utility |
| AttentionForest | Transformer AE | Highest recall, F1   |

Editor's term: AE = autoencoder.
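The reverse-time generation step described above can be sketched as follows. This is a minimal illustration, assuming a vector-field model fitted as described later (Section 4) that exposes a scikit-learn-style `.predict()` over inputs `[t, x]`; the function name, Euler integrator, and step count are illustrative choices, not specified by the paper.

```python
# Minimal sketch of reverse-ODE sampling in the latent space, assuming a fitted
# vector-field model `v_theta` with a scikit-learn-style .predict() over [t, x].
# Euler integration and the step count are illustrative assumptions.
import numpy as np

def sample_latents(v_theta, n_samples, d_latent, n_steps=100, seed=0):
    """Integrate dx/dt = v_theta(t, x) backward from t=1 (noise) to t=0 (data)."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((n_samples, d_latent))   # x(T) ~ N(0, I)
    dt = 1.0 / n_steps
    for step in range(n_steps):
        t = 1.0 - step * dt                           # current time, from 1 down toward 0
        t_col = np.full((n_samples, 1), t)
        v = v_theta.predict(np.hstack([t_col, x]))    # predicted vector field v_theta(t, x)
        x = x - dt * v                                # Euler step toward t = 0
    return x                                          # approximate x(0): synthetic latents
```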

3. Attention-Augmented Embedding Architecture

The AttentionForest encoder tokenizes tabular features as follows: numerical features $x_{\text{num},i}$ are linearly projected via $E_{\text{num},i} = W_{\text{num},i}\,x_{\text{num},i} + b_{\text{num},i}$; categorical features $x_{\text{cat},j}$ use learned embedding tables, $E_{\text{cat},j} = \mathrm{Embedding}(x_{\text{cat},j})$. Sinusoidal positional encodings $P$ are added to maintain feature ordering, yielding $z_0 = [E_{\text{cat}} \,\|\, E_{\text{num}}] + P$. The transformer encoder comprises $L$ stacked layers with multi-head self-attention and feed-forward blocks:

$$\text{head}_k = \mathrm{softmax}\bigl(Q_k K_k^\top / \sqrt{d_k}\bigr)\, V_k$$

where $Q_k$, $K_k$, $V_k$ are linear projections of $z$ per head; multi-head outputs are concatenated and linearly projected to form the latent $z$. After $L$ layers, the output is $L_0 \in \mathbb{R}^{n_{\text{samples}} \times d_{\text{latent}}}$ (Ihsan et al., 20 Nov 2025).
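A minimal PyTorch sketch of this tokenization and attention stack is given below; the module names, dimensions, layer counts, and the final pooling into the latent vector are illustrative assumptions rather than the authors' exact architecture.

```python
# Illustrative PyTorch sketch of the attention-augmented encoder: per-feature
# tokens, sinusoidal positions, and stacked self-attention layers. All sizes
# and the latent pooling are assumptions, not the paper's exact configuration.
import math
import torch
import torch.nn as nn

class TabularAttentionEncoder(nn.Module):
    def __init__(self, n_num, cat_cardinalities, d_token=32, d_latent=16,
                 n_layers=2, n_heads=4):
        super().__init__()
        # One linear projection per numerical feature: E_num,i = W_i x_i + b_i
        self.num_proj = nn.ModuleList([nn.Linear(1, d_token) for _ in range(n_num)])
        # One learned embedding table per categorical feature
        self.cat_emb = nn.ModuleList([nn.Embedding(c, d_token) for c in cat_cardinalities])
        n_tokens = n_num + len(cat_cardinalities)
        # Fixed sinusoidal positional encodings P over the feature ordering
        pos = torch.arange(n_tokens).unsqueeze(1).float()
        div = torch.exp(torch.arange(0, d_token, 2).float() * (-math.log(10000.0) / d_token))
        pe = torch.zeros(n_tokens, d_token)
        pe[:, 0::2] = torch.sin(pos * div)
        pe[:, 1::2] = torch.cos(pos * div)
        self.register_buffer("pos_enc", pe)
        layer = nn.TransformerEncoderLayer(d_model=d_token, nhead=n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        # Pool the per-feature token outputs into a compact latent vector (assumed pooling)
        self.to_latent = nn.Linear(n_tokens * d_token, d_latent)

    def forward(self, x_num, x_cat):
        cat_tokens = [emb(x_cat[:, j]) for j, emb in enumerate(self.cat_emb)]
        num_tokens = [proj(x_num[:, i:i + 1]) for i, proj in enumerate(self.num_proj)]
        z0 = torch.stack(cat_tokens + num_tokens, dim=1) + self.pos_enc  # [E_cat || E_num] + P
        z = self.encoder(z0)                      # L layers of multi-head self-attention
        return self.to_latent(z.flatten(1))       # latent L_0 of shape (n_samples, d_latent)
```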

4. Conditional Flow Matching and Synthetic Sample Generation

Diffusion in the latent space is formulated as a forward stochastic differential equation (SDE):

$$dx = u_t(x)\,dt + g(t)\,dw$$

with annealing from real data at $x(0)\sim q_{\text{data}}$ to noise at $x(T)\sim\mathcal N(0,I)$. When $g(t)=0$, the reverse process simplifies to an ODE:

$$\frac{dx}{dt} = v_\theta(t,x)$$

Conditional flow matching uses interpolated trajectories $x(t) = (1-t)x_0 + t x_1$ with $x_1\sim\mathcal N(0,I)$, minimizing

$$\mathcal L_{\text{cfm}}(\theta) = \mathbb E_{t,x_0,x_1}\bigl\| v_\theta(t,x(t)) - (x_1 - x_0) \bigr\|^2$$

GBT regressors (typically XGBoost or equivalent) fit $v_\theta$ using real latent representations $L_0$ as anchors (Ihsan et al., 20 Nov 2025).
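As a concrete illustration of this objective, the sketch below builds interpolated points $x(t)$ and regresses the target velocity $x_1 - x_0$ with gradient-boosted trees. The use of `xgboost.XGBRegressor` wrapped in scikit-learn's `MultiOutputRegressor`, the number of $(t, x_1)$ draws per anchor, and the hyperparameters are all illustrative assumptions; the paper only specifies GBT regressors.

```python
# Minimal conditional-flow-matching sketch: sample (t, x1), form x(t), and fit
# GBT regressors to the target velocity (x1 - x0). Model choice and settings
# are illustrative assumptions.
import numpy as np
from sklearn.multioutput import MultiOutputRegressor
from xgboost import XGBRegressor

def fit_flow_field(latents_real, n_draws=8, seed=0):
    """latents_real: (n, d) array of minority-class latents L_0 used as anchors."""
    rng = np.random.default_rng(seed)
    n, d = latents_real.shape
    feats, targets = [], []
    for _ in range(n_draws):                        # several (t, x1) draws per anchor
        t = rng.uniform(size=(n, 1))
        x1 = rng.standard_normal((n, d))            # noise endpoint x1 ~ N(0, I)
        xt = (1.0 - t) * latents_real + t * x1      # interpolated point x(t)
        feats.append(np.hstack([t, xt]))            # regressor input: [t, x(t)]
        targets.append(x1 - latents_real)           # CFM regression target: x1 - x0
    X = np.vstack(feats)
    y = np.vstack(targets)
    gbt = MultiOutputRegressor(XGBRegressor(n_estimators=200, max_depth=4))
    gbt.fit(X, y)                                   # squared error, i.e. the L_cfm objective
    return gbt                                      # usable as `v_theta` in the sampler above
```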

5. Decoder and Reconstruction

The reverse-diffused latent $L_0$ is decoded using a transformer decoder mirroring the encoder, with cross-attention and feed-forward modules. The decoder outputs embeddings for each feature; categorical features are reconstructed via $\hat x_{\text{cat},j} = \mathrm{Softmax}(W_{\text{cat},j}\,E_{\text{cat},j} + b_{\text{cat},j})$ and numerical ones via $\hat x_{\text{num},i} = W_{\text{num},i}\,E_{\text{num},i} + b_{\text{num},i}$. The reconstruction loss,

$$\mathcal L_{\text{rec}} = \sum_i \| x_i - \hat x_i \|^2,$$

is used to pretrain the autoencoder on real data only (Ihsan et al., 20 Nov 2025).
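The per-feature output heads can be sketched as follows. Using cross-entropy on the categorical heads is a common practical substitution for the squared-error loss stated above and is labeled as such in the comments; names and shapes are illustrative.

```python
# Illustrative per-feature reconstruction heads: a softmax head per categorical
# feature and a linear head per numerical feature. Cross-entropy for the
# categorical heads is an assumed substitution for the stated squared error.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureHeads(nn.Module):
    def __init__(self, d_token, n_num, cat_cardinalities):
        super().__init__()
        # x_hat_num,i = W_num,i E_num,i + b_num,i
        self.num_heads = nn.ModuleList([nn.Linear(d_token, 1) for _ in range(n_num)])
        # x_hat_cat,j = Softmax(W_cat,j E_cat,j + b_cat,j)
        self.cat_heads = nn.ModuleList([nn.Linear(d_token, c) for c in cat_cardinalities])

    def forward(self, tokens_num, tokens_cat):
        # tokens_*: decoder output embeddings, one d_token vector per feature
        x_num_hat = torch.cat([h(tokens_num[:, i]) for i, h in enumerate(self.num_heads)], dim=1)
        cat_logits = [h(tokens_cat[:, j]) for j, h in enumerate(self.cat_heads)]
        return x_num_hat, cat_logits

def reconstruction_loss(x_num_hat, x_num, cat_logits, x_cat):
    loss = F.mse_loss(x_num_hat, x_num, reduction="sum")           # squared error on numerics
    for j, logits in enumerate(cat_logits):                        # cross-entropy on categoricals
        loss = loss + F.cross_entropy(logits, x_cat[:, j], reduction="sum")
    return loss
```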

6. Training Pipeline and Augmentation Application

The pipeline proceeds as follows:

  • Data is split into 70% train and 30% real-only test subsets.
  • Minority-class train examples $\{x_0\}$ provide the data for encoder–decoder pretraining via $\mathcal L_{\text{rec}}$.
  • Minority instances are encoded to $L_0$; noise samples $x_1$ are drawn and linearly interpolated to form flow trajectories.
  • GBTs are trained to fit the flow field according to $\mathcal L_{\text{cfm}}$.
  • Inference involves sampling $L(T)\sim\mathcal N(0,I)$, integrating the backward ODE, decoding $L(0)$ to synthetic samples $\hat x$, and adding these to the minority class at augmentation ratios in the range 25%–300% (see the sketch after this list).
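Tying these steps together, a hypothetical end-to-end driver might look like the following; `encode` and `decode` stand in for the pretrained transformer autoencoder, `fit_flow_field` and `sample_latents` refer to the illustrative sketches above, and the 70/30 split and augmentation ratio follow the text.

```python
# End-to-end augmentation sketch under the assumptions above; not the authors'
# reference implementation.
import numpy as np
from sklearn.model_selection import train_test_split

def augment_minority(X, y, minority_label, encode, decode, aug_ratio=1.0):
    """Return the training split with synthetic minority rows appended."""
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.30, stratify=y)
    X_min = X_tr[y_tr == minority_label]

    latents = encode(X_min)                          # L_0 from the pretrained encoder
    gbt = fit_flow_field(latents)                    # GBTs fit the CFM vector field
    n_syn = int(aug_ratio * len(X_min))              # e.g. 25%-300% of the minority count
    syn_latents = sample_latents(gbt, n_syn, latents.shape[1])
    X_syn = decode(syn_latents)                      # back to the original feature space

    X_aug = np.vstack([X_tr, X_syn])
    y_aug = np.concatenate([y_tr, np.full(n_syn, minority_label)])
    return X_aug, y_aug, X_te, y_te                  # test split stays real-only
```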

AttentionForest, due to its compact latent dimension ($d_{\text{latent}} \ll D$), yields reduced per-step computational cost. The transformer autoencoder adds computational overhead, but this is offset by gains in fidelity and minority-class recall (Ihsan et al., 20 Nov 2025).

7. Empirical Evaluation: Utility, Privacy, and Calibration

Across 11 tabular datasets spanning varied domains, AttentionForest achieves average minority recall of ~0.46 and F1 of ~0.48 for Random Forest and XGBoost classifiers, outperforming PCAForest, EmbedForest, Forest-Diffusion, SMOTE, and CTGAN. Recall gains remain stable up to 300% augmentation ratios. Statistical similarity, measured by one-dimensional Wasserstein distance (WD), is strong: AttentionForest attains WD ~33.2 (vs. CTGAN ~352, SMOTE ~35), indicating realistic distributions. PCAForest achieves WD ~0.16 in PCA space but with reduced model capacity.

AttentionForest maintains or improves precision (~0.48–0.49) and calibration compared to competitive oversampling baselines. Privacy is evaluated using Distance to Closest Record (DCR) and Nearest-Neighbor Distance Ratio (NNDR); AttentionForest (DCR ~192.5, NNDR ~0.68) parallels Forest-Diffusion but achieves greater sample realism. Ablation studies show smaller latent embedding dimensions improve recall, while aggressive learning rates degrade stability and utility. Optimal hyperparameters commonly include learning rate 1e-3, moderate latent size, and 50–100 diffusion steps (Ihsan et al., 20 Nov 2025).

| Model           | Recall | F1    | WD     | DCR    | NNDR  |
|-----------------|--------|-------|--------|--------|-------|
| AttentionForest | ~0.46  | ~0.48 | ~33.2  | ~192.5 | ~0.68 |
| EmbedForest     | -      | -     | ~165.7 | ~534.9 | ~0.85 |
| PCAForest       | -      | -     | ~0.16  | ~1.8   | ~0.77 |
| CTGAN           | -      | -     | ~352   | -      | -     |
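For reference, the realism and privacy metrics reported above can be computed along the following lines; the exact preprocessing and aggregation used in the paper may differ from this sketch.

```python
# Illustrative metric computations: mean one-dimensional Wasserstein distance
# across features, Distance to Closest Record (DCR), and Nearest-Neighbor
# Distance Ratio (NNDR). Aggregation choices here are assumptions.
import numpy as np
from scipy.stats import wasserstein_distance
from sklearn.neighbors import NearestNeighbors

def mean_wasserstein(real, synthetic):
    """Average 1D Wasserstein distance over feature columns."""
    return np.mean([wasserstein_distance(real[:, j], synthetic[:, j])
                    for j in range(real.shape[1])])

def dcr_nndr(real, synthetic):
    """DCR: mean distance from each synthetic row to its closest real record.
    NNDR: mean ratio of first- to second-nearest real-neighbor distances."""
    nn = NearestNeighbors(n_neighbors=2).fit(real)
    dist, _ = nn.kneighbors(synthetic)
    dcr = dist[:, 0].mean()
    nndr = (dist[:, 0] / np.maximum(dist[:, 1], 1e-12)).mean()
    return dcr, nndr
```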

8. Relation to Neural Attention Forests and Implications

While AttentionForest's nomenclature is shared with the Neural Attention Forest (NAF) framework (Konstantinov et al., 2023), the underlying methodologies differ significantly. NAF integrates attention mechanisms into random forests via learned attention weights at the leaf and forest levels, employing neural networks for scoring and aggregation. This yields a kernel-regression-style architecture, trained end-to-end, in which fixed trees are enhanced with two layers of softmax-learned attention. NAF demonstrates improved predictive accuracy over classical random forests and extremely randomized trees on several tabular benchmarks, particularly when uniform averaging is suboptimal and local structure is present (Konstantinov et al., 2023).

AttentionForest, in contrast, deploys transformer attention as a mechanism for nonlinear latent embedding prior to tree-driven diffusion-based sample generation. Both frameworks belong to the class of "forest-transformer" architectures that combine tree-based inductive bias with neural attention for improved tabular modeling. A plausible implication is that high-order attention mechanisms further extend the capacity of forests to represent complex tabular dependencies, especially in undersampled or imbalanced settings.

9. Significance, Limitations, and Future Directions

AttentionForest advances tabular data augmentation by fusing transformer-based embeddings with latent diffusion via GBTs, offering high minority-class recall, realistic sample synthesis, and competitive privacy preservation in a unified pipeline. The method remains tunable via latent dimension and learning rate, and is empirically robust to augmentation ratio within evaluated ranges. Limitations include increased compute overhead for transformer autoencoders versus linear or shallow nonlinear variants, and sensitivity to hyperparameter choices. Future directions may include adaptive latent-dimension selection, modular integrations with alternative flow-learners, and further privacy analyses under tighter constraints (Ihsan et al., 20 Nov 2025).
