AttentionForest: Transformer Tabular Oversampling
- AttentionForest is a transformer-based autoencoder architecture that employs multi-head self-attention to capture high-order feature interactions in tabular data.
- It integrates a latent-space tree-driven diffusion framework using gradient-boosted trees and conditional flow matching to generate realistic and privacy-aware minority-class samples.
- Empirical evaluations demonstrate enhanced minority recall, low Wasserstein distance for sample realism, and competitive privacy metrics across healthcare, finance, and manufacturing datasets.
AttentionForest is a transformer-based autoencoder architecture integrated into a latent-space, tree-driven diffusion framework for minority-class oversampling in tabular data. Distinguished from related approaches such as PCAForest and EmbedForest by its attention-augmented embedding, AttentionForest leverages multi-head self-attention to encode high-order feature interactions in a compact latent space. This architecture is combined with a continuous-time diffusion process modeled via gradient-boosted trees (GBTs) and conditional flow matching (CFM), providing a mechanism to synthesize realistic, privacy-aware samples under severe class imbalance. Across multiple benchmark datasets in healthcare, finance, and manufacturing, AttentionForest achieves superior minority recall, robust sample realism (low Wasserstein distance), and competitive privacy metrics, marking it as a high-fidelity tabular data augmentation method (Ihsan et al., 20 Nov 2025).
1. Problem Formulation and Motivation
Class imbalance is pervasive in domains such as defect detection, fraud detection, and rare disease prediction, where the minority class critically drives predictive utility. Conventional resampling methods, including random undersampling and SMOTE, often introduce bias or artifacts by eliminating majority samples or interpolating between minority instances, leading to overfitting and information loss. Generative models (GANs, VAEs, diffusion models) have partially alleviated this, but they typically struggle with heterogeneous tabular feature types, are computationally intensive, and may inadvertently compromise privacy through overly high-fidelity sample synthesis. AttentionForest addresses these limitations by synthesizing minority-class samples in a latent space designed to preserve tabular structure, enhance computational efficiency, and limit privacy risk (Ihsan et al., 20 Nov 2025).
2. Latent-Space Tree-Driven Diffusion Framework
AttentionForest is one of three variants in the latent-space tree-driven diffusion family. Samples are first embedded into a low-dimensional latent space; AttentionForest utilizes a transformer-based autoencoder, in contrast to the linear (PCAForest) or shallow nonlinear (EmbedForest) alternatives. Within this latent space, synthetic generation occurs through a reverse diffusion process, with GBTs learning the continuous vector field under CFM. Generation begins from noise at $t = 1$ and integrates the learned ordinary differential equation backward to $t = 0$, followed by decoding to the original feature space. This architecture enables compact per-sample computation while retaining fidelity in feature interaction modeling (Ihsan et al., 20 Nov 2025).
| Model Variant | Encoder Type | Downstream Utility |
|---|---|---|
| PCAForest | Linear PCA | Fast, lower recall |
| EmbedForest | Nonlinear AE | Intermediate utility |
| AttentionForest | Transformer AE | Highest recall, F1 |
Editor's term: AE = autoencoder.
3. Attention-Augmented Embedding Architecture
The AttentionForest encoder tokenizes tabular features as follows: each numerical feature is linearly projected into the model dimension, and each categorical feature is mapped through a learned embedding table. Sinusoidal positional encodings are added to maintain feature ordering, yielding the input token sequence. The transformer encoder comprises $L$ stacked layers with multi-head self-attention and feed-forward blocks:
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V,$$
where $Q$, $K$, $V$ are linear projections of the token representations per head; multi-head outputs are concatenated and linearly projected to form each layer's latent representation. After $L$ layers, the output is the latent embedding $z$ (Ihsan et al., 20 Nov 2025).
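A minimal PyTorch-style sketch of this attention-augmented embedding is given below, assuming the tokenizer-plus-encoder decomposition described above. Class names, dimensions (`d_model`, `d_latent`), head counts, and the mean-pooling step are illustrative choices, not the authors' reference implementation.

```python
# Illustrative attention-augmented tabular encoder (assumptions noted in the lead-in).
import math
import torch
import torch.nn as nn


class TabularTokenizer(nn.Module):
    """Turns numerical and categorical columns into a sequence of d_model-dim tokens."""

    def __init__(self, n_num: int, cat_cardinalities: list, d_model: int = 64):
        super().__init__()
        # One linear projection per numerical feature (scalar -> d_model).
        self.num_proj = nn.ModuleList([nn.Linear(1, d_model) for _ in range(n_num)])
        # One learned embedding table per categorical feature.
        self.cat_emb = nn.ModuleList([nn.Embedding(c, d_model) for c in cat_cardinalities])
        n_tokens = n_num + len(cat_cardinalities)
        # Fixed sinusoidal positional encodings preserve feature ordering.
        pe = torch.zeros(n_tokens, d_model)
        pos = torch.arange(n_tokens, dtype=torch.float32).unsqueeze(1)
        div = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(pos * div)
        pe[:, 1::2] = torch.cos(pos * div)
        self.register_buffer("pos_enc", pe)

    def forward(self, x_num, x_cat):
        num_tokens = [proj(x_num[:, j:j + 1]) for j, proj in enumerate(self.num_proj)]
        cat_tokens = [emb(x_cat[:, j]) for j, emb in enumerate(self.cat_emb)]
        tokens = torch.stack(num_tokens + cat_tokens, dim=1)  # (B, n_tokens, d_model)
        return tokens + self.pos_enc


class AttentionEncoder(nn.Module):
    """Stacked multi-head self-attention layers, mean-pooled into a compact latent z."""

    def __init__(self, tokenizer, d_model: int = 64, n_heads: int = 4,
                 n_layers: int = 2, d_latent: int = 16):
        super().__init__()
        self.tokenizer = tokenizer
        layer = nn.TransformerEncoderLayer(d_model, n_heads,
                                           dim_feedforward=4 * d_model, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.to_latent = nn.Linear(d_model, d_latent)

    def forward(self, x_num, x_cat):
        h = self.encoder(self.tokenizer(x_num, x_cat))  # (B, n_tokens, d_model)
        return self.to_latent(h.mean(dim=1))            # latent z, shape (B, d_latent)
```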
4. Conditional Flow Matching and Synthetic Sample Generation
Diffusion in the latent space is formulated as a forward stochastic differential equation (SDE),
$$dz_t = f(z_t, t)\,dt + g(t)\,dw_t,$$
with $z_t$ annealing from real data at $t = 0$ to Gaussian noise at $t = 1$. When the stochastic term vanishes, the reverse process simplifies to an ODE,
$$\frac{dz_t}{dt} = v(z_t, t).$$
Conditional flow matching uses linearly interpolated trajectories $z_t = (1 - t)\,z_0 + t\,z_1$, with $z_0$ a real latent and $z_1$ a noise sample, minimizing
$$\mathcal{L}_{\mathrm{CFM}} = \mathbb{E}_{t,\,z_0,\,z_1}\big[\,\|\hat{v}(z_t, t) - (z_1 - z_0)\|^2\,\big].$$
GBT regressors (typically XGBoost or equivalent) fit $\hat{v}(z_t, t)$ using real latent representations as anchors (Ihsan et al., 20 Nov 2025).
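The sketch below illustrates this step under stated assumptions: linear interpolation paths, one regressor per latent dimension, and scikit-learn's `GradientBoostingRegressor` standing in for XGBoost. Step counts, seeds, and estimator settings are arbitrary choices, not the paper's configuration.

```python
# CFM target construction, GBT fitting, and backward ODE integration (illustrative sketch).
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor


def fit_flow_gbts(z0: np.ndarray, n_steps: int = 50, seed: int = 0):
    """Fit one GBT per latent dimension to the CFM target v = z1 - z0 on a time grid."""
    rng = np.random.default_rng(seed)
    n, d = z0.shape
    z1 = rng.standard_normal((n, d))            # noise endpoints at t = 1
    feats, targets = [], []
    for t in np.linspace(0.0, 1.0, n_steps):
        zt = (1.0 - t) * z0 + t * z1            # linear interpolation between data and noise
        feats.append(np.hstack([zt, np.full((n, 1), t)]))
        targets.append(z1 - z0)                 # constant target velocity along each path
    X, Y = np.vstack(feats), np.vstack(targets)
    return [GradientBoostingRegressor(n_estimators=100).fit(X, Y[:, j]) for j in range(d)]


def sample_latents(gbts, d: int, n_samples: int, n_steps: int = 50, seed: int = 1) -> np.ndarray:
    """Generate latents by integrating the learned ODE backward from noise (t=1) to data (t=0)."""
    rng = np.random.default_rng(seed)
    z = rng.standard_normal((n_samples, d))     # start from Gaussian noise at t = 1
    dt = 1.0 / n_steps
    for k in range(n_steps, 0, -1):
        t = k * dt
        X = np.hstack([z, np.full((n_samples, 1), t)])
        v = np.column_stack([g.predict(X) for g in gbts])
        z = z - dt * v                          # Euler step backward along dz/dt = v
    return z
```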
5. Decoder and Reconstruction
The reverse-diffused latent is decoded using a transformer decoder mirroring the encoder, with cross-attention and feed-forward modules. The decoder outputs embeddings for each feature; categorical features are reconstructed through per-feature classification heads and numerical features through regression heads. The resulting reconstruction loss $\mathcal{L}_{\mathrm{rec}}$ is used to pretrain the autoencoder on real data only (Ihsan et al., 20 Nov 2025).
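A hedged sketch of one plausible instantiation of this mixed reconstruction objective is shown below: mean-squared error on numerical heads and cross-entropy on categorical heads, equally weighted. The exact weighting and head parameterization in the paper may differ.

```python
# Illustrative mixed reconstruction loss for numerical + categorical features (assumed form).
import torch
import torch.nn.functional as F


def reconstruction_loss(num_pred, num_true, cat_logits, cat_true):
    """num_pred/num_true: (B, n_num); cat_logits[j]: (B, n_classes_j); cat_true: (B, n_cat) long."""
    loss = F.mse_loss(num_pred, num_true)                       # numerical heads: MSE
    for j, logits in enumerate(cat_logits):                     # categorical heads: cross-entropy
        loss = loss + F.cross_entropy(logits, cat_true[:, j])
    return loss
```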
6. Training Pipeline and Augmentation Application
The pipeline proceeds as follows:
- Data is split into 70% train and 30% real-only test subsets.
- Minority-class training examples provide the data for encoder–decoder pretraining via the reconstruction loss $\mathcal{L}_{\mathrm{rec}}$.
- Minority instances are encoded to latent vectors $z_0$; noise samples $z_1$ are drawn and linearly interpolated to form flow trajectories $z_t$.
- GBTs are trained to fit the flow field according to the conditional flow matching objective $\mathcal{L}_{\mathrm{CFM}}$.
- Inference involves sampling noise $z_1$, integrating the learned ODE backward to $t = 0$, decoding to synthetic samples, and adding these to the minority class at augmentation ratios in the range 25%–300% (a condensed sketch of this pipeline follows the list).
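The orchestration below ties these steps together, reusing the illustrative helpers from the earlier sketches (`AttentionEncoder`, `fit_flow_gbts`, `sample_latents`) and assuming a pretrained `decoder` module that maps latents back to feature space. It is a condensed sketch of the pipeline, not the authors' code.

```python
# Condensed augmentation pipeline sketch built on the assumed helpers defined above.
import numpy as np
import torch


def augment_minority(encoder, decoder, x_min_num, x_min_cat, ratio: float = 1.0):
    """Produce ratio * n_minority synthetic minority rows via the latent flow model."""
    with torch.no_grad():
        z0 = encoder(x_min_num, x_min_cat).cpu().numpy()     # encode real minority rows
    gbts = fit_flow_gbts(z0)                                 # GBT flow field via CFM (sketch above)
    n_new = int(ratio * z0.shape[0])                         # e.g. ratio in [0.25, 3.0]
    z_syn = sample_latents(gbts, d=z0.shape[1], n_samples=n_new)
    with torch.no_grad():
        x_syn = decoder(torch.from_numpy(z_syn).float())     # assumed decoder: latent -> features
    return x_syn                                             # append to the minority training set
```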
AttentionForest, due to its compact latent dimension, yields reduced per-step computational cost. The transformer autoencoder adds computational overhead, but this is offset by gains in fidelity and minority-class recall (Ihsan et al., 20 Nov 2025).
7. Empirical Evaluation: Utility, Privacy, and Calibration
Across 11 tabular datasets spanning varied domains, AttentionForest achieves average minority recall of approximately 0.46 and F1 of approximately 0.48 for Random Forest and XGBoost classifiers, outperforming PCAForest, EmbedForest, Forest-Diffusion, SMOTE, and CTGAN. Recall gains remain stable up to 300% augmentation ratios. Statistical similarity, measured by one-dimensional Wasserstein distance (WD), places AttentionForest at approximately 33.2, versus roughly 352 for CTGAN and higher values for SMOTE, indicating realistic distributions. PCAForest attains a much lower WD of approximately 0.16, but this is measured in PCA space and reflects reduced model capacity.
AttentionForest maintains or improves precision (up to $0.49$) and calibration compared to competitive oversampling baselines. Privacy is evaluated using Distance to Closest Record (DCR) and Nearest-Neighbor Distance Ratio (NNDR); AttentionForest (DCR $\approx 192.5$, NNDR $\approx 0.68$) parallels Forest-Diffusion but achieves greater sample realism. Ablation studies show that smaller latent embedding dimensions improve recall, while aggressive learning rates degrade stability and utility. Optimal hyperparameters commonly include a conservative learning rate, a moderate latent size, and 50–100 diffusion steps (Ihsan et al., 20 Nov 2025).
| Model | Recall | F1 | WD | DCR | NNDR |
|---|---|---|---|---|---|
| AttentionForest | ~0.46 | ~0.48 | ~33.2 | ~192.5 | ~0.68 |
| EmbedForest | - | - | ~165.7 | ~534.9 | ~0.85 |
| PCAForest | - | - | ~0.16 | ~1.8 | ~0.77 |
| CTGAN | - | - | ~352 | - | - |
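For concreteness, the sketch below computes the two privacy metrics referenced above under commonly used definitions: DCR as each synthetic row's Euclidean distance to its closest real record, and NNDR as the ratio of nearest to second-nearest real-neighbor distances. These definitions and the averaging choice are assumptions; the paper's exact preprocessing and aggregation may differ.

```python
# Illustrative DCR and NNDR computation under assumed standard definitions.
import numpy as np
from sklearn.neighbors import NearestNeighbors


def privacy_metrics(real: np.ndarray, synthetic: np.ndarray):
    """Return mean DCR and mean NNDR of synthetic rows with respect to real rows."""
    nn = NearestNeighbors(n_neighbors=2).fit(real)
    dists, _ = nn.kneighbors(synthetic)          # (n_syn, 2): nearest and second-nearest distances
    dcr = dists[:, 0].mean()                     # larger DCR -> synthetic rows lie farther from real data
    nndr = (dists[:, 0] / np.maximum(dists[:, 1], 1e-12)).mean()  # values near 1 suggest less memorization
    return float(dcr), float(nndr)
```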
8. Relation to Neural Attention Forests and Implications
While AttentionForest's nomenclature is shared with the Neural Attention Forest (NAF) framework (Konstantinov et al., 2023), the underlying methodologies differ significantly. NAF integrates attention mechanisms into random forests via learned attention weights at the leaf and forest levels, employing neural networks for scoring and aggregation. The result is an end-to-end trainable, kernel-regression-style architecture in which fixed trees are augmented with two levels of softmax-learned attention. NAF demonstrates improved predictive accuracy over classical random forests and extremely randomized trees (RF/ERT) on several tabular benchmarks, particularly when uniform averaging is suboptimal and local structure is present (Konstantinov et al., 2023).
AttentionForest, in contrast, deploys transformer attention as a mechanism for nonlinear latent embedding prior to tree-driven diffusion-based sample generation. Both frameworks belong to the class of "forest-transformer" architectures that combine tree-based inductive bias with neural attention for improved tabular modeling. A plausible implication is that high-order attention mechanisms further extend the capacity of forests to represent complex tabular dependencies, especially in undersampled or imbalanced settings.
9. Significance, Limitations, and Future Directions
AttentionForest advances tabular data augmentation by fusing transformer-based embeddings with latent diffusion via GBTs, offering high minority-class recall, realistic sample synthesis, and competitive privacy preservation in a unified pipeline. The method remains tunable via latent dimension and learning rate, and is empirically robust to augmentation ratio within evaluated ranges. Limitations include increased compute overhead for transformer autoencoders versus linear or shallow nonlinear variants, and sensitivity to hyperparameter choices. Future directions may include adaptive latent-dimension selection, modular integrations with alternative flow-learners, and further privacy analyses under tighter constraints (Ihsan et al., 20 Nov 2025).