Scalable Interpolant Transformers (SiT)
- The paper introduces SiT, a family of flow-based Transformers that frames generative modeling through stochastic interpolants connecting noise and data.
- It retains the DiT architecture (a Vision Transformer backbone with a pre-trained VAE tokenizer), isolating the generative mechanism for controlled comparison.
- Empirical results demonstrate that SiT achieves better FID than DiT at every model scale under matched compute and data pipelines.
Flow-based Transformers, specifically Scalable Interpolant Transformers (SiTs), are a new family of generative models that unify probability-flow ODEs and diffusion-based SDEs within a flexible, interpolant-driven framework. These models utilize a Vision Transformer backbone and are designed to provide modular, performant alternatives to traditional diffusion models, offering superior sample quality by optimizing both the interpolation scheme and the loss objective (Ma et al., 2024).
1. Foundation and Theoretical Framework
SiTs are constructed around the interpolant paradigm for generative modeling. They connect two endpoint distributions (typically the target data distribution and an auxiliary base, e.g., a standard normal) via a parameterizable stochastic interpolant
$$x_t = \alpha_t\, x_* + \sigma_t\, \varepsilon, \qquad t \in [0, 1],$$
where $x_*$ is a data sample, $\varepsilon \sim \mathcal{N}(0, I)$, and $\alpha_t$, $\sigma_t$ are differentiable schedules satisfying $\alpha_0 = \sigma_1 = 1$ and $\alpha_1 = \sigma_0 = 0$. The interpolant determines the trajectory along which the model transports noise to data, providing a unifying family that subsumes standard DDPMs, score-based models, and rectified flows.
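As a concrete sketch, two common schedule choices can be implemented directly (the linear and GVP expressions below are the standard forms, assumed here rather than quoted from the paper):

```python
import numpy as np

def interpolant(t, kind="linear"):
    """Return (alpha_t, sigma_t) for two common interpolant schedules.

    linear: alpha_t = 1 - t,        sigma_t = t
    gvp:    alpha_t = cos(pi*t/2),  sigma_t = sin(pi*t/2)  (variance preserving)
    """
    if kind == "linear":
        return 1.0 - t, t
    if kind == "gvp":
        return np.cos(0.5 * np.pi * t), np.sin(0.5 * np.pi * t)
    raise ValueError(kind)

def sample_xt(x_star, eps, t, kind="linear"):
    # Draw a point on the interpolant path: x_t = alpha_t * x_star + sigma_t * eps
    alpha, sigma = interpolant(t, kind)
    return alpha * x_star + sigma * eps
```

Both schedules satisfy the endpoint conditions above; the GVP form additionally keeps $\alpha_t^2 + \sigma_t^2 = 1$ along the whole path.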
The induced marginal law $p_t$ is governed by a probability-flow ODE,
$$\dot{x}_t = v(x_t, t), \qquad v(x, t) = \mathbb{E}\!\left[\dot\alpha_t\, x_* + \dot\sigma_t\, \varepsilon \mid x_t = x\right],$$
whose solutions transport the data distribution $p_0$ to the base $p_1$. For generative sampling, this ODE is solved backward (from $t = 1$ to $t = 0$) to recover samples from the data distribution.
2. Architecture and Model Parameterization
SiT strictly retains the architectural setting of DiT for controlled comparisons. It uses a class-conditional Vision Transformer (ViT) with patch size 2, leveraging a pre-trained VAE as the image tokenizer (encoder/decoder). The temporal index and class label are injected into normalization layers via AdaLN-Zero, ensuring precise conditioning (Ma et al., 2024).
Model scales (S/B/L/XL) match DiT: the number of layers, hidden dimension, attention heads, parameter counts, and forward-pass compute (GFLOPs) are identical. All empirical comparisons fix the VAE and data pipeline, isolating the generative mechanism.
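For reference, the shared DiT/SiT scale configurations can be written as a small lookup table (depth, width, and head counts follow the published DiT configurations; treat them as indicative rather than a transcription of the SiT paper):

```python
# DiT/SiT Transformer scale configurations (per the DiT paper's model table).
CONFIGS = {
    "S":  {"depth": 12, "hidden": 384,  "heads": 6},
    "B":  {"depth": 12, "hidden": 768,  "heads": 12},
    "L":  {"depth": 24, "hidden": 1024, "heads": 16},
    "XL": {"depth": 28, "hidden": 1152, "heads": 16},
}

def head_dim(scale):
    # Per-head dimension; hidden size must divide evenly across heads.
    cfg = CONFIGS[scale]
    assert cfg["hidden"] % cfg["heads"] == 0
    return cfg["hidden"] // cfg["heads"]
```

Because SiT reuses these settings verbatim, any FID difference against DiT is attributable to the generative formulation rather than capacity.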
3. Objectives and Modular Losses
The interpolant formulation admits both score-matching and velocity-matching training objectives.
- Score matching:
  $$\mathcal{L}_{\mathrm{score}}(\theta) = \int_0^1 \mathbb{E}_{x_*,\,\varepsilon}\!\left[\lambda(t)\,\big\|\sigma_t\, s_\theta(x_t, t) + \varepsilon\big\|^2\right] dt,$$
  where $s_\theta$ is a neural network estimate of the score $\nabla \log p_t(x)$, $\varepsilon$ is the standard Gaussian noise sampled in the forward process, and $\lambda(t)$ is an optional time weighting ($\lambda \equiv 1$ gives the unweighted objective).
- Velocity matching:
  $$\mathcal{L}_v(\theta) = \int_0^1 \mathbb{E}_{x_*,\,\varepsilon}\!\left[\big\|v_\theta(x_t, t) - \dot\alpha_t\, x_* - \dot\sigma_t\, \varepsilon\big\|^2\right] dt,$$
  where $v_\theta$ regresses the conditional expectation $\mathbb{E}[\dot\alpha_t\, x_* + \dot\sigma_t\, \varepsilon \mid x_t]$.
The two parameterizations are mathematically connected. Velocity and score predictions can be deterministically interconverted for a fixed interpolant, and the choice can be decoupled from the overall sampling dynamics.
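The conversion can be made explicit with a short derivation from the interpolant identities (ours, not quoted from the paper): using $\nabla \log p_t(x) = -\mathbb{E}[\varepsilon \mid x_t = x]/\sigma_t$ and solving $x = \alpha_t\,\mathbb{E}[x_* \mid x] + \sigma_t\,\mathbb{E}[\varepsilon \mid x]$ for $\mathbb{E}[x_* \mid x]$, the velocity $v = \dot\alpha_t\,\mathbb{E}[x_* \mid x] + \dot\sigma_t\,\mathbb{E}[\varepsilon \mid x]$ becomes

```latex
v(x, t) = \frac{\dot\alpha_t}{\alpha_t}\, x
        + \sigma_t \left( \frac{\dot\alpha_t}{\alpha_t}\,\sigma_t - \dot\sigma_t \right) s(x, t),
```

so for a fixed interpolant either network output determines the other by an affine map.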
Weighted-score and hybrid objectives are also explored (with weightings derived by differentiating the interpolant coefficients), providing additional flexibility (Ma et al., 2024).
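A minimal sketch of the velocity-matching objective under the linear interpolant (the predictor passed in is a placeholder, not the SiT network; with $\alpha_t = 1 - t$, $\sigma_t = t$ the regression target is $\varepsilon - x_*$):

```python
import numpy as np

def velocity_matching_loss(v_pred_fn, x_star, rng):
    """Monte Carlo estimate of the velocity-matching loss under the linear
    interpolant (alpha_dot = -1, sigma_dot = 1, target = eps - x_star)."""
    t = rng.uniform(size=(x_star.shape[0], 1))
    eps = rng.standard_normal(x_star.shape)
    x_t = (1.0 - t) * x_star + t * eps
    target = eps - x_star
    return np.mean((v_pred_fn(x_t, t) - target) ** 2)

rng = np.random.default_rng(0)
x_star = rng.standard_normal((5000, 4))
# Sanity check with a trivial zero predictor: per-element loss is
# E[(eps - x_star)^2] = 2 for standard-normal data.
loss = velocity_matching_loss(lambda x, t: np.zeros_like(x), x_star, rng)
```

Swapping in a different interpolant only changes how `x_t` and `target` are formed; the regression structure is unchanged.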
4. Sampling Dynamics and Diffusion Coefficients
Generative sampling can proceed via the deterministic ODE (Heun method) or the stochastic reverse-time SDE (Euler-Maruyama):
- Heun (ODE):

```
x_N ~ N(0, I)                          # start from noise at t_N = 1
for i = N, ..., 1:
    v_i = v_θ(x_i, t_i)
    x̄ = x_i + Δt · v_i                 # Euler predictor (Δt = t_{i-1} − t_i < 0)
    v̄ = v_θ(x̄, t_{i-1})
    x_{i-1} = x_i + (Δt/2) · (v_i + v̄)  # trapezoidal corrector
return x_0
```
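The Heun loop can be checked end to end on a toy problem where the velocity field is known in closed form: under the linear interpolant with a Gaussian target $\mathcal{N}(\mu, I)$, the probability-flow ODE should carry $\mathcal{N}(0, I)$ samples to $\mathcal{N}(\mu, I)$. The helpers below are our own illustration, not the SiT implementation:

```python
import numpy as np

def velocity(x, t, mu):
    """Closed-form PF-ODE velocity for the linear interpolant
    x_t = (1 - t) * x_star + t * eps with x_star ~ N(mu, I), eps ~ N(0, I)."""
    a, b = 1.0 - t, t
    return -mu + (b - a) * (x - a * mu) / (a**2 + b**2)

def heun_sample(n_samples, dim, mu, n_steps=100, seed=0):
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((n_samples, dim))   # x_N ~ N(0, I) at t = 1
    ts = np.linspace(1.0, 0.0, n_steps + 1)
    for i in range(n_steps):
        t, t_next = ts[i], ts[i + 1]
        dt = t_next - t                         # negative: noise -> data
        v = velocity(x, t, mu)
        x_pred = x + dt * v                     # Euler predictor
        v_pred = velocity(x_pred, t_next, mu)
        x = x + 0.5 * dt * (v + v_pred)         # trapezoidal corrector
    return x

mu = 3.0
samples = heun_sample(4000, 2, mu)              # should be ~ N(mu, I)
```

In a real SiT sampler, `velocity` would be the trained network; everything else is unchanged.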
- Euler-Maruyama (SDE):

```
x_N ~ N(0, I)                          # start from noise at t_N = 1
for i = N, ..., 2:
    s_i = s_θ(x_i, t_i)
    d_i = v_θ(x_i, t_i) − (w_i/2) · s_i    # reverse-time drift (Δt < 0)
    x_{i-1} = x_i + Δt · d_i + sqrt(w_i · |Δt|) · ξ,  ξ ~ N(0, I)
x_0 = x_1 + Δt · v_θ(x_1, t_1)         # final deterministic ODE step (no noise)
return x_0
```
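The stochastic sampler admits the same toy check: with a Gaussian target the score is also available in closed form, so the Euler-Maruyama loop can be run without any learned network (helper names are hypothetical, and the linear interpolant is assumed):

```python
import numpy as np

def velocity(x, t, mu):
    # Closed-form PF-ODE velocity for the linear interpolant, target N(mu, I).
    a, b = 1.0 - t, t
    return -mu + (b - a) * (x - a * mu) / (a**2 + b**2)

def score(x, t, mu):
    # grad log of the marginal N(a * mu, (a^2 + b^2) I) at time t
    a, b = 1.0 - t, t
    return -(x - a * mu) / (a**2 + b**2)

def em_sample(n_samples, dim, mu, n_steps=200, seed=0):
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((n_samples, dim))   # t = 1: pure noise
    ts = np.linspace(1.0, 0.0, n_steps + 1)
    for i in range(n_steps):
        t, t_next = ts[i], ts[i + 1]
        dt = t_next - t                         # negative step
        if i < n_steps - 1:
            w = t                               # w_t = t vanishes at the data end
            drift = velocity(x, t, mu) - 0.5 * w * score(x, t, mu)
            x = x + dt * drift + np.sqrt(w * abs(dt)) * rng.standard_normal(x.shape)
        else:
            x = x + dt * velocity(x, t, mu)     # final deterministic ODE step
    return x

mu = 3.0
samples = em_sample(4000, 2, mu)                # should be ~ N(mu, I)
```

Any schedule `w` can be dropped in without touching the velocity or score, which is exactly the decoupling the next subsection describes.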
The diffusion coefficient $w_t$ governs the noise level injected by the reverse-time SDE. Unlike standard SBDMs, SiT can set $w_t$ independently of the forward noising process:
- $w_t$ inherited from the forward process (the standard SBDM choice)
- $w_t \propto \sigma_t$, which damps the $1/\sigma_t$ growth of the score and eliminates the singularity at $t = 0$
- smooth schedules in $t$ (e.g., sinusoidal or polynomial bumps)
The best $w_t$ is selected post-training for optimal FID; the framework thus decouples training and sampling, enabling extensive ablation (Ma et al., 2024).
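Because $w_t$ enters only the sampler, candidate schedules can be ordinary functions swapped at inference time. The specific forms below are illustrative choices, not the paper's exact list:

```python
import numpy as np

# Illustrative diffusion-coefficient schedules (t = 0 is data, t = 1 is noise).
# Each is a plain function of t handed to the SDE sampler, so switching
# schedules requires no retraining.
SCHEDULES = {
    "sigma":  lambda t: t,                       # track sigma_t = t (linear interpolant)
    "vanish": lambda t: t * (1.0 - t),           # -> 0 at both endpoints
    "sin2":   lambda t: np.sin(np.pi * t) ** 2,  # smooth bump, 0 at t = 0 and t = 1
}
```

Post-training selection then amounts to sampling with each candidate and keeping whichever minimizes FID.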
5. Empirical Results and Scaling Laws
SiT exhibits superior generative modeling across all tested scales compared to DiT, holding constant the non-generative components.
ImageNet 256×256 FID-50K (400K training steps, no classifier-free guidance):
| Model | Params | GFLOPs | DiT FID | SiT FID |
|---|---|---|---|---|
| S | 33M | 6.1 | 68.4 | 57.6 |
| B | 130M | 23.0 | 43.5 | 33.5 |
| L | 458M | 80.7 | 23.3 | 18.8 |
| XL | 675M | 118.6 | 19.5 | 17.2 |
Extended training of SiT-XL (7M steps) with CFG=1.5 achieves FID 2.06, surpassing DiT-XL at FID 2.27 under identical protocol (Ma et al., 2024).
Ablations show:
- Score vs velocity parameterization: weighted-score or velocity-matching outperform unweighted score matching.
- Choice of interpolant: linear and GVP interpolants yield better learning efficiency and lower final FID than VP.
- Sampler choice: the SDE attains lower final FID; the ODE converges faster with few function evaluations.
- Guidance: classifier-free guidance extends seamlessly to velocity-prediction flows.
The FID improvement from DiT to SiT holds at every model scale; the absolute gap persists rather than shrinking as models grow.
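Guidance on a velocity field follows the usual classifier-free recipe: a one-line combination of the conditional and unconditional predictions, sketched here in isolation:

```python
import numpy as np

def cfg_velocity(v_cond, v_uncond, scale):
    # Classifier-free guidance on velocity predictions: move past the
    # conditional estimate by the guidance scale (scale = 1 recovers v_cond).
    return v_uncond + scale * (v_cond - v_uncond)

guided = cfg_velocity(np.array([2.0]), np.array([1.0]), 1.5)
```

At sampling time the guided velocity simply replaces `v_θ` inside the ODE or SDE loop.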
6. Modular Analyses and Generalization
The SiT framework is inherently modular:
- Time discretization and the interpolant family can be systematically ablated.
- Score/velocity prediction choice is a plug-in parameter.
- Training schedule, patch tokenization, and architectural hyperparameters are fully transportable from DiT.
- The decoupling of the diffusion coefficient $w_t$ from training enables post-hoc tuning without retraining.
The interpolant ODE formalism unifies prior flow-based, diffusion, and rectified models. Methodologically, this permits the disentanglement of fundamental design choices, enabling robust analysis of each factor's impact on convergence and sample quality.
7. Significance and Implications
SiT demonstrates that with its unified interpolant transport and flow-matching training, a single Transformer architecture can combine the strengths of both diffusion and flow models:
- Improved FID at fixed compute and parameter budget.
- Flexible and modular for ablation studies.
- Decoupled design optimizations (interpolant, sampler, diffusion schedule, etc.).
A plausible implication is that future large-scale multimodal generative models may benefit from interpolant-based flow frameworks, as these approaches eliminate the historical constraints of tied training and forward noise schedules while offering continuous improvements with scale (Ma et al., 2024).