Scalable Interpolant Transformers (SiT)
- The paper introduces SiT, a family of flow-based Transformers that frames generative modeling through stochastic interpolants connecting noise and data.
- It retains the DiT architecture (a Vision Transformer backbone with a pre-trained VAE tokenizer), isolating the generative mechanism for controlled comparison.
- Empirical results demonstrate that SiT achieves better FID than DiT at every model scale under matched compute and data pipelines.
Flow-based Transformers, specifically Scalable Interpolant Transformers (SiTs), are a new family of generative models that unify probability-flow ODEs and diffusion-based SDEs within a flexible, interpolant-driven framework. These models utilize a Vision Transformer backbone and are designed to provide modular, performant alternatives to traditional diffusion models, offering superior sample quality by optimizing both the interpolation scheme and the loss objective (Ma et al., 2024).
1. Foundation and Theoretical Framework
SiTs are constructed around the interpolant paradigm for generative modeling. They connect two endpoint distributions (typically the target data distribution and an auxiliary base, e.g., a standard normal) via a parameterizable stochastic interpolant
$$x_t = \alpha_t\, x_* + \sigma_t\, \varepsilon, \qquad t \in [0, 1],$$
where $x_*$ is a data sample, $\varepsilon \sim \mathcal{N}(0, I)$, and $\alpha_t$, $\sigma_t$ are differentiable schedules satisfying $\alpha_0 = \sigma_1 = 1$ and $\alpha_1 = \sigma_0 = 0$. The interpolant determines the trajectory along which the model transports noise to data, providing a unifying family that subsumes standard DDPMs, score-based models, and rectified flows.
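As a concrete sketch, two common schedule choices can be implemented directly (the linear and GVP expressions below are the standard forms, assumed here rather than quoted from the paper):

```python
import numpy as np

def interpolant(t, kind="linear"):
    """Return (alpha_t, sigma_t) for two common interpolant schedules.

    linear: alpha_t = 1 - t,        sigma_t = t
    gvp:    alpha_t = cos(pi*t/2),  sigma_t = sin(pi*t/2)  (variance preserving)
    """
    if kind == "linear":
        return 1.0 - t, t
    if kind == "gvp":
        return np.cos(0.5 * np.pi * t), np.sin(0.5 * np.pi * t)
    raise ValueError(kind)

def sample_xt(x_star, eps, t, kind="linear"):
    # Draw a point on the interpolant path: x_t = alpha_t * x_star + sigma_t * eps
    alpha, sigma = interpolant(t, kind)
    return alpha * x_star + sigma * eps
```

Both schedules satisfy the endpoint conditions above; the GVP form additionally keeps $\alpha_t^2 + \sigma_t^2 = 1$ along the whole path.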
The induced marginal law $p_t$ is governed by a probability-flow ODE,
$$\dot{x}_t = v(x_t, t), \qquad v(x, t) = \mathbb{E}\!\left[\dot\alpha_t\, x_* + \dot\sigma_t\, \varepsilon \mid x_t = x\right],$$
whose solutions transport the data distribution $p_0$ to the base $p_1$. For generative sampling, this ODE is solved backward (from $t = 1$ to $t = 0$) to recover samples from the data distribution.
2. Architecture and Model Parameterization
SiT strictly retains the architectural setting of DiT for controlled comparisons. It uses a class-conditional Vision Transformer (ViT) with patch size 2, leveraging a pre-trained VAE as the image tokenizer (encoder/decoder). The temporal index and class label are injected into normalization layers via AdaLN-Zero, ensuring precise conditioning (Ma et al., 2024).
Model scales (S/B/L/XL) match DiT: the number of layers, hidden dimension, attention heads, parameter counts, and forward-pass compute (GFLOPs) are identical. All empirical comparisons fix the VAE and data pipeline, isolating the generative mechanism.
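For reference, the shared DiT/SiT scale configurations can be written as a small lookup table (depth, width, and head counts follow the published DiT configurations; treat them as indicative rather than a transcription of the SiT paper):

```python
# DiT/SiT Transformer scale configurations (per the DiT paper's model table).
CONFIGS = {
    "S":  {"depth": 12, "hidden": 384,  "heads": 6},
    "B":  {"depth": 12, "hidden": 768,  "heads": 12},
    "L":  {"depth": 24, "hidden": 1024, "heads": 16},
    "XL": {"depth": 28, "hidden": 1152, "heads": 16},
}

def head_dim(scale):
    # Per-head dimension; hidden size must divide evenly across heads.
    cfg = CONFIGS[scale]
    assert cfg["hidden"] % cfg["heads"] == 0
    return cfg["hidden"] // cfg["heads"]
```

Because SiT reuses these settings verbatim, any FID difference against DiT is attributable to the generative formulation rather than capacity.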
3. Objectives and Modular Losses
The interpolant formulation admits both score-matching and velocity-matching training objectives.
- Score matching:
  $$\mathcal{L}_{\mathrm{score}}(\theta) = \int_0^1 \mathbb{E}_{x_*,\,\varepsilon}\!\left[\lambda(t)\,\big\|\sigma_t\, s_\theta(x_t, t) + \varepsilon\big\|^2\right] dt,$$
  where $s_\theta$ is a neural network estimate of the score $\nabla \log p_t(x)$, $\varepsilon$ is the standard Gaussian noise sampled in the forward process, and $\lambda(t)$ is an optional time weighting ($\lambda \equiv 1$ gives the unweighted objective).
- Velocity matching:
  $$\mathcal{L}_v(\theta) = \int_0^1 \mathbb{E}_{x_*,\,\varepsilon}\!\left[\big\|v_\theta(x_t, t) - \dot\alpha_t\, x_* - \dot\sigma_t\, \varepsilon\big\|^2\right] dt,$$
  where $v_\theta$ regresses the conditional expectation $\mathbb{E}[\dot\alpha_t\, x_* + \dot\sigma_t\, \varepsilon \mid x_t]$.
The two parameterizations are mathematically connected. Velocity and score predictions can be deterministically interconverted for a fixed interpolant, and the choice can be decoupled from the overall sampling dynamics.
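The conversion can be made explicit with a short derivation from the interpolant identities (ours, not quoted from the paper): using $\nabla \log p_t(x) = -\mathbb{E}[\varepsilon \mid x_t = x]/\sigma_t$ and solving $x = \alpha_t\,\mathbb{E}[x_* \mid x] + \sigma_t\,\mathbb{E}[\varepsilon \mid x]$ for $\mathbb{E}[x_* \mid x]$, the velocity $v = \dot\alpha_t\,\mathbb{E}[x_* \mid x] + \dot\sigma_t\,\mathbb{E}[\varepsilon \mid x]$ becomes

```latex
v(x, t) = \frac{\dot\alpha_t}{\alpha_t}\, x
        + \sigma_t \left( \frac{\dot\alpha_t}{\alpha_t}\,\sigma_t - \dot\sigma_t \right) s(x, t),
```

so for a fixed interpolant either network output determines the other by an affine map.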
Weighted-score and hybrid objectives are also explored (with weightings derived by differentiating the interpolant coefficients), providing additional flexibility (Ma et al., 2024).
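A minimal sketch of the velocity-matching objective under the linear interpolant (the predictor passed in is a placeholder, not the SiT network; with $\alpha_t = 1 - t$, $\sigma_t = t$ the regression target is $\varepsilon - x_*$):

```python
import numpy as np

def velocity_matching_loss(v_pred_fn, x_star, rng):
    """Monte Carlo estimate of the velocity-matching loss under the linear
    interpolant (alpha_dot = -1, sigma_dot = 1, target = eps - x_star)."""
    t = rng.uniform(size=(x_star.shape[0], 1))
    eps = rng.standard_normal(x_star.shape)
    x_t = (1.0 - t) * x_star + t * eps
    target = eps - x_star
    return np.mean((v_pred_fn(x_t, t) - target) ** 2)

rng = np.random.default_rng(0)
x_star = rng.standard_normal((5000, 4))
# Sanity check with a trivial zero predictor: per-element loss is
# E[(eps - x_star)^2] = 2 for standard-normal data.
loss = velocity_matching_loss(lambda x, t: np.zeros_like(x), x_star, rng)
```

Swapping in a different interpolant only changes how `x_t` and `target` are formed; the regression structure is unchanged.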
4. Sampling Dynamics and Diffusion Coefficients
Generative sampling can proceed via the deterministic ODE (Heun method) or the stochastic reverse-time SDE (Euler-Maruyama):
- Heun (ODE):

```
x_N ~ N(0, I)                          # start from noise at t_N = 1
for i = N, ..., 1:
    v_i = v_θ(x_i, t_i)
    x̄ = x_i + Δt · v_i                 # Euler predictor (Δt = t_{i-1} − t_i < 0)
    v̄ = v_θ(x̄, t_{i-1})
    x_{i-1} = x_i + (Δt/2) · (v_i + v̄)  # trapezoidal corrector
return x_0
```
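The Heun loop can be checked end to end on a toy problem where the velocity field is known in closed form: under the linear interpolant with a Gaussian target $\mathcal{N}(\mu, I)$, the probability-flow ODE should carry $\mathcal{N}(0, I)$ samples to $\mathcal{N}(\mu, I)$. The helpers below are our own illustration, not the SiT implementation:

```python
import numpy as np

def velocity(x, t, mu):
    """Closed-form PF-ODE velocity for the linear interpolant
    x_t = (1 - t) * x_star + t * eps with x_star ~ N(mu, I), eps ~ N(0, I)."""
    a, b = 1.0 - t, t
    return -mu + (b - a) * (x - a * mu) / (a**2 + b**2)

def heun_sample(n_samples, dim, mu, n_steps=100, seed=0):
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((n_samples, dim))   # x_N ~ N(0, I) at t = 1
    ts = np.linspace(1.0, 0.0, n_steps + 1)
    for i in range(n_steps):
        t, t_next = ts[i], ts[i + 1]
        dt = t_next - t                         # negative: noise -> data
        v = velocity(x, t, mu)
        x_pred = x + dt * v                     # Euler predictor
        v_pred = velocity(x_pred, t_next, mu)
        x = x + 0.5 * dt * (v + v_pred)         # trapezoidal corrector
    return x

mu = 3.0
samples = heun_sample(4000, 2, mu)              # should be ~ N(mu, I)
```

In a real SiT sampler, `velocity` would be the trained network; everything else is unchanged.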
- Euler-Maruyama (SDE):

```
x_N ~ N(0, I)                          # start from noise at t_N = 1
for i = N, ..., 2:
    s_i = s_θ(x_i, t_i)
    d_i = v_θ(x_i, t_i) − (w_i/2) · s_i    # reverse-time drift (Δt < 0)
    x_{i-1} = x_i + Δt · d_i + sqrt(w_i · |Δt|) · ξ,  ξ ~ N(0, I)
x_0 = x_1 + Δt · v_θ(x_1, t_1)         # final deterministic ODE step (no noise)
return x_0
```
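The stochastic sampler admits the same toy check: with a Gaussian target the score is also available in closed form, so the Euler-Maruyama loop can be run without any learned network (helper names are hypothetical, and the linear interpolant is assumed):

```python
import numpy as np

def velocity(x, t, mu):
    # Closed-form PF-ODE velocity for the linear interpolant, target N(mu, I).
    a, b = 1.0 - t, t
    return -mu + (b - a) * (x - a * mu) / (a**2 + b**2)

def score(x, t, mu):
    # grad log of the marginal N(a * mu, (a^2 + b^2) I) at time t
    a, b = 1.0 - t, t
    return -(x - a * mu) / (a**2 + b**2)

def em_sample(n_samples, dim, mu, n_steps=200, seed=0):
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((n_samples, dim))   # t = 1: pure noise
    ts = np.linspace(1.0, 0.0, n_steps + 1)
    for i in range(n_steps):
        t, t_next = ts[i], ts[i + 1]
        dt = t_next - t                         # negative step
        if i < n_steps - 1:
            w = t                               # w_t = t vanishes at the data end
            drift = velocity(x, t, mu) - 0.5 * w * score(x, t, mu)
            x = x + dt * drift + np.sqrt(w * abs(dt)) * rng.standard_normal(x.shape)
        else:
            x = x + dt * velocity(x, t, mu)     # final deterministic ODE step
    return x

mu = 3.0
samples = em_sample(4000, 2, mu)                # should be ~ N(mu, I)
```

Any schedule `w` can be dropped in without touching the velocity or score, which is exactly the decoupling the next subsection describes.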
The diffusion coefficient $w_t$ governs the noise level injected by the reverse-time SDE. Unlike standard SBDMs, SiT can set $w_t$ independently of the forward noising process:
- $w_t$ inherited from the forward process (the standard SBDM choice)
- $w_t \propto \sigma_t$, which damps the $1/\sigma_t$ growth of the score and eliminates the singularity at $t = 0$
- smooth schedules in $t$ (e.g., sinusoidal or polynomial bumps)
The best $w_t$ is selected post-training for optimal FID; the framework thus decouples training and sampling, enabling extensive ablation (Ma et al., 2024).
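Because $w_t$ enters only the sampler, candidate schedules can be ordinary functions swapped at inference time. The specific forms below are illustrative choices, not the paper's exact list:

```python
import numpy as np

# Illustrative diffusion-coefficient schedules (t = 0 is data, t = 1 is noise).
# Each is a plain function of t handed to the SDE sampler, so switching
# schedules requires no retraining.
SCHEDULES = {
    "sigma":  lambda t: t,                       # track sigma_t = t (linear interpolant)
    "vanish": lambda t: t * (1.0 - t),           # -> 0 at both endpoints
    "sin2":   lambda t: np.sin(np.pi * t) ** 2,  # smooth bump, 0 at t = 0 and t = 1
}
```

Post-training selection then amounts to sampling with each candidate and keeping whichever minimizes FID.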
5. Empirical Results and Scaling Laws
SiT exhibits superior generative modeling across all tested scales compared to DiT, holding constant the non-generative components.
ImageNet 256×256 FID-50K (400K training steps, no classifier-free guidance):
| Model | Params | GFLOPs | DiT FID | SiT FID |
|---|---|---|---|---|
| S | 33M | 6.1 | 68.4 | 57.6 |
| B | 130M | 23.0 | 43.5 | 33.5 |
| L | 458M | 80.7 | 23.3 | 18.8 |
| XL | 675M | 118.6 | 19.5 | 17.2 |
Extended training of SiT-XL (7M steps) with CFG=1.5 achieves FID 2.06, surpassing DiT-XL at FID 2.27 under identical protocol (Ma et al., 2024).
Ablations show:
- Score vs velocity parameterization: weighted-score or velocity-matching outperform unweighted score matching.
- Choice of interpolant: linear and GVP interpolants yield better learning efficiency and lower final FID than VP.
- Sampler choice: the SDE attains lower final FID; the ODE converges faster with few function evaluations.
- Guidance: classifier-free guidance extends seamlessly to velocity-prediction flows.
The FID improvement from DiT to SiT holds at every model scale; the absolute gap persists rather than shrinking as models grow.
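Guidance on a velocity field follows the usual classifier-free recipe: a one-line combination of the conditional and unconditional predictions, sketched here in isolation:

```python
import numpy as np

def cfg_velocity(v_cond, v_uncond, scale):
    # Classifier-free guidance on velocity predictions: move past the
    # conditional estimate by the guidance scale (scale = 1 recovers v_cond).
    return v_uncond + scale * (v_cond - v_uncond)

guided = cfg_velocity(np.array([2.0]), np.array([1.0]), 1.5)
```

At sampling time the guided velocity simply replaces `v_θ` inside the ODE or SDE loop.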
6. Modular Analyses and Generalization
The SiT framework is inherently modular:
- Time discretization and the interpolant family can be systematically ablated.
- Score/velocity prediction choice is a plug-in parameter.
- Training schedule, patch tokenization, and architectural hyperparameters are fully transportable from DiT.
- The decoupling of the diffusion coefficient $w_t$ from training enables post-hoc tuning without retraining.
The interpolant ODE formalism unifies prior flow-based, diffusion, and rectified models. Methodologically, this permits the disentanglement of fundamental design choices, enabling robust analysis of each factor's impact on convergence and sample quality.
7. Significance and Implications
SiT demonstrates that with its unified interpolant transport and flow-matching training, a single Transformer architecture can combine the strengths of both diffusion and flow models:
- Improved FID at fixed compute and parameter budget.
- Flexible and modular for ablation studies.
- Decoupled design optimizations (interpolant, sampler, diffusion schedule, etc.).
A plausible implication is that future large-scale multimodal generative models may benefit from interpolant-based flow frameworks, as these approaches eliminate the historical constraints of tied training and forward noise schedules while offering continuous improvements with scale (Ma et al., 2024).