
Semantic Disentangled VAE (Send-VAE)

Updated 15 January 2026
  • The paper introduces Send-VAE, which robustly disentangles semantic features from latent variables to improve generative modeling.
  • It leverages a modular architecture integrated with a Vision Transformer backbone and pre-trained VAE components for precise feature extraction.
  • Empirical results demonstrate that Send-VAE achieves superior image quality and refined feature control compared to traditional VAE models.

Flow-based Transformers, specifically Scalable Interpolant Transformers (SiTs), are a new family of generative models that unify probability-flow ODEs and diffusion-based SDEs within a flexible, interpolant-driven framework. These models utilize a Vision Transformer backbone and are designed to provide modular, performant alternatives to traditional diffusion models, offering superior sample quality by optimizing both the interpolation scheme and the loss objective (Ma et al., 2024).

1. Foundation and Theoretical Framework

SiTs are constructed around the interpolant paradigm for generative modeling. They connect two endpoint distributions—typically the target data distribution $p_\text{data}$ and an auxiliary base (e.g., standard normal)—via a parameterizable stochastic interpolant

$$x_t = \alpha_t x_0 + \sigma_t x_1, \quad t \in [0, 1],$$

where $x_0 \sim p_\text{data}$, $x_1 \sim \mathcal{N}(0, I)$, and $(\alpha_t, \sigma_t)$ are differentiable schedules satisfying $\alpha_0 = 1$, $\sigma_0 = 0$ and $\alpha_1 = 0$, $\sigma_1 = 1$. The interpolant determines the trajectory along which the model transports noise to data, providing a unifying family that subsumes standard DDPMs, score-based models, and rectified flows.
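
The following minimal PyTorch sketch shows how such an interpolant can be evaluated. The linear and GVP schedules follow the common choices discussed later in this article; the function names are illustrative and not taken from the SiT codebase.

import math
import torch

def interpolant(name):
    # return (alpha, sigma) schedules satisfying alpha(0)=1, sigma(0)=0
    # and alpha(1)=0, sigma(1)=1, as required of a stochastic interpolant
    if name == "linear":
        return (lambda t: 1 - t), (lambda t: t)
    if name == "gvp":  # generalized VP: alpha_t^2 + sigma_t^2 = 1 for all t
        return (lambda t: torch.cos(0.5 * math.pi * t),
                lambda t: torch.sin(0.5 * math.pi * t))
    raise ValueError(f"unknown interpolant: {name}")

def sample_xt(x0, t, name="linear"):
    # draw x_t = alpha_t x0 + sigma_t x1 with fresh Gaussian noise x1;
    # t is a tensor broadcastable against x0 (e.g., shape (B, 1, 1, 1))
    alpha, sigma = interpolant(name)
    x1 = torch.randn_like(x0)
    return alpha(t) * x0 + sigma(t) * x1, x1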

The induced marginal law $p_t(x)$ is governed by a probability-flow ODE

$$\mathrm{d}x_t = v(x_t, t)\,\mathrm{d}t, \quad v(x, t) = \dot{\alpha}_t\,\mathbb{E}[x_0 \mid x_t = x] + \dot{\sigma}_t\,\mathbb{E}[x_1 \mid x_t = x],$$

where the velocity field is the conditional expectation $v(x, t) = \mathbb{E}[\dot{x}_t \mid x_t = x]$ of the interpolant's time derivative. For generative sampling, this ODE is solved backward in time, starting from $x_1 \sim \mathcal{N}(0, I)$ at $t = 1$, to recover samples from $p_\text{data}$.

2. Architecture and Model Parameterization

SiT retains the DiT architecture exactly, enabling controlled comparisons. It uses a class-conditional Vision Transformer (ViT) with patch size 2, leveraging a pre-trained VAE as the image tokenizer (encoder/decoder). The time index $t$ and class label are injected into normalization layers via AdaLN-Zero conditioning (Ma et al., 2024).
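
As an illustration of the conditioning mechanism, here is a minimal PyTorch sketch of an AdaLN-Zero block; `sublayer` stands in for the attention or MLP branch, and the exact module layout is an assumption in the spirit of the DiT design rather than the reference implementation.

import torch
import torch.nn as nn

class AdaLNZero(nn.Module):
    # a conditioning vector c (timestep + class embedding) regresses
    # per-channel shift/scale/gate; the zero-initialized projection makes
    # every residual branch start as the identity at initialization
    def __init__(self, dim):
        super().__init__()
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)
        self.proj = nn.Linear(dim, 3 * dim)
        nn.init.zeros_(self.proj.weight)
        nn.init.zeros_(self.proj.bias)

    def forward(self, x, c, sublayer):
        # x: (B, N, D) token sequence; c: (B, D) conditioning embedding
        mod = self.proj(nn.functional.silu(c)).unsqueeze(1)  # (B, 1, 3D)
        shift, scale, gate = mod.chunk(3, dim=-1)
        return x + gate * sublayer(self.norm(x) * (1 + scale) + shift)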

Each model scale (S/B/L/XL) matches its DiT counterpart: number of layers, hidden dimension, attention heads, total parameters, and compute cost (GFLOPs) are identical. All empirical comparisons fix the VAE and data pipeline, isolating the generative mechanism as the only changed component.

3. Objectives and Modular Losses

The interpolant formulation admits both score-matching and velocity-matching training objectives.

  • Score matching:

$$L_s(\theta) = \int_0^1 \mathbb{E}\left[\left\| \sigma_t\, s_\theta(x_t, t) + \epsilon \right\|^2\right] \mathrm{d}t$$

where $s_\theta$ is a neural network estimate of the score $\nabla_x \log p_t(x)$, and $\epsilon$ is the standard Gaussian noise sampled in the forward process (i.e., $\epsilon = x_1$ in the interpolant).

  • Velocity matching:

$$L_v(\theta) = \int_0^1 \mathbb{E}\left[\left\| v_\theta(x_t, t) - \left(\dot{\alpha}_t x_0 + \dot{\sigma}_t x_1\right) \right\|^2\right] \mathrm{d}t$$

The two parameterizations are mathematically connected. Velocity and score predictions can be deterministically interconverted for a fixed interpolant, and the choice can be decoupled from the overall sampling dynamics.
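
Concretely, using the identities $s(x, t) = -\mathbb{E}[x_1 \mid x_t = x]/\sigma_t$ and $x = \alpha_t\,\mathbb{E}[x_0 \mid x_t = x] + \sigma_t\,\mathbb{E}[x_1 \mid x_t = x]$, the conditional expectations can be eliminated from the definition of $v$; a short derivation consistent with the formulas above gives

$$v(x, t) = \frac{\dot{\alpha}_t}{\alpha_t}\, x + \sigma_t \left( \frac{\dot{\alpha}_t\, \sigma_t}{\alpha_t} - \dot{\sigma}_t \right) s(x, t),$$

so a model trained to predict either quantity can serve both the ODE and SDE samplers described below.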

Weighted-score and hybrid objectives are also explored (with weighting $\lambda_t$ derived by differentiating the interpolant coefficients), providing additional flexibility (Ma et al., 2024).
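
The following minimal PyTorch sketch shows one Monte Carlo draw of each objective; `s_theta` and `v_theta` are assumed to be callables taking $(x_t, t)$, images are assumed to be NCHW tensors, and none of these names come from the reference implementation.

import torch

def score_loss(s_theta, x0, alpha, sigma):
    # one Monte Carlo draw of L_s: with x_t = alpha(t) x0 + sigma(t) eps,
    # regress sigma_t * s_theta(x_t, t) onto -eps
    t = torch.rand(x0.shape[0], 1, 1, 1)      # t ~ U[0, 1], NCHW broadcast
    eps = torch.randn_like(x0)
    xt = alpha(t) * x0 + sigma(t) * eps
    return ((sigma(t) * s_theta(xt, t) + eps) ** 2).mean()

def velocity_loss(v_theta, x0, alpha, sigma, alpha_dot, sigma_dot):
    # one Monte Carlo draw of L_v: regress v_theta(x_t, t) onto the
    # interpolant's time derivative alpha'(t) x0 + sigma'(t) x1
    t = torch.rand(x0.shape[0], 1, 1, 1)
    x1 = torch.randn_like(x0)
    xt = alpha(t) * x0 + sigma(t) * x1
    target = alpha_dot(t) * x0 + sigma_dot(t) * x1
    return ((v_theta(xt, t) - target) ** 2).mean()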

4. Sampling Dynamics and Diffusion Coefficients

Generative sampling can proceed via the deterministic ODE (Heun method) or the stochastic reverse-time SDE (Euler-Maruyama):

  • Heun (ODE):

import torch

def heun_sample(v_theta, ts, shape):
    # Heun integrator for the probability-flow ODE, run backward in time;
    # ts is an increasing grid t_0 = 0 < ... < t_N = 1, and v_theta(x, t)
    # is the learned velocity field
    x = torch.randn(shape)                   # x_N ~ N(0, I)
    for i in range(len(ts) - 1, 0, -1):
        dt = ts[i - 1] - ts[i]               # negative step (reverse time)
        v = v_theta(x, ts[i])                # predictor slope at t_i
        x_euler = x + dt * v                 # Euler predictor
        v_bar = v_theta(x_euler, ts[i - 1])  # corrector slope at t_{i-1}
        x = x + 0.5 * dt * (v + v_bar)       # trapezoidal (Heun) update
    return x                                 # approximate draw from p_data

  • Euler-Maruyama (SDE):

import math
import torch

def em_sample(v_theta, s_theta, w, ts, shape):
    # Euler-Maruyama integrator for the reverse-time SDE: stepping with
    # dt < 0, the drift is v - (w/2) s, and the final step is a
    # deterministic ODE step so the score is never evaluated at t = 0
    x = torch.randn(shape)                   # x_N ~ N(0, I)
    for i in range(len(ts) - 1, 1, -1):
        dt = ts[i - 1] - ts[i]               # negative step (reverse time)
        drift = v_theta(x, ts[i]) - 0.5 * w(ts[i]) * s_theta(x, ts[i])
        noise = math.sqrt(w(ts[i]) * abs(dt)) * torch.randn(shape)
        x = x + dt * drift + noise
    return x + (ts[0] - ts[1]) * v_theta(x, ts[1])  # final ODE step to x_0

The diffusion coefficient $w_t$ governs the noise level in the SDE. Unlike standard score-based diffusion models (SBDMs), SiT can set $w_t$ independently of the forward noising process:

  • $w_t = \beta_t$ (standard)
  • $w_t = \sigma_t$ (eliminates the singularity as $t \to 0$)
  • smooth schedules such as $w_t = \sin^2(\pi t)$, $(\cos \pi t \pm 1)^2$, etc.
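
In code these choices are just scalar functions of $t$; a minimal sketch, assuming the linear interpolant (for which $\sigma_t = t$) and with names chosen for illustration:

import math

# candidate diffusion coefficients w(t) for the reverse-time SDE
w_schedules = {
    "sigma":     lambda t: t,                           # w_t = sigma_t
    "sin2":      lambda t: math.sin(math.pi * t) ** 2,  # vanishes at both ends
    "cos_plus":  lambda t: (math.cos(math.pi * t) + 1) ** 2,
    "cos_minus": lambda t: (math.cos(math.pi * t) - 1) ** 2,
}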

The best $w_t$ is selected post-training for optimal FID; the framework thus decouples training from sampling, enabling extensive ablation (Ma et al., 2024).
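
Post-training selection then amounts to sweeping the sampler over candidate schedules. In the sketch below, `em_sample` is the Euler-Maruyama routine from above, while `v_theta`, `s_theta`, `ts`, `shape`, and `compute_fid` are hypothetical stand-ins for the trained networks, time grid, sample shape, and an FID evaluator:

# the trained model is fixed; only the sampler's w(t) varies per candidate
results = {}
for name, w in w_schedules.items():
    samples = em_sample(v_theta, s_theta, w, ts, shape)
    results[name] = compute_fid(samples)  # hypothetical FID evaluator
best_w = min(results, key=results.get)    # schedule with the lowest FID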

5. Empirical Results and Scaling Laws

Holding all non-generative components fixed, SiT outperforms DiT in generative quality at every tested scale.

ImageNet 256×256 FID-50K (400K training steps, no classifier-free guidance):

Model   Params   GFLOPs   DiT FID   SiT FID
S       33M      6.1      68.4      57.6
B       130M     23.0     43.5      33.5
L       458M     80.7     23.3      18.8
XL      675M     118.6    19.5      17.2

Extended training of SiT-XL (7M steps) with CFG=1.5 achieves FID 2.06, surpassing DiT-XL at FID 2.27 under identical protocol (Ma et al., 2024).

Ablations show:

  • Score vs. velocity parameterization: weighted-score and velocity-matching objectives outperform unweighted score matching.
  • Choice of interpolant: linear and GVP interpolants yield better learning efficiency and lower final FID than VP.
  • Sampler choice: the SDE attains better final FID than the ODE, while the ODE converges faster at low step counts.
  • Guidance: classifier-free guidance extends seamlessly to velocity-prediction flows.

The improvement from DiT to SiT is consistent across model scales: SiT achieves a lower FID than DiT at every size.

6. Modular Analyses and Generalization

The SiT framework is inherently modular:

  • Time discretization and the interpolant family can be systematically ablated.
  • Score/velocity prediction choice is a plug-in parameter.
  • Training schedule, patch tokenization, and architectural hyperparameters are fully transportable from DiT.
  • The decoupling of $w_t$ enables external tuning without model retraining.

The interpolant ODE formalism unifies prior flow-based, diffusion, and rectified models. Methodologically, this permits the disentanglement of fundamental design choices, enabling robust analysis of each factor's impact on convergence and sample quality.
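
This modularity can be made explicit in a configuration object. The sketch below is purely illustrative (none of these field names come from the SiT codebase) but captures the independent axes of the design space:

from dataclasses import dataclass

@dataclass
class SiTConfig:
    interpolant: str = "linear"   # "linear" | "gvp" | "vp"
    prediction: str = "velocity"  # "velocity" | "score" (interconvertible)
    sampler: str = "sde"          # "ode" (Heun) | "sde" (Euler-Maruyama)
    w_schedule: str = "sigma"     # diffusion coefficient, tunable post-training
    num_steps: int = 250          # sampler discretization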

7. Significance and Implications

SiT demonstrates that with its unified interpolant transport and flow-matching training, a single Transformer architecture can combine the strengths of both diffusion and flow models:

  • Improved FID at fixed compute and parameter budget.
  • Flexible and modular for ablation studies.
  • Decoupled design optimizations (interpolant, sampler, diffusion schedule, etc.).

A plausible implication is that future large-scale multimodal generative models may benefit from interpolant-based flow frameworks, since these approaches remove the historical coupling between the training objective and the forward noise schedule while continuing to improve with scale (Ma et al., 2024).

References

Ma, N., Goldstein, M., Albergo, M. S., Boffi, N. M., Vanden-Eijnden, E., and Xie, S. (2024). SiT: Exploring Flow and Diffusion-based Generative Models with Scalable Interpolant Transformers. arXiv:2401.08740.
