Variance-Stabilized Velocity Matching
- The paper introduces a variance-stabilized velocity matching objective that analytically normalizes target magnitudes to mitigate gradient instabilities.
- It employs a conditional normalization factor based on endpoint statistics to balance training signals across all timesteps.
- Empirical evaluations demonstrate smoother convergence and improved metrics like SSIM and PSNR in large-scale Bridge Models.
A variance-stabilized velocity-matching objective is a loss formulation used in training conditional generative models—particularly large-scale Bridge Models such as the Vision Bridge Transformer (ViBT)—that addresses severe gradient instabilities present in standard velocity-matching approaches. By analytically normalizing the velocity targets according to their conditional variance, this objective ensures well-conditioned gradients and balanced training signal across the entire time interval, which is critical for robust and scalable learning in data-to-data translation and instruction-based image/video editing tasks (Tan et al., 28 Nov 2025).
1. Foundation: Brownian Bridge Models and Velocity Matching
Conditional Bridge Models describe a stochastic process $\{x_t\}$ defined on $t \in [0, 1]$ via an SDE:
$$dx_t = v(x_t, t)\,dt + \sigma\,dW_t,$$
with initial ($x_0$, the source) and terminal ($x_1$, the target) endpoint constraints. In velocity matching, a neural parameterization $v_\theta(x_t, t)$ is trained to approximate a "teacher" instantaneous velocity $v_t$, the drift of the bridge conditioned on its endpoints.
For the Brownian bridge case (constant diffusion coefficient $\sigma$), synthesis of intermediate states proceeds via:
$$x_t = (1-t)\,x_0 + t\,x_1 + \sigma\sqrt{t(1-t)}\,\epsilon, \qquad \epsilon \sim \mathcal{N}(0, I).$$
The canonical target velocity is:
$$v_t = \frac{x_1 - x_t}{1 - t}.$$
The naive velocity-matching loss,
$$\mathcal{L}_{\mathrm{vel}} = \mathbb{E}_{x_0, x_1, t, \epsilon}\!\left[\left\| v_\theta(x_t, t) - v_t \right\|^2\right],$$
anchors the learning process.
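For concreteness, the following is a minimal NumPy sketch of the bridge interpolation and the raw velocity target written above; the function names and the default $\sigma = 1$ are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def sample_bridge_state(x0, x1, t, sigma=1.0, rng=np.random):
    """Sample x_t from the Brownian bridge between endpoints x0 and x1."""
    eps = rng.standard_normal(x0.shape)
    xt = (1.0 - t) * x0 + t * x1 + sigma * np.sqrt(t * (1.0 - t)) * eps
    return xt, eps

def raw_velocity_target(x1, xt, t):
    """Canonical (unnormalized) velocity target v_t = (x1 - x_t) / (1 - t)."""
    return (x1 - xt) / (1.0 - t)
```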
2. Instabilities in Standard Objectives
The unnormalized velocity target diverges as $t \to 1$: substituting the bridge interpolation into $v_t$ gives
$$v_t = (x_1 - x_0) - \sigma\sqrt{\frac{t}{1-t}}\,\epsilon,$$
whose noise term grows without bound, producing gradient explosions at late timesteps. This results in numerical instability, loss contributions dominated by timesteps near $t = 1$, and severe undertraining of the model elsewhere. Displacement-based alternatives, which regress the target
$$d_t = x_1 - x_t = (1-t)\,v_t,$$
suffer from vanishing targets as $t \to 1$, over-weighting early timesteps and yielding the converse imbalance (Tan et al., 28 Nov 2025).
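To make the imbalance concrete, the sketch below evaluates the closed-form per-dimension RMS of both targets across $t$; the dimensionality $d$, noise scale $\sigma$, and the assumption $\|x_1 - x_0\|^2 \approx d$ are illustrative values, not numbers from the paper.

```python
import numpy as np

# Illustrative check of the two pathologies using the closed-form second moments.
d, sigma = 1024, 1.0
delta_sq = float(d)   # assume ||x1 - x0||^2 ~ d for unit-scale data (illustrative)

for t in [0.1, 0.5, 0.9, 0.99, 0.999]:
    vel_rms = np.sqrt(delta_sq / d + sigma**2 * t / (1 - t))                 # raw velocity target
    disp_rms = np.sqrt((1 - t)**2 * delta_sq / d + sigma**2 * t * (1 - t))   # displacement x1 - x_t
    print(f"t={t:5.3f}  velocity RMS={vel_rms:8.2f}  displacement RMS={disp_rms:6.3f}")
```

The raw-velocity RMS grows without bound as $t \to 1$, while the displacement RMS collapses toward zero, reproducing the two pathologies described above.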
3. Derivation of the Variance-Stabilized Velocity-Matching Objective
To correct these pathologies, the stabilized objective normalizes the target $v_t$ by its conditional root-mean-square magnitude:
- Compute the conditional second moment:
$$\mathbb{E}\!\left[\|v_t\|^2 \mid x_0, x_1\right] = \|x_1 - x_0\|^2 + \sigma^2 d\,\frac{t}{1-t},$$
where $d$ is the data dimensionality.
- Define the time- and endpoint-dependent normalization factor:
$$\lambda_t(x_0, x_1) = \sqrt{\tfrac{1}{d}\,\mathbb{E}\!\left[\|v_t\|^2 \mid x_0, x_1\right]} = \sqrt{\frac{\|x_1 - x_0\|^2}{d} + \sigma^2\,\frac{t}{1-t}}.$$
- The stabilized velocity target:
$$\tilde{v}_t = \frac{v_t}{\lambda_t(x_0, x_1)}.$$
- The normalized model output:
$$\tilde{v}_\theta(x_t, t) = \frac{v_\theta(x_t, t)}{\lambda_t(x_0, x_1)}.$$
- The variance-stabilized loss:
$$\mathcal{L}_{\mathrm{stab}} = \mathbb{E}_{x_0, x_1, t, \epsilon}\!\left[\left\|\tilde{v}_\theta(x_t, t) - \tilde{v}_t\right\|^2\right].$$
In all cases, the model continues to predict unnormalized velocities; normalization is applied only inside the loss function for training (Tan et al., 28 Nov 2025).
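A minimal PyTorch sketch of the loss under the reconstruction above follows; the batch layout, the helper name, and the reduction (mean over the batch of per-sample squared norms) are assumptions rather than details from the paper.

```python
import torch

def variance_stabilized_velocity_loss(v_pred, x0, x1, x_t, t, sigma):
    """Variance-stabilized velocity-matching loss (sketch).

    v_pred : model output v_theta(x_t, t), shape (B, d)
    x0, x1 : endpoint pairs, shape (B, d)
    t      : timesteps in (0, 1), shape (B, 1)
    """
    d = x0.shape[-1]
    v_target = (x1 - x_t) / (1.0 - t)                     # raw velocity target
    # Conditional per-dimension RMS of the target: lambda_t(x0, x1).
    lam = torch.sqrt(
        (x1 - x0).pow(2).sum(dim=-1, keepdim=True) / d
        + sigma**2 * t / (1.0 - t)
    )
    v_target_tilde = v_target / lam                       # stabilized target
    v_pred_tilde = v_pred / lam                           # normalized model output
    return (v_pred_tilde - v_target_tilde).pow(2).sum(dim=-1).mean()
```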
4. Analysis: Advantages of Variance Stabilization
Variance normalization offers multiple concrete benefits:
- Uniform gradient magnitudes: Gradient norms are stabilized across $t \in [0, 1]$, precluding numerical blowup as $t \to 1$.
- Equalized training signal: The expected squared norm of the stabilized target, $\mathbb{E}\!\left[\|\tilde{v}_t\|^2\right]$, is nearly flat in $t$, ensuring balanced coverage of early, mid, and late timesteps (see the identity after this list).
- Scalability: In deep Transformer models, the elimination of large-magnitude targets prevents excessive gradient clipping and optimizer pathologies, supporting robust training at the 1.3–20B parameter scale (Tan et al., 28 Nov 2025).
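The equalized-signal claim can be checked directly from the definition of $\lambda_t$ in the reconstruction above: dividing by the conditional RMS makes the conditional second moment of the stabilized target exactly constant,
$$\mathbb{E}\!\left[\|\tilde{v}_t\|^2 \mid x_0, x_1\right] = \frac{\mathbb{E}\!\left[\|v_t\|^2 \mid x_0, x_1\right]}{\lambda_t^2(x_0, x_1)} = d \qquad \text{for all } t \in (0, 1).$$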
5. Implementation Details
The stabilized velocity objective is implemented with the following procedure:
- Sample endpoint pairs $(x_0, x_1)$ from the paired training data.
- Draw $t \sim \mathcal{U}(0, 1)$ and $\epsilon \sim \mathcal{N}(0, I)$.
- Generate $x_t = (1-t)\,x_0 + t\,x_1 + \sigma\sqrt{t(1-t)}\,\epsilon$.
- Compute the raw velocity target, $v_t = (x_1 - x_t)/(1 - t)$.
- Compute the normalization factor $\lambda_t(x_0, x_1)$.
- Compute $\tilde{v}_t = v_t / \lambda_t$ and $\tilde{v}_\theta = v_\theta(x_t, t) / \lambda_t$.
- Formulate the per-sample MSE loss, $\|\tilde{v}_\theta - \tilde{v}_t\|^2$.
- Update $\theta$ using the batch mean of the above loss (a minimal sketch of this loop follows below).
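The following PyTorch sketch mirrors the steps above in a single update; the model interface `velocity_model(x_t, t)`, batch shapes, and optimizer handling are illustrative assumptions, not details from the paper.

```python
import torch

def training_step(velocity_model, optimizer, x0, x1, sigma=1.0):
    """One variance-stabilized velocity-matching update (sketch).

    x0, x1 : endpoint pairs of shape (B, d); velocity_model maps (x_t, t) -> v_theta.
    """
    B, d = x0.shape
    t = torch.rand(B, 1, device=x0.device)                     # t ~ U(0, 1)
    eps = torch.randn_like(x0)                                  # eps ~ N(0, I)
    x_t = (1 - t) * x0 + t * x1 + sigma * torch.sqrt(t * (1 - t)) * eps
    v_target = (x1 - x_t) / (1 - t)                             # raw velocity target
    lam = torch.sqrt((x1 - x0).pow(2).sum(-1, keepdim=True) / d
                     + sigma**2 * t / (1 - t))                  # normalization factor
    v_pred = velocity_model(x_t, t)                             # unnormalized prediction
    loss = ((v_pred / lam - v_target / lam) ** 2).sum(-1).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```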
At inference, synthesis proceeds by Euler–Maruyama integration of the learned drift $v_\theta$, with a variance-corrected noise scale applied at each discretization step.
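As one concrete but assumed instantiation, the sampler below integrates the learned drift with Euler–Maruyama and scales the injected noise by the Brownian bridge's conditional transition variance, $\sigma^2\,\Delta t\,(1 - t - \Delta t)/(1 - t)$; the paper's exact variance correction may differ.

```python
import torch

@torch.no_grad()
def sample_bridge(velocity_model, x0, sigma=1.0, n_steps=50):
    """Euler-Maruyama integration of the learned bridge drift (sketch).

    The noise scale below uses the Brownian bridge's conditional transition
    variance, sigma^2 * dt * (1 - t - dt) / (1 - t); this is an assumed form
    of the variance correction, not necessarily the paper's exact expression.
    """
    x = x0.clone()
    dt = 1.0 / n_steps
    for k in range(n_steps):
        t = k * dt
        t_tensor = torch.full((x.shape[0], 1), t, device=x.device)
        drift = velocity_model(x, t_tensor)                     # learned v_theta(x_t, t)
        noise_std = sigma * ((dt * max(1.0 - t - dt, 0.0) / (1.0 - t)) ** 0.5)
        x = x + drift * dt + noise_std * torch.randn_like(x)
    return x
```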
6. Empirical Evaluation and Ablations
Empirical results demonstrate clear superiority over unnormalized alternatives:
- Metrics: The stabilized objective outperforms displacement and raw velocity objectives on SSIM, PSNR, NIQE, CLIP Score, and VBench for image editing and depth-to-video tasks (Table 7).
- Smoother convergence: Training curves exhibit markedly reduced loss variance and improved stability (Fig. 7a).
- Balanced loss contributions: The loss profile is flat across $t$ for the stabilized objective, preventing endpoint dominance (Fig. 2).
- Objective ablation: Stabilized velocity achieves the highest average image-edit score ($3.55$) and VBench score ($0.709$), compared with displacement ($3.50$, $0.695$) and raw velocity ($3.36$, $0.698$) objectives.
- Robustness to the global noise scale is observed when stabilization is combined with noise-scale tuning (Table 8) (Tan et al., 28 Nov 2025).
7. Comparison to Related Objectives
| Objective | Pathology | Loss dominance |
|---|---|---|
| Raw velocity | Exploding as $t \to 1$ | Late timesteps (near $t = 1$) |
| Displacement | Vanishing as $t \to 1$ | Early timesteps |
| Variance-stabilized velocity | None (by construction) | Uniform in $t$ |
| Denoising score matching | Vanishing/exploding at ends | Time-dependent, needs reweighting |
Variance stabilization directly addresses the pathologies of both displacement and raw velocity approaches. Score-matching objectives in diffusion (notably DSM) similarly suffer from time-dependent magnitude issues unless properly reweighted, which the stabilized velocity-matching loss resolves analytically (Tan et al., 28 Nov 2025).
8. Broader Context and Connections
Variance stabilization in velocity-matching addresses a subclass of variance pathologies that also arise in broader score-matching and minimum-velocity learning settings (e.g., DSM, CD-1, Wasserstein minimum velocity) (Wang et al., 2020). The analytic normalization parallels the control variate strategies used in Wasserstein minimum velocity estimation, underscoring the general importance of variance control for stable and scalable training in generative models. A plausible implication is that analogous normalization can benefit related objectives where target magnitudes are analytically tractable and amenable to similar stabilizing transformations.
The variance-stabilized velocity-matching objective is essential for scaling conditional generative models with trajectory-based training to billion-parameter regimes, serving as a principled correction to gradient imbalances endemic in prior, unnormalized velocity-matching and displacement-matching approaches (Tan et al., 28 Nov 2025).