Vision Bridge Transformer (ViBT)

Updated 1 December 2025
  • ViBT is a generative model that directly maps source to target latent representations via a Brownian Bridge stochastic process.
  • It employs a variance-stabilized velocity objective and a variance-corrected sampling scheme to enhance training stability and output quality.
  • Scalable to 20B and 1.3B parameters, ViBT excels in conditional tasks such as image editing, video stylization, and depth-to-video translation.

Vision Bridge Transformer (ViBT) is a large-scale instantiation of Brownian Bridge Models for conditional vision generation tasks. Unlike the traditional “noise-to-vision” paradigm of diffusion models, ViBT models the stochastic trajectory directly between a source latent $x_0$ and a target latent $x_1$, embracing a data-to-data translation approach. This is accomplished through a Brownian Bridge stochastic process, parameterized by a Transformer architecture and governed by a variance-stabilized velocity-matching objective and a variance-corrected sampling scheme. ViBT models are scaled to 20B parameters for image editing and 1.3B parameters for video translation, demonstrating empirical advances in efficiency and quality across multiple conditional vision tasks (Tan et al., 28 Nov 2025).

1. Conceptual Foundations and Motivation

Conventional diffusion models require iteratively denoising samples from a Gaussian prior $x_1 \sim \mathcal{N}(0, I)$ towards the data distribution $p_{\rm data}(x_0)$. Conditioning (e.g., source image, text prompt) is typically injected via auxiliary tokens or cross-attention mechanisms, which is computationally expensive for high-resolution images or extended videos.

ViBT instead directly learns a stochastic mapping from a source to a target latent representation via a Brownian Bridge, fundamentally shifting to a “vision-to-vision” translation paradigm. This approach naturally maintains strong correlations between endpoints, making it inherently suitable for conditioned generation tasks such as instruction-based image editing, style transfer, frame interpolation, and depth-to-video translation. The Brownian Bridge process is conditioned explicitly on $(x_0, x_1)$, enabling direct and efficient modeling of semantic or structural transformations (Tan et al., 28 Nov 2025).

2. Mathematical Formulation of the ViBT Brownian Bridge

The generative process is framed as a stochastic differential equation (SDE):

$$dX_t = v_\theta(X_t, t)\, dt + \sigma(t)\, dW_t, \quad t \in [0,1],$$

with boundary conditions $X_0 \sim p_{\rm source}$ and $X_1 \sim p_{\rm target}$, where $W_t$ is standard Brownian motion, $v_\theta$ is a learned velocity field, and $\sigma(t)$ is a (possibly time-varying) diffusion coefficient. For the Brownian Bridge specialization, $\sigma(t) \equiv 1$.

Given endpoints $(x_0, x_1)$, the marginal at time $t$ is

$$X_t \mid (x_0, x_1) \sim \mathcal{N}\big((1-t)x_0 + t x_1,\; t(1-t) I\big).$$

The drift at time $t$ is

$$u_t(X_t \mid x_0, x_1) = \frac{x_1 - X_t}{1-t},$$

which is parameterized by the transformer network as $v_\theta(X_t, t) \approx u_t(X_t \mid x_0, x_1)$. This construction ensures the model’s denoising dynamics derive directly from the conditional trajectory between data pairs (Tan et al., 28 Nov 2025).
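
The following minimal NumPy sketch (an illustration, not code from the paper) shows how a latent pair $(x_0, x_1)$ yields a bridge sample $x_t$ and its drift target $u_t$; the function and variable names are hypothetical.

```python
import numpy as np

def sample_bridge_state(x0, x1, t, rng):
    """Draw X_t ~ N((1 - t) x0 + t x1, t (1 - t) I), the Brownian Bridge
    marginal pinned at x0 (t = 0) and x1 (t = 1)."""
    mean = (1.0 - t) * x0 + t * x1
    std = np.sqrt(t * (1.0 - t))
    return mean + std * rng.standard_normal(x0.shape)

def bridge_drift(xt, x1, t):
    """Conditional drift u_t(X_t | x0, x1) = (x1 - X_t) / (1 - t), the
    regression target for the velocity network v_theta."""
    return (x1 - xt) / (1.0 - t)

# Illustrative usage with flattened latents of dimension 16.
rng = np.random.default_rng(0)
x0, x1 = rng.standard_normal(16), rng.standard_normal(16)
x_t = sample_bridge_state(x0, x1, t=0.7, rng=rng)
u_t = bridge_drift(x_t, x1, t=0.7)
```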

The discretized sampling scheme corrects for the bridge’s vanishing variance as $t \to 1$:

$$x_{k+1} = x_k + \Delta t_k\, v_\theta(x_k, t_k) + \sqrt{\Delta t_k\, \frac{1-t_{k+1}}{1-t_k}}\; \epsilon_k,$$

matching the bridge covariance exactly, in contrast to naive Euler–Maruyama updates which result in sampling artifacts.
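
As a concrete illustration (a sketch under stated assumptions, not the paper's implementation), the update above can be written as a short sampler loop; `velocity_fn` stands in for the learned network $v_\theta$, and the uniform timestep grid is an illustrative choice.

```python
import numpy as np

def sample_bridge(x0, velocity_fn, num_steps=8, rng=None):
    """Integrate the bridge SDE from the source latent x0 (t = 0) toward the
    target latent (t = 1) with the variance-corrected update above."""
    rng = rng or np.random.default_rng()
    ts = np.linspace(0.0, 1.0, num_steps + 1)  # uniform grid for illustration
    x = x0
    for k in range(num_steps):
        t_k, t_next = ts[k], ts[k + 1]
        dt = t_next - t_k
        drift = velocity_fn(x, t_k)
        # Noise scale shrinks with the remaining bridge variance and is
        # exactly zero on the final step, where t_{k+1} = 1.
        noise_std = np.sqrt(dt * (1.0 - t_next) / (1.0 - t_k))
        x = x + dt * drift + noise_std * rng.standard_normal(x.shape)
    return x
```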

3. Training Objectives and Stabilization Techniques

Directly minimizing the expected squared error between the model drift and the true bridge drift,

$$\mathcal{L}_\text{vel} = \mathbb{E} \left[ \|v_\theta(X_t, t) - u_t(X_t \mid x_0, x_1)\|^2 \right],$$

can induce divergence as $t \to 1$ due to the $(1-t)^{-1}$ scaling of the drift. Matching the displacement (the endpoint difference $x_1 - x_0$) instead avoids this, but underweights late-timestep dynamics. To address this, ViBT introduces a variance stabilization:

$$\tilde{u}_t = \frac{u_t}{\alpha(x_0, x_1, t)}, \quad \alpha^2 = 1 + \frac{t D}{(1-t)\,\|x_1 - x_0\|^2},$$

where $D$ is the latent dimension. The stabilized loss is

$$\mathcal{L}_\text{stab} = \mathbb{E}_{x_0, x_1, t, \epsilon} \left\| \frac{v_\theta(x_t, t) - u_t(x_t \mid x_1)}{\alpha(x_0, x_1, t)} \right\|^2,$$

with the bridge sample $x_t = (1-t)x_0 + t x_1 + \sqrt{t(1-t)}\, \epsilon$.

This objective balances variance across timesteps, ensuring stable training and effective coverage of the conditional interpolation dynamics (Tan et al., 28 Nov 2025).
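
A minimal Monte Carlo sketch of this stabilized objective for one latent pair is given below (illustrative only; `v_theta` is a stand-in for the network, and uniform sampling of $t$ is an assumption, as the timestep distribution is not specified here).

```python
import numpy as np

def stabilized_velocity_loss(v_theta, x0, x1, rng):
    """Single-sample estimate of L_stab for one latent pair (x0, x1).
    v_theta(x, t) stands in for the learned velocity network."""
    D = x0.size                                   # latent dimension
    t = rng.uniform(1e-3, 1.0 - 1e-3)             # keep away from the endpoints
    eps = rng.standard_normal(x0.shape)
    x_t = (1.0 - t) * x0 + t * x1 + np.sqrt(t * (1.0 - t)) * eps
    u_t = (x1 - x_t) / (1.0 - t)                  # true bridge drift
    alpha = np.sqrt(1.0 + t * D / ((1.0 - t) * np.sum((x1 - x0) ** 2)))
    return np.sum(((v_theta(x_t, t) - u_t) / alpha) ** 2)
```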

4. Transformer Architecture and Large-Scale Scaling

ViBT employs a DiT-style latent transformer backbone with the following structure:

  • Patch embedding followed by $L$ transformer blocks and patch-unembedding.
  • Multi-head self-attention, feed-forward MLPs, and explicit timestep conditioning via sinusoidal embeddings and FiLM layers.
  • Conditioning on both $x_0$ and $x_1$ is handled natively inside self-attention: projections jointly attend to $x_t$ and endpoint features, avoiding the token-length blow-up seen in conditional DiT models.

Model parameters are summarized as follows:

| Task | Parameters | Layers ($L$) | Hidden Dim | Heads | MLP Dim | Conditioning | Fine-tuning |
|---|---|---|---|---|---|---|---|
| Image Editing | 20B | 64 | 4096 | 64 | 16,384 | Endpoints | LoRA adapters (rank 128) |
| Video Generation | 1.3B | 24 | 1024 | 16 | 4096 | Endpoints | Fully fine-tuned |

No projection layer sharing is performed across depth; projection weights are shared across spatial positions only. Training uses the Prodigy optimizer, mixed-precision (FP16), and batch size of 1 latent pair per device, without explicit weight decay (LoRA providing implicit regularization for 20B models) (Tan et al., 28 Nov 2025).

5. Empirical Performance on Conditional Vision Tasks

ViBT achieves state-of-the-art or near state-of-the-art results across several conditional vision domains:

  • Instruction-based Image Editing (ImgEdit-Bench): ViBT (s=0.5) achieves an average score of 3.76, outperforming InstructPix2Pix, Step1X-Edit, FLUX.1 Kontext, UniWorld, and closely matching Qwen-Image-Editing 20B. Performance is enhanced on “Add” and “Style” edit types; qualitative inspection confirms alignment to instructions with preservation of detail.
  • Video Stylization: On a Ditto-1M subset, ViBT yields NIQE=4.328, TOPIQ=0.503, MUSIQ=64.045, MANIQA=0.348, CLIPIQA=0.486, CLIPScore=0.782. Outperforms TokenFlow, InsV2V, and RAVE on these metrics.
  • Depth-to-Video Translation: Benchmarked on VBench, ViBT attains SSIM=0.429, PSNR=11.403, NIQE=4.896, DISTS=0.230, CLIPScore=0.781, VBench=0.71, exceeding the results of ControlVideo, Control-A-Video, VideoComposer, and Wan Fun.
  • Additional tasks: Video colorization and frame interpolation are completed within only 4–8 inference steps, demonstrating fast inference.

A notable efficiency gain comes from halving the token count (e.g., 4096 vs. 8192 tokens at 1024×1024), yielding per-step speedups of 2.3× (images) and 3–4× (videos) relative to conditional DiT models (Tan et al., 28 Nov 2025).
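
As a back-of-the-envelope check (an assumption about where the per-step cost goes, not an analysis from the paper), halving the sequence length bounds the expected speedup between the linear projection/MLP ratio and the quadratic attention ratio:

```python
# Rough per-step cost comparison: a conditional DiT attends over
# target + source tokens (2N), while the bridge model attends over N.
N = 4096                                  # ViBT tokens at 1024x1024 (from the text)
attention_ratio = (2 * N) ** 2 / N ** 2   # self-attention is quadratic in length -> 4.0
mlp_ratio = (2 * N) / N                   # projections/MLPs are linear in length -> 2.0
print(mlp_ratio, attention_ratio)         # reported 2.3x-4x speedups fall in this range
```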

6. Ablation Studies and Comparative Analysis

Several ablation studies highlight the design choices:

  • Loss Functions: The stabilized velocity objective consistently outperforms displacement and naive velocity objectives for both image and video tasks.
  • Noise Scale: For video tasks, a noise scale of $s=1$ or $s=2$ is optimal; for image editing, $s=0.5$ yields superior results. Extreme values of $s$ degrade quality.
  • Sampling: The variance-corrected sampling scheme produces cleaner outputs compared to naive Euler–Maruyama, eliminating sampling artifacts.
  • Timestep Schedule and Inference Steps: A shift of $\gamma=5$ in the timestep schedule and 8–16 inference steps balance speed and quality effectively (see the sketch after this list).
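
The shift parameterization is not spelled out in this summary; the sketch below assumes the shift form commonly used in flow-matching models, $t' = \gamma t / (1 + (\gamma - 1)t)$, purely for illustration.

```python
import numpy as np

def shifted_timesteps(num_steps, gamma=5.0):
    """Hypothetical shifted timestep schedule: maps a uniform grid through
    t' = gamma * t / (1 + (gamma - 1) * t), which packs steps more densely
    near t = 1 (where the bridge drift is largest). The exact schedule used
    by ViBT may differ."""
    t = np.linspace(0.0, 1.0, num_steps + 1)
    return gamma * t / (1.0 + (gamma - 1.0) * t)

print(shifted_timesteps(8, gamma=5.0))
```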

On efficiency-focused tasks, ViBT matches or surpasses traditional diffusion transformers, provided the variance-stabilized objective and variance-corrected sampling scheme are used (Tan et al., 28 Nov 2025).

7. Broader Context and Future Directions

ViBT demonstrates that large-scale Brownian Bridge models can match or exceed diffusion transformer benchmarks in conditional generation, with substantial gains in computational efficiency. The direct data-to-data stochastic modeling paradigm natively supports structured conditionings while minimizing memory and compute overhead. The integration of a stabilized objective function and strictly variance-aware sampling is central to scaling and performance.

This suggests further scaling of bridge-based generative models and variants of the variance stabilization objective may generalize to a wider array of multimodal and high-resolution conditional tasks. A plausible implication is that bridge-based SDEs may supersede or augment current diffusion architectures in resource-constrained or highly-conditional generative vision pipelines (Tan et al., 28 Nov 2025).
