Vision Bridge Transformer (ViBT)
- ViBT is a generative model that directly maps source to target latent representations via a Brownian Bridge stochastic process.
- It employs a variance-stabilized velocity objective and a variance-corrected sampling scheme to enhance training stability and output quality.
- Scaled to 20B parameters for image editing and 1.3B parameters for video translation, ViBT excels in conditional tasks such as instruction-based image editing, video stylization, and depth-to-video translation.
Vision Bridge Transformer (ViBT) is a large-scale instantiation of Brownian Bridge Models for conditional vision generation tasks. Unlike the traditional “noise-to-vision” paradigm of diffusion models, ViBT models the stochastic trajectory directly between a source latent $\mathbf{x}_0$ and a target latent $\mathbf{x}_1$, embracing a data-to-data translation approach. This is accomplished through a Brownian Bridge stochastic process, parameterized by a Transformer architecture and governed by a variance-stabilized velocity-matching objective and a variance-corrected sampling scheme. ViBT models are scaled to 20B parameters for image editing and 1.3B parameters for video translation, demonstrating empirical advances in efficiency and quality across multiple conditional vision tasks (Tan et al., 28 Nov 2025).
1. Conceptual Foundations and Motivation
Conventional diffusion models iteratively denoise samples drawn from a Gaussian prior $\mathcal{N}(\mathbf{0}, \mathbf{I})$ towards the data distribution $p_{\mathrm{data}}$. Conditioning (e.g., a source image or text prompt) is typically injected via auxiliary tokens or cross-attention mechanisms, which is computationally expensive for high-resolution images or extended videos.
ViBT instead directly learns a stochastic mapping from a source latent $\mathbf{x}_0$ to a target latent $\mathbf{x}_1$ via a Brownian Bridge, fundamentally shifting to a “vision-to-vision” translation paradigm. This approach naturally maintains strong correlations between endpoints, making it inherently suitable for conditioned generation tasks such as instruction-based image editing, style transfer, frame interpolation, and depth-to-video translation. The Brownian Bridge process is conditioned explicitly on the source latent $\mathbf{x}_0$, enabling direct and efficient modeling of semantic or structural transformations (Tan et al., 28 Nov 2025).
2. Mathematical Formulation of the ViBT Brownian Bridge
The generative process is framed as a stochastic differential equation (SDE):

$$\mathrm{d}\mathbf{x}_t = v_\theta(\mathbf{x}_t, t)\,\mathrm{d}t + g_t\,\mathrm{d}\mathbf{w}_t,$$

with boundary conditions $\mathbf{x}_{t=0} = \mathbf{x}_0$ (source) and $\mathbf{x}_{t=1} = \mathbf{x}_1$ (target), where $\mathbf{w}_t$ is standard Brownian motion, $v_\theta$ is a learned velocity field, and $g_t$ is a (possibly time-varying) diffusion coefficient. For the Brownian Bridge specialization, $g_t$ reduces to a constant noise scale $\sigma$.
Given endpoints $(\mathbf{x}_0, \mathbf{x}_1)$, the marginal at time $t \in [0, 1]$ is

$$\mathbf{x}_t \sim \mathcal{N}\big((1-t)\,\mathbf{x}_0 + t\,\mathbf{x}_1,\; \sigma^2\, t(1-t)\,\mathbf{I}\big).$$
The drift at time $t$ is

$$v(\mathbf{x}_t, t) = \frac{\mathbf{x}_1 - \mathbf{x}_t}{1-t},$$

which is parameterized by the transformer network as $v_\theta(\mathbf{x}_t, t)$. This construction ensures the model’s denoising dynamics derive directly from the conditional trajectory between data pairs (Tan et al., 28 Nov 2025).
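For concreteness, the following NumPy sketch draws $\mathbf{x}_t$ from the bridge marginal and evaluates the drift target above; the function names are illustrative placeholders rather than code from the paper.

```python
import numpy as np

def sample_bridge_state(x0, x1, t, sigma, rng):
    """Draw x_t from the bridge marginal N((1-t)*x0 + t*x1, sigma^2 t(1-t) I)."""
    eps = rng.standard_normal(x0.shape)
    return (1.0 - t) * x0 + t * x1 + sigma * np.sqrt(t * (1.0 - t)) * eps

def bridge_drift(x1, xt, t):
    """True bridge drift (x1 - x_t) / (1 - t), the regression target for v_theta."""
    return (x1 - xt) / (1.0 - t)
```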
The discretized sampling scheme corrects for the bridge’s vanishing variance as $t \to 1$: for a step from $t$ to $t' = t + \Delta t$,

$$\mathbf{x}_{t'} = \mathbf{x}_t + \Delta t\, v_\theta(\mathbf{x}_t, t) + \sigma\sqrt{\frac{\Delta t\,(1-t')}{1-t}}\;\boldsymbol{\epsilon}, \qquad \boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I}),$$

matching the bridge covariance exactly, in contrast to naive Euler–Maruyama updates (noise standard deviation $\sigma\sqrt{\Delta t}$), which result in sampling artifacts near the endpoint.
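A minimal sketch of such a variance-corrected sampler is shown below, assuming $v_\theta$ approximates the bridge drift; the uniform timestep grid is an assumption, and the shifted timestep schedule noted in the ablations below is omitted.

```python
import numpy as np

def sample_vibt(v_theta, x_src, sigma, num_steps, rng):
    """Integrate the bridge from the source latent towards the target estimate.

    The mean update matches Euler-Maruyama, but the injected noise uses the
    exact bridge transition variance sigma^2 * dt * (1 - t_next) / (1 - t),
    which vanishes as t -> 1 instead of remaining at sigma^2 * dt.
    """
    ts = np.linspace(0.0, 1.0, num_steps + 1)
    x = x_src.copy()
    for t, t_next in zip(ts[:-1], ts[1:]):
        dt = t_next - t
        x = x + dt * v_theta(x, t)                     # drift step
        if t_next < 1.0:                               # no noise on the final step
            std = sigma * np.sqrt(dt * (1.0 - t_next) / (1.0 - t))
            x = x + std * rng.standard_normal(x.shape)
    return x
```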
3. Training Objectives and Stabilization Techniques
Directly minimizing the expected squared error between the model drift and the true bridge drift,

$$\mathcal{L}_{\mathrm{vel}} = \mathbb{E}\!\left[\Big\lVert v_\theta(\mathbf{x}_t, t) - \frac{\mathbf{x}_1 - \mathbf{x}_t}{1-t}\Big\rVert^2\right],$$

can induce divergence as $t \to 1$ due to the $1/(1-t)$ scaling. Displacement matching (regressing on the endpoint difference $\mathbf{x}_1 - \mathbf{x}_0$ instead) avoids this, but underweights late-timestep dynamics. To address this, ViBT introduces a variance stabilization that rescales the residual by the per-timestep variance of the target,

$$\alpha_t^2 = \frac{\lVert \mathbf{x}_1 - \mathbf{x}_0 \rVert^2}{d} + \frac{\sigma^2 t}{1-t},$$

where $d$ is the latent dimension. The stabilized loss is

$$\mathcal{L}_{\mathrm{ViBT}} = \mathbb{E}_{t,\,\mathbf{x}_0,\,\mathbf{x}_1,\,\boldsymbol{\epsilon}}\!\left[\frac{1}{\alpha_t^2}\,\Big\lVert v_\theta(\mathbf{x}_t, t) - \frac{\mathbf{x}_1 - \mathbf{x}_t}{1-t}\Big\rVert^2\right],$$

with sample $\mathbf{x}_t = (1-t)\,\mathbf{x}_0 + t\,\mathbf{x}_1 + \sigma\sqrt{t(1-t)}\,\boldsymbol{\epsilon}$, $\boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$.
This objective balances variance across timesteps, ensuring stable training and effective coverage of the conditional interpolation dynamics (Tan et al., 28 Nov 2025).
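A minimal sketch of how this stabilized objective can be computed for a single training pair is shown below, assuming the normalizer given above (per-dimension displacement energy plus the $\sigma^2 t/(1-t)$ noise variance of the drift target); it illustrates the weighting idea rather than reproducing the paper's training code, and `v_theta` stands in for the transformer.

```python
import numpy as np

def stabilized_velocity_loss(v_theta, x0, x1, sigma, rng):
    """Variance-stabilized velocity loss for one (source, target) latent pair.

    The squared drift error is divided by a per-timestep variance proxy so
    that late timesteps (where the 1/(1-t) scaling blows up) do not dominate.
    """
    d = x0.size                                    # latent dimension
    t = rng.uniform(1e-3, 1.0 - 1e-3)              # avoid the singular endpoints
    eps = rng.standard_normal(x0.shape)
    xt = (1.0 - t) * x0 + t * x1 + sigma * np.sqrt(t * (1.0 - t)) * eps
    target = (x1 - xt) / (1.0 - t)                 # true bridge drift
    alpha_sq = np.sum((x1 - x0) ** 2) / d + sigma ** 2 * t / (1.0 - t)
    return np.mean((v_theta(xt, t) - target) ** 2) / alpha_sq
```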
4. Transformer Architecture and Large-Scale Scaling
ViBT employs a DiT-style latent transformer backbone with the following structure:
- Patch embedding followed by transformer blocks and patch-unembedding.
- Multi-head self-attention, feed-forward MLPs, and explicit timestep conditioning via sinusoidal embeddings and FiLM layers.
- Conditioning on both $\mathbf{x}_0$ and $\mathbf{x}_1$ is handled natively inside self-attention: the attention projections jointly attend to the current state $\mathbf{x}_t$ and the endpoint features, avoiding the token-length blow-up seen in conditional DiT models.
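As a concrete illustration of the block structure in the list above, the following PyTorch sketch shows a generic DiT-style block with FiLM (adaLN-style) timestep modulation; the class name, modulation layout, and layer sizes are assumptions for illustration, not ViBT's released implementation.

```python
import torch
import torch.nn as nn

class FiLMDiTBlock(nn.Module):
    """Generic DiT-style block: self-attention + MLP with FiLM timestep conditioning."""

    def __init__(self, dim: int, heads: int, mlp_dim: int):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim, elementwise_affine=False)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim, elementwise_affine=False)
        self.mlp = nn.Sequential(nn.Linear(dim, mlp_dim), nn.GELU(), nn.Linear(mlp_dim, dim))
        # FiLM head: timestep embedding -> per-channel scales, shifts, and gates.
        self.film = nn.Linear(dim, 6 * dim)

    def forward(self, x: torch.Tensor, t_emb: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, dim); t_emb: (batch, dim) sinusoidal timestep embedding.
        s1, b1, g1, s2, b2, g2 = self.film(t_emb).chunk(6, dim=-1)
        h = self.norm1(x) * (1 + s1.unsqueeze(1)) + b1.unsqueeze(1)
        x = x + g1.unsqueeze(1) * self.attn(h, h, h, need_weights=False)[0]
        h = self.norm2(x) * (1 + s2.unsqueeze(1)) + b2.unsqueeze(1)
        x = x + g2.unsqueeze(1) * self.mlp(h)
        return x
```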
Model parameters are summarized as follows:
| Task | Parameters | Layers | Hidden Dim | Heads | MLP Dim | Conditioning | Fine-tuning |
|---|---|---|---|---|---|---|---|
| Image Editing | 20B | 64 | 4096 | 64 | 16,384 | Endpoints | LoRA adapters (rank 128) |
| Video Generation | 1.3B | 24 | 1024 | 16 | 4096 | Endpoints | Fully fine-tuned |
No projection-layer sharing is performed across depth; projection weights are shared across spatial positions only. Training uses the Prodigy optimizer, mixed precision (FP16), and a batch size of one latent pair per device, without explicit weight decay (LoRA provides implicit regularization for the 20B model) (Tan et al., 28 Nov 2025).
5. Empirical Performance on Conditional Vision Tasks
ViBT achieves state-of-the-art or near state-of-the-art results across several conditional vision domains:
- Instruction-based Image Editing (ImgEdit-Bench): ViBT (s=0.5) achieves an average score of 3.76, outperforming InstructPix2Pix, Step1X-Edit, FLUX.1 Kontext, and UniWorld, and closely matching Qwen-Image-Editing 20B. Gains are most pronounced on “Add” and “Style” edit types; qualitative inspection confirms alignment with instructions while preserving detail.
- Video Stylization: On a Ditto-1M subset, ViBT yields NIQE=4.328, TOPIQ=0.503, MUSIQ=64.045, MANIQA=0.348, CLIPIQA=0.486, and CLIPScore=0.782, outperforming TokenFlow, InsV2V, and RAVE on these metrics.
- Depth-to-Video Translation: Benchmarked on VBench, ViBT attains SSIM=0.429, PSNR=11.403, NIQE=4.896, DISTS=0.230, CLIPScore=0.781, VBench=0.71, exceeding the results of ControlVideo, Control-A-Video, VideoComposer, and Wan Fun.
- Additional tasks: Video colorization and frame interpolation are completed within only 4–8 inference steps, demonstrating fast inference.
Notable efficiency gains come from halving the token count (e.g., 4096 vs. 8192 tokens at 1024×1024 resolution), yielding per-step speedups of 2.3× for images and 3–4× for videos relative to conditional DiT models (Tan et al., 28 Nov 2025).
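As a rough check of these token counts (assuming the common 8× VAE downsampling and 2×2 patchification, which are not stated explicitly above), a 1024×1024 image maps to $(1024 / 8 / 2)^2 = 64^2 = 4096$ latent tokens; a conditional DiT that concatenates the source image's tokens as extra context doubles the sequence to 8192, whereas the bridge formulation carries the condition in the trajectory's starting point and keeps it at 4096.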
6. Ablation Studies and Comparative Analysis
Several ablation studies highlight the design choices:
- Loss Functions: The stabilized velocity objective consistently outperforms displacement and naive velocity objectives for both image and video tasks.
- Noise Scale: The optimal noise scale differs between video tasks and image editing; values that are too small or too large degrade quality in either setting.
- Sampling: The variance-corrected sampling scheme produces cleaner outputs compared to naive Euler–Maruyama, eliminating sampling artifacts.
- Timestep Schedule and Inference Steps: A shifted timestep schedule combined with 8–16 inference steps balances speed and quality effectively.
Across these ablations, ViBT matches or surpasses conventional diffusion transformers while remaining more efficient, contingent on use of the variance-stabilized objective and the variance-corrected sampling scheme (Tan et al., 28 Nov 2025).
7. Broader Context and Future Directions
ViBT demonstrates that large-scale Brownian Bridge models can match or exceed diffusion transformer benchmarks in conditional generation, with substantial gains in computational efficiency. The direct data-to-data stochastic modeling paradigm natively supports structured conditionings while minimizing memory and compute overhead. The integration of a stabilized objective function and strictly variance-aware sampling is central to scaling and performance.
This suggests that further scaling of bridge-based generative models, along with variants of the variance-stabilization objective, may generalize to a wider array of multimodal and high-resolution conditional tasks. A plausible implication is that bridge-based SDEs may supersede or augment current diffusion architectures in resource-constrained or heavily conditioned generative vision pipelines (Tan et al., 28 Nov 2025).