Vision Bridge Transformer (ViBT)
- ViBT is a generative model that directly maps source to target latent representations via a Brownian Bridge stochastic process.
- It employs a variance-stabilized velocity objective and a variance-corrected sampling scheme to enhance training stability and output quality.
- Scaled to 20B parameters for image editing and 1.3B parameters for video translation, ViBT excels in conditional tasks such as instruction-based image editing, video stylization, and depth-to-video translation.
Vision Bridge Transformer (ViBT) is a large-scale instantiation of Brownian Bridge Models for conditional vision generation tasks. Unlike the traditional “noise-to-vision” paradigm of diffusion models, ViBT models the stochastic trajectory directly between a source latent $\mathbf{x}_0$ and a target latent $\mathbf{x}_1$, embracing a data-to-data translation approach. This is accomplished through a Brownian Bridge stochastic process, parameterized by a Transformer architecture and governed by a variance-stabilized velocity-matching objective and a variance-corrected sampling scheme. ViBT models are scaled to 20B parameters for image editing and 1.3B parameters for video translation, demonstrating empirical advances in efficiency and quality across multiple conditional vision tasks (Tan et al., 28 Nov 2025).
1. Conceptual Foundations and Motivation
Conventional diffusion models iteratively denoise samples drawn from a Gaussian prior $\mathcal{N}(\mathbf{0}, \mathbf{I})$ towards the data distribution $p_{\mathrm{data}}$. Conditioning (e.g., a source image or text prompt) is typically injected via auxiliary tokens or cross-attention mechanisms, which is computationally expensive for high-resolution images or extended videos.
ViBT instead directly learns a stochastic mapping from a source latent $\mathbf{x}_0$ to a target latent $\mathbf{x}_1$ via a Brownian Bridge, fundamentally shifting to a “vision-to-vision” translation paradigm. This approach naturally maintains strong correlations between endpoints, making it inherently suitable for conditioned generation tasks such as instruction-based image editing, style transfer, frame interpolation, and depth-to-video translation. The Brownian Bridge process is conditioned explicitly on the source latent $\mathbf{x}_0$, enabling direct and efficient modeling of semantic or structural transformations (Tan et al., 28 Nov 2025).
2. Mathematical Formulation of the ViBT Brownian Bridge
The generative process is framed as a stochastic differential equation (SDE):

$$\mathrm{d}\mathbf{x}_t = v_\theta(\mathbf{x}_t, t)\,\mathrm{d}t + g_t\,\mathrm{d}\mathbf{w}_t,$$

with boundary conditions $\mathbf{x}_{t=0} = \mathbf{x}_0$ (source) and $\mathbf{x}_{t=1} = \mathbf{x}_1$ (target), where $\mathbf{w}_t$ is standard Brownian motion, $v_\theta$ is a learned velocity field, and $g_t$ is a (possibly time-varying) diffusion coefficient. For the Brownian Bridge specialization, $g_t$ reduces to a constant noise scale $\sigma$.
Given endpoints $(\mathbf{x}_0, \mathbf{x}_1)$, the marginal at time $t \in [0, 1]$ is

$$\mathbf{x}_t \sim \mathcal{N}\big((1-t)\,\mathbf{x}_0 + t\,\mathbf{x}_1,\; \sigma^2\, t(1-t)\,\mathbf{I}\big).$$
The drift at time $t$ is

$$v(\mathbf{x}_t, t) = \frac{\mathbf{x}_1 - \mathbf{x}_t}{1-t},$$

which is parameterized by the transformer network as $v_\theta(\mathbf{x}_t, t)$. This construction ensures the model’s denoising dynamics derive directly from the conditional trajectory between data pairs (Tan et al., 28 Nov 2025).
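For concreteness, the following NumPy sketch draws $\mathbf{x}_t$ from the bridge marginal and evaluates the drift target above; the function names are illustrative placeholders rather than code from the paper.

```python
import numpy as np

def sample_bridge_state(x0, x1, t, sigma, rng):
    """Draw x_t from the bridge marginal N((1-t)*x0 + t*x1, sigma^2 t(1-t) I)."""
    eps = rng.standard_normal(x0.shape)
    return (1.0 - t) * x0 + t * x1 + sigma * np.sqrt(t * (1.0 - t)) * eps

def bridge_drift(x1, xt, t):
    """True bridge drift (x1 - x_t) / (1 - t), the regression target for v_theta."""
    return (x1 - xt) / (1.0 - t)
```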
The discretized sampling scheme corrects for the bridge’s vanishing variance as $t \to 1$: for a step from $t$ to $t' = t + \Delta t$,

$$\mathbf{x}_{t'} = \mathbf{x}_t + \Delta t\, v_\theta(\mathbf{x}_t, t) + \sigma\sqrt{\frac{\Delta t\,(1-t')}{1-t}}\;\boldsymbol{\epsilon}, \qquad \boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I}),$$

matching the bridge covariance exactly, in contrast to naive Euler–Maruyama updates (noise standard deviation $\sigma\sqrt{\Delta t}$), which result in sampling artifacts near the endpoint.
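A minimal sketch of such a variance-corrected sampler is shown below, assuming $v_\theta$ approximates the bridge drift; the uniform timestep grid is an assumption, and the shifted timestep schedule noted in the ablations below is omitted.

```python
import numpy as np

def sample_vibt(v_theta, x_src, sigma, num_steps, rng):
    """Integrate the bridge from the source latent towards the target estimate.

    The mean update matches Euler-Maruyama, but the injected noise uses the
    exact bridge transition variance sigma^2 * dt * (1 - t_next) / (1 - t),
    which vanishes as t -> 1 instead of remaining at sigma^2 * dt.
    """
    ts = np.linspace(0.0, 1.0, num_steps + 1)
    x = x_src.copy()
    for t, t_next in zip(ts[:-1], ts[1:]):
        dt = t_next - t
        x = x + dt * v_theta(x, t)                     # drift step
        if t_next < 1.0:                               # no noise on the final step
            std = sigma * np.sqrt(dt * (1.0 - t_next) / (1.0 - t))
            x = x + std * rng.standard_normal(x.shape)
    return x
```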
3. Training Objectives and Stabilization Techniques
Directly minimizing the expected squared error between the model drift and the true bridge drift,

$$\mathcal{L}_{\mathrm{vel}} = \mathbb{E}\!\left[\Big\lVert v_\theta(\mathbf{x}_t, t) - \frac{\mathbf{x}_1 - \mathbf{x}_t}{1-t}\Big\rVert^2\right],$$

can induce divergence as $t \to 1$ due to the $1/(1-t)$ scaling. Displacement matching (regressing on the endpoint difference $\mathbf{x}_1 - \mathbf{x}_0$ instead) avoids this, but underweights late-timestep dynamics. To address this, ViBT introduces a variance stabilization that rescales the residual by the per-timestep variance of the target,

$$\alpha_t^2 = \frac{\lVert \mathbf{x}_1 - \mathbf{x}_0 \rVert^2}{d} + \frac{\sigma^2 t}{1-t},$$

where $d$ is the latent dimension. The stabilized loss is

$$\mathcal{L}_{\mathrm{ViBT}} = \mathbb{E}_{t,\,\mathbf{x}_0,\,\mathbf{x}_1,\,\boldsymbol{\epsilon}}\!\left[\frac{1}{\alpha_t^2}\,\Big\lVert v_\theta(\mathbf{x}_t, t) - \frac{\mathbf{x}_1 - \mathbf{x}_t}{1-t}\Big\rVert^2\right],$$

with sample $\mathbf{x}_t = (1-t)\,\mathbf{x}_0 + t\,\mathbf{x}_1 + \sigma\sqrt{t(1-t)}\,\boldsymbol{\epsilon}$, $\boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$.
This objective balances variance across timesteps, ensuring stable training and effective coverage of the conditional interpolation dynamics (Tan et al., 28 Nov 2025).
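A minimal sketch of how this stabilized objective can be computed for a single training pair is shown below, assuming the normalizer given above (per-dimension displacement energy plus the $\sigma^2 t/(1-t)$ noise variance of the drift target); it illustrates the weighting idea rather than reproducing the paper's training code, and `v_theta` stands in for the transformer.

```python
import numpy as np

def stabilized_velocity_loss(v_theta, x0, x1, sigma, rng):
    """Variance-stabilized velocity loss for one (source, target) latent pair.

    The squared drift error is divided by a per-timestep variance proxy so
    that late timesteps (where the 1/(1-t) scaling blows up) do not dominate.
    """
    d = x0.size                                    # latent dimension
    t = rng.uniform(1e-3, 1.0 - 1e-3)              # avoid the singular endpoints
    eps = rng.standard_normal(x0.shape)
    xt = (1.0 - t) * x0 + t * x1 + sigma * np.sqrt(t * (1.0 - t)) * eps
    target = (x1 - xt) / (1.0 - t)                 # true bridge drift
    alpha_sq = np.sum((x1 - x0) ** 2) / d + sigma ** 2 * t / (1.0 - t)
    return np.mean((v_theta(xt, t) - target) ** 2) / alpha_sq
```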
4. Transformer Architecture and Large-Scale Scaling
ViBT employs a DiT-style latent transformer backbone with the following structure:
- Patch embedding followed by transformer blocks and patch-unembedding.
- Multi-head self-attention, feed-forward MLPs, and explicit timestep conditioning via sinusoidal embeddings and FiLM layers.
- Conditioning on both $\mathbf{x}_0$ and $\mathbf{x}_1$ is handled natively inside self-attention: the attention projections jointly attend to the current state $\mathbf{x}_t$ and the endpoint features, avoiding the token-length blow-up seen in conditional DiT models.
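As a concrete illustration of the block structure in the list above, the following PyTorch sketch shows a generic DiT-style block with FiLM (adaLN-style) timestep modulation; the class name, modulation layout, and layer sizes are assumptions for illustration, not ViBT's released implementation.

```python
import torch
import torch.nn as nn

class FiLMDiTBlock(nn.Module):
    """Generic DiT-style block: self-attention + MLP with FiLM timestep conditioning."""

    def __init__(self, dim: int, heads: int, mlp_dim: int):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim, elementwise_affine=False)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim, elementwise_affine=False)
        self.mlp = nn.Sequential(nn.Linear(dim, mlp_dim), nn.GELU(), nn.Linear(mlp_dim, dim))
        # FiLM head: timestep embedding -> per-channel scales, shifts, and gates.
        self.film = nn.Linear(dim, 6 * dim)

    def forward(self, x: torch.Tensor, t_emb: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, dim); t_emb: (batch, dim) sinusoidal timestep embedding.
        s1, b1, g1, s2, b2, g2 = self.film(t_emb).chunk(6, dim=-1)
        h = self.norm1(x) * (1 + s1.unsqueeze(1)) + b1.unsqueeze(1)
        x = x + g1.unsqueeze(1) * self.attn(h, h, h, need_weights=False)[0]
        h = self.norm2(x) * (1 + s2.unsqueeze(1)) + b2.unsqueeze(1)
        x = x + g2.unsqueeze(1) * self.mlp(h)
        return x
```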
Model parameters are summarized as follows:
| Task | Parameters | Layers | Hidden Dim | Heads | MLP Dim | Conditioning | Fine-tuning |
|---|---|---|---|---|---|---|---|
| Image Editing | 20B | 64 | 4096 | 64 | 16,384 | Endpoints | LoRA adapters (rank 128) |
| Video Generation | 1.3B | 24 | 1024 | 16 | 4096 | Endpoints | Fully fine-tuned |
No projection-layer sharing is performed across depth; projection weights are shared across spatial positions only. Training uses the Prodigy optimizer, mixed precision (FP16), and a batch size of one latent pair per device, without explicit weight decay (LoRA provides implicit regularization for the 20B model) (Tan et al., 28 Nov 2025).
5. Empirical Performance on Conditional Vision Tasks
ViBT achieves state-of-the-art or near state-of-the-art results across several conditional vision domains:
- Instruction-based Image Editing (ImgEdit-Bench): ViBT (s=0.5) achieves an average score of 3.76, outperforming InstructPix2Pix, Step1X-Edit, FLUX.1 Kontext, and UniWorld, and closely matching Qwen-Image-Editing 20B. Gains are most pronounced on “Add” and “Style” edit types; qualitative inspection confirms alignment with instructions while preserving detail.
- Video Stylization: On a Ditto-1M subset, ViBT yields NIQE=4.328, TOPIQ=0.503, MUSIQ=64.045, MANIQA=0.348, CLIPIQA=0.486, and CLIPScore=0.782, outperforming TokenFlow, InsV2V, and RAVE on these metrics.
- Depth-to-Video Translation: Benchmarked on VBench, ViBT attains SSIM=0.429, PSNR=11.403, NIQE=4.896, DISTS=0.230, CLIPScore=0.781, VBench=0.71, exceeding the results of ControlVideo, Control-A-Video, VideoComposer, and Wan Fun.
- Additional tasks: Video colorization and frame interpolation are completed within only 4–8 inference steps, demonstrating fast inference.
Notable efficiency gains come from halving the token count (e.g., 4096 vs. 8192 tokens at 1024×1024 resolution), yielding per-step speedups of 2.3× for images and 3–4× for videos relative to conditional DiT models (Tan et al., 28 Nov 2025).
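As a rough check of these token counts (assuming the common 8× VAE downsampling and 2×2 patchification, which are not stated explicitly above), a 1024×1024 image maps to $(1024 / 8 / 2)^2 = 64^2 = 4096$ latent tokens; a conditional DiT that concatenates the source image's tokens as extra context doubles the sequence to 8192, whereas the bridge formulation carries the condition in the trajectory's starting point and keeps it at 4096.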
6. Ablation Studies and Comparative Analysis
Several ablation studies highlight the design choices:
- Loss Functions: The stabilized velocity objective consistently outperforms displacement and naive velocity objectives for both image and video tasks.
- Noise Scale: The optimal noise scale differs between video tasks and image editing; values that are too small or too large degrade quality in either setting.
- Sampling: The variance-corrected sampling scheme produces cleaner outputs compared to naive Euler–Maruyama, eliminating sampling artifacts.
- Timestep Schedule and Inference Steps: A shifted timestep schedule combined with 8–16 inference steps balances speed and quality effectively.
Across these ablations, ViBT matches or surpasses conventional diffusion transformers while remaining more efficient, contingent on use of the variance-stabilized objective and the variance-corrected sampling scheme (Tan et al., 28 Nov 2025).
7. Broader Context and Future Directions
ViBT demonstrates that large-scale Brownian Bridge models can match or exceed diffusion transformer benchmarks in conditional generation, with substantial gains in computational efficiency. The direct data-to-data stochastic modeling paradigm natively supports structured conditionings while minimizing memory and compute overhead. The integration of a stabilized objective function and strictly variance-aware sampling is central to scaling and performance.
This suggests that further scaling of bridge-based generative models, along with variants of the variance-stabilization objective, may generalize to a wider array of multimodal and high-resolution conditional tasks. A plausible implication is that bridge-based SDEs may supersede or augment current diffusion architectures in resource-constrained or heavily conditioned generative vision pipelines (Tan et al., 28 Nov 2025).