UViM: Unified Vision Model for Dense Prediction

  • The paper introduces a unified vision model that uses a discrete guiding code and a two-stage training scheme to decouple global structure from local details.
  • It employs a feed-forward Vision Transformer and an autoregressive language model to map high-dimensional inputs into precise dense predictions for tasks such as panoptic segmentation and depth estimation.
  • Empirical results on COCO Panoptic and NYU Depth v2 benchmarks show competitive performance without relying on task-specific architectural modifications.

UViM (Unified Vision Model) for dense prediction refers to a modeling paradigm that produces pixel- or voxel-level structured outputs for various vision tasks using a unified model architecture and a learned discrete “guiding code.” The approach is distinguished by its two-stage learning scheme and its goal of handling diverse structured prediction tasks, including panoptic segmentation and depth estimation, without task-specific architectures or expert-designed modifications. Central to UViM is the decomposition of dense prediction into (I) a feed-forward base model that decodes high-dimensional outputs given both an input image and a compact guiding code, and (II) an autoregressive transformer language model (LM) that generates the guiding code conditioned solely on the image. This architecture reduces the effective output-space complexity and enables a general-purpose, high-capacity framework for vision tasks.

1. Discrete Guiding Code and Two-Stage Training

The defining feature of UViM for dense prediction is its use of a short, quantized discrete code (the guiding code z) that encodes global structure or layout information about the output y. The overall system is trained via a two-stage procedure:

  • Stage I (Oracle-Guided Training):

A restricted oracle Ω maps the ground-truth output y to a guiding code z = Ω(y). The base model f: X × Z → Y is trained to output the dense prediction given both x and z, optimizing a relevant reconstruction loss:

\min_{f,\,\Omega} \ \mathcal{L}\big(f(x, \Omega(y)),\ y\big)

The code z is produced by quantizing continuous features, specifically by mapping each to the nearest codeword in a learned dictionary D = {D_1, ..., D_N} using the VQ-VAE approach,

z_i = \arg\min_j \| e_{(i)} - D_j \|, \quad i = 1, \ldots, n

where e_{(i)} is the embedding for the i-th code element.
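
The nearest-codeword assignment above is the standard VQ-VAE quantization step. The following is a minimal PyTorch sketch, assuming the oracle encoder has already produced n continuous embeddings and a learned codebook of N entries; the names and sizes are illustrative rather than taken from the UViM codebase.

```python
import torch

def quantize(e: torch.Tensor, codebook: torch.Tensor):
    """Map each continuous embedding e_(i) to its nearest codeword D_j.

    e:        (n, d) continuous embeddings from the oracle encoder.
    codebook: (N, d) learned dictionary D = {D_1, ..., D_N}.
    Returns the discrete code z of shape (n,) and the quantized embeddings.
    """
    distances = torch.cdist(e, codebook)   # pairwise L2 distances, (n, N)
    z = distances.argmin(dim=1)            # z_i = argmin_j ||e_(i) - D_j||
    e_quantized = codebook[z]              # replace embeddings by codewords
    # In full VQ-VAE training, a straight-through estimator and commitment
    # loss would be added here so gradients reach the oracle encoder.
    return z, e_quantized

# Illustrative sizes: a code of length n = 256 over a 4096-word dictionary.
z, e_q = quantize(torch.randn(256, 768), torch.randn(4096, 768))
```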

  • Stage II (Code Prediction via Language Model):

An autoregressive transformer-based language model (LM) predicts the code z from the input image x:

p(z \mid x) = \prod_{k=1}^{n} p(z_k \mid z_1, \ldots, z_{k-1},\ x)

The final dense prediction is made as:

\hat{y} = f\big(x,\ \mathrm{LM}(x)\big)

This eliminates oracle dependence during inference. The LM is typically an encoder–decoder transformer: the encoder (e.g., a ViT pre-trained on ImageNet-21k) processes x, and the decoder autoregressively predicts z.
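
At inference time the oracle is discarded: the LM decodes the guiding code token by token, and the base model renders the dense output. Below is a hedged sketch of this composition in PyTorch; lm and base_model are hypothetical callables with the interfaces described above, and greedy decoding is used purely for illustration.

```python
import torch

@torch.no_grad()
def uvim_inference(x, lm, base_model, code_length=256, bos_id=0):
    """Sketch of two-stage inference: z = LM(x), then y_hat = f(x, z).

    x:          input image tensor, e.g. (1, 3, 512, 512).
    lm:         encoder-decoder transformer; lm(x, z_prefix) is assumed to
                return next-token logits over the code dictionary.
    base_model: feed-forward ViT implementing f(x, z) -> dense prediction.
    """
    z = torch.full((1, 1), bos_id, dtype=torch.long)             # start token
    for _ in range(code_length):
        logits = lm(x, z)                                        # (1, len(z), N)
        next_token = logits[:, -1].argmax(dim=-1, keepdim=True)  # greedy pick
        z = torch.cat([z, next_token], dim=1)
    return base_model(x, z[:, 1:])                               # y_hat = f(x, LM(x))
```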

2. Structural and Functional Roles of the Guiding Code

The compact guiding code z plays several interlocking roles:

  • It captures global scene layout and high-level structure, acting as a global “hint” to disambiguate the dense output.
  • It factorizes the joint label dependencies, dramatically reducing the number of output correlations the base model must directly learn. Without z, f would have to capture all long-range coherency directly, which is impractical for dense outputs involving hundreds of thousands of dependent variables.
  • The code is quantized and of fixed, relatively small length (e.g., n = 256), in contrast to the millions of pixels in a typical output mask.
  • A code dropout mechanism is employed during training to encourage the base model not to overly rely on individual code elements (a minimal sketch of one such mechanism follows at the end of this section).

This decomposition allows leveraging the relative strengths of two model classes: feed-forward vision transformers for high-dimensional output mapping, and autoregressive transformers for modeling code interdependencies.
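
Code dropout, referenced in the list above, can be realized by randomly masking whole code positions seen by the base model during Stage I, so that no single element becomes indispensable. The sketch below is one plausible implementation; the exact mechanism and rate used in UViM may differ.

```python
import torch

def code_dropout(z_emb: torch.Tensor, drop_rate: float = 0.1,
                 training: bool = True) -> torch.Tensor:
    """Randomly zero out entire guiding-code positions.

    z_emb: (batch, n, d) embedded guiding code fed to the base model.
    """
    if not training or drop_rate == 0.0:
        return z_emb
    # One Bernoulli keep/drop decision per code position, broadcast over d.
    keep = (torch.rand(z_emb.shape[:2], device=z_emb.device) > drop_rate)
    return z_emb * keep.unsqueeze(-1)
```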

3. Task-Specific Instantiations

Although UViM is task-agnostic in architecture, its training instantiations for dense prediction tasks are notable:

  • Panoptic Segmentation:

The panoptic target y comprises a two-channel mask (semantic class, instance ID). Given output resolutions (e.g., 512 × 512), the output space is extremely high-dimensional. The oracle compresses y to a code z of length 256. The base model and oracle are ViTs, trained with pixel-wise cross-entropy, while the LM is trained to generate z autoregressively. On the COCO Panoptic benchmark, UViM achieves a panoptic quality (PQ) of 45.8, comparable to DETR-R101 (45.1) but lower than specialized models such as Mask2Former (57.8).
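
A two-channel panoptic target scored with pixel-wise cross-entropy can be sketched as follows; the channel layout and the separate class/instance heads are illustrative assumptions rather than the paper's exact encoding.

```python
import torch
import torch.nn.functional as F

def panoptic_reconstruction_loss(sem_logits, inst_logits, target):
    """Pixel-wise cross-entropy over both channels of a panoptic target.

    sem_logits:  (B, num_classes, H, W) semantic-class logits from f(x, z).
    inst_logits: (B, num_instances, H, W) instance-ID logits from f(x, z).
    target:      (B, 2, H, W) long tensor; channel 0 holds the semantic
                 class, channel 1 the instance ID (assumed layout).
    """
    loss_semantic = F.cross_entropy(sem_logits, target[:, 0])
    loss_instance = F.cross_entropy(inst_logits, target[:, 1])
    return loss_semantic + loss_instance
```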

  • Monocular Depth Estimation:

Ground-truth depth is quantized into discrete bins (e.g., 256 bins), and the target is compressed into a guiding code as above. Reconstruction loss is softmax cross-entropy between predicted and ground-truth bins. On NYU Depth v2, UViM achieves an RMSE of approximately 0.467, similar to DenseDepth, using the same vision transformer backbone as in the segmentation setting.
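
Quantizing metric depth into discrete bins and scoring predictions with softmax cross-entropy can be sketched as below; the 0-10 m range and 256 uniform bins are assumed defaults consistent with the description above, not necessarily the paper's exact binning.

```python
import torch
import torch.nn.functional as F

def depth_to_bins(depth, num_bins=256, d_min=0.0, d_max=10.0):
    """Quantize metric depth (B, H, W) into integer bin indices."""
    normalized = (depth.clamp(d_min, d_max) - d_min) / (d_max - d_min)
    return (normalized * (num_bins - 1)).round().long()

def depth_reconstruction_loss(bin_logits, depth):
    """Softmax cross-entropy between predicted and ground-truth depth bins.

    bin_logits: (B, num_bins, H, W) per-pixel logits from f(x, z).
    depth:      (B, H, W) ground-truth metric depth.
    """
    target_bins = depth_to_bins(depth, num_bins=bin_logits.shape[1])
    return F.cross_entropy(bin_logits, target_bins)
```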

  • Colorization (Additional Example):

The same framework is adapted to image colorization, again reusing the unified architecture with only task-appropriate loss adaptation.

4. Training and Computational Details

  • Architecture:

Both the oracle Ω and the base model f are implemented as Vision Transformers (e.g., ViT-B/16).

  • Resolution:

Resolutions are fixed at 512 × 512 for both input and output.

  • Stage I training:

May require hundreds to over a thousand epochs depending on the task (e.g., panoptic segmentation).

  • Stage II training:

Utilizes encoder–decoder transformers, frequently initialized from robust pretraining (e.g., ImageNet).

  • Computational footprint:

Training for panoptic segmentation required approximately 1.9k TPU-v3 hours for Stage I and approximately 0.9k TPU-v3 hours for Stage II.
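
For convenience, the training settings listed above can be gathered into a single configuration object; this is only a summary of the numbers stated in this section, with the dictionary size marked as an assumption.

```python
from dataclasses import dataclass

@dataclass
class UViMTrainingConfig:
    """Summary of the training details described in this section."""
    backbone: str = "ViT-B/16"       # oracle and base model architecture
    resolution: int = 512            # input and output resolution (512 x 512)
    code_length: int = 256           # length n of the guiding code
    codebook_size: int = 4096        # dictionary size N (assumed, not stated above)
    stage1_epochs: int = 1000        # hundreds to over a thousand, task dependent
    lm_init: str = "ImageNet-21k pre-trained ViT encoder"
    stage1_tpu_v3_hours: float = 1900.0   # ~1.9k TPU-v3 hours (panoptic)
    stage2_tpu_v3_hours: float = 900.0    # ~0.9k TPU-v3 hours (panoptic)
```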

5. Quantitative Performance and Comparative Summary

UViM matches or approaches the performance of state-of-the-art single-task models on both panoptic segmentation and depth estimation:

Task / Dataset     UViM Performance    Reference Baseline(s)
COCO Panoptic      45.8 PQ             DETR-R101: 45.1 PQ; Mask2Former: 57.8 PQ
NYU Depth v2       0.467 RMSE          DenseDepth: comparable RMSE

These results are achieved without extensive task-specific tuning or architecture modification. The architecture serves as a general-purpose model for dense vision tasks, indicating the efficacy of guiding-code factorization and unified modeling.

6. Limitations and Domain Implications

UViM’s framework is not without constraints:

  • Both PQ and RMSE still trail highly optimized, specialized approaches; for example, in panoptic segmentation, Mask2Former achieves substantially higher PQ.
  • The need for staged, multi-model training increases overall training pipeline complexity and resource demands.
  • The compactness constraint on the guiding code may limit representation of extremely complex spatial arrangements, especially in tasks with intricate global dependencies.

Nevertheless, the core insight of compressing dense interdependencies into a short learned code and delegating code prediction to an autoregressive LM provides a scalable and universal modeling blueprint. This design obviates the need for painstaking manual architectural tuning and enables a unified, reusable vision system for a variety of dense prediction tasks.

7. Conceptual Summary

UViM for dense prediction integrates a base vision transformer and an autoregressive code-generation model to decompose the challenge of structured, high-dimensional output prediction. Key technical processes include VQ-inspired code quantization, autoregressive modeling of discrete codes, code dropout for robustness, and end-to-end training via reconstruction losses. The result is a train-once, deploy-everywhere model family for segmentation, depth estimation, and related dense tasks, with explicit decoupling of global structure modeling (LM) and high-capacity output mapping (base model). The framework’s effectiveness across task types with minimal adaptation signals a significant step toward universal dense vision models.
