- The paper introduces RayZer, a self-supervised transformer model that learns to predict camera poses and scene representations from unposed images, driven solely by photometric consistency between rendered and input views.
- RayZer achieves novel view synthesis quality comparable to supervised methods on real datasets without requiring costly 3D ground-truth data like camera poses or explicit 3D geometry during training.
- This self-supervised approach enables view synthesis from readily available unposed image collections, offering scalability and robustness to noisy data compared to methods reliant on precise 3D annotations.
RayZer (2505.00702) is a self-supervised large multi-view 3D vision model designed to perform novel view synthesis from unposed and uncalibrated images without requiring any 3D supervision (like camera poses or 3D geometry) during training. The paper aims to break free from the limitations of traditional 3D vision methods and recent large reconstruction models that heavily rely on accurate ground-truth 3D annotations, which are often costly to obtain or noisy (e.g., from COLMAP).
The core idea behind RayZer is a self-supervised learning framework where the model predicts camera parameters and a scene representation from unposed input images and then uses these self-predicted parameters to render images. The supervision signal comes solely from the photometric difference between the rendered images and the original input images. This can be viewed as a 3D-aware image auto-encoding process.
Here's a breakdown of the model architecture and implementation details:
Self-Supervised Framework:
- Input: A set of unposed and uncalibrated multi-view images I.
- Data Split: The input images I are randomly split into two non-overlapping sets: I_A and I_B.
- Processing:
- RayZer first predicts camera parameters (intrinsics and poses) for all images in I.
- It then uses the predicted camera parameters for I_A, together with the images in I_A, to predict a latent scene representation.
- Finally, it uses the predicted latent scene representation and the predicted camera parameters for I_B to render the target images Î_B.
- Supervision: A photometric loss (MSE + perceptual loss) is computed between the rendered images Î_B and the corresponding ground-truth images I_B. This loss drives the self-supervised learning process without needing any external 3D annotations.
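In PyTorch-like pseudocode, one training step of this framework might look like the sketch below. The module names (`estimate_cameras`, `reconstruct_scene`, `render`), the perceptual-loss weight, and the choice of an LPIPS-style perceptual loss are illustrative assumptions, not details taken from the paper.

```python
import torch
import torch.nn.functional as F

def training_step(rayzer, perceptual_loss, images, num_input):
    """One self-supervised step on a batch of unposed views.

    images: (B, N, 3, H, W); rayzer is assumed to expose the three stages
    described above as estimate_cameras / reconstruct_scene / render
    (hypothetical API, for illustration only).
    """
    N = images.shape[1]

    # Randomly split the N views into non-overlapping input (A) and target (B) sets.
    perm = torch.randperm(N)
    idx_a, idx_b = perm[:num_input], perm[num_input:]
    images_a, images_b = images[:, idx_a], images[:, idx_b]

    # 1) Predict intrinsics and relative poses for all views.
    cameras = rayzer.estimate_cameras(images)

    # 2) Build the latent scene representation from set A and its predicted cameras.
    scene = rayzer.reconstruct_scene(images_a, cameras, idx_a)

    # 3) Re-render the held-out set B from its predicted cameras.
    rendered_b = rayzer.render(scene, cameras, idx_b)

    # Photometric supervision only: MSE plus a perceptual term (weight is illustrative).
    loss = F.mse_loss(rendered_b, images_b) \
        + 0.5 * perceptual_loss(rendered_b.flatten(0, 1), images_b.flatten(0, 1))
    return loss
```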
RayZer Model Architecture:
RayZer is built entirely using transformer layers, following the trend of large foundation models in other modalities. The architecture follows a cascaded approach: camera estimation first, followed by scene reconstruction.
- Image Tokenization: Input images are patchified into non-overlapping patches, flattened, and linearly projected into tokens. Sinusoidal spatial positional embeddings and image index positional embeddings are added to these tokens.
- Camera Estimator (E_cam):
- Takes tokenized image features (f) and learnable camera tokens (p) (one per image) as input.
- Uses self-attention transformer layers operating on the union of these token sets.
- Outputs updated camera tokens (p*).
- Pose Prediction: An MLP takes the camera tokens of the current view and of a designated canonical reference view and predicts the relative SE(3) pose. The SO(3) rotation is parameterized using a continuous 6D representation.
- Intrinsics Prediction: An MLP takes the camera token of the canonical view and predicts a single focal length (assuming shared intrinsics, square pixels, and principal point at the center).
- Plücker Ray Conversion: The predicted camera parameters (poses and intrinsics) are converted into pixel-aligned Plücker ray maps, which encode the camera geometry at each pixel as a ray in 3D space and serve as a crucial 3D prior (see the sketch after this list).
- Scene Reconstructor (E_scene):
- Input images from I_A and their corresponding predicted Plücker ray maps are tokenized and fused via an MLP.
- These fused tokens, along with a set of learnable scene tokens (z), are processed by self-attention transformer layers.
- Outputs the learned latent set scene representation (z*). This representation is not explicitly 3D-aware but is learned through the end-to-end process.
- Rendering Decoder (D_render):
- Takes tokenized Plücker rays for a target view and the predicted scene representation (z*) as input.
- Uses self-attention transformer layers to fuse this information.
- An MLP decodes the updated ray tokens into patch-level RGB values for the target image.
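The 6D rotation parameterization and the Plücker ray conversion mentioned above can be sketched as follows. This is a minimal illustration assuming a camera-to-world rotation, a camera center `t` in world coordinates, a single focal length in pixels, and a principal point at the image center; the exact conventions in the paper may differ.

```python
import torch
import torch.nn.functional as F

def rotation_from_6d(r6):
    """Continuous 6D -> rotation matrix via Gram-Schmidt (columns b1, b2, b3)."""
    a1, a2 = r6[..., :3], r6[..., 3:]
    b1 = F.normalize(a1, dim=-1)
    b2 = F.normalize(a2 - (b1 * a2).sum(-1, keepdim=True) * b1, dim=-1)
    b3 = torch.linalg.cross(b1, b2, dim=-1)
    return torch.stack([b1, b2, b3], dim=-1)             # (..., 3, 3)

def plucker_ray_map(r6, t, focal, H, W):
    """Pixel-aligned Plücker rays (direction d, moment o x d) for one camera.

    r6: (6,) rotation parameters, t: (3,) camera center (world frame, assumed),
    focal: scalar focal length in pixels.
    """
    R = rotation_from_6d(r6)                              # camera-to-world rotation (assumed)
    ys, xs = torch.meshgrid(
        torch.arange(H, dtype=torch.float32) + 0.5,
        torch.arange(W, dtype=torch.float32) + 0.5,
        indexing="ij",
    )
    # Camera-space ray directions through each pixel center.
    dirs = torch.stack([(xs - W / 2) / focal, (ys - H / 2) / focal, torch.ones_like(xs)], dim=-1)
    dirs = F.normalize(torch.einsum("ij,hwj->hwi", R, dirs), dim=-1)  # rotate into world frame
    origins = t.expand_as(dirs)
    moments = torch.linalg.cross(origins, dirs, dim=-1)   # Plücker moment m = o x d
    return torch.cat([dirs, moments], dim=-1)             # (H, W, 6) ray map
```

These 6-channel ray maps are what condition the scene reconstructor and the rendering decoder on the predicted cameras.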
Practical Implementation Details & Considerations:
- Transformer Backbone: The model uses standard transformer blocks with self-attention. QK-Norm is applied for training stability (a minimal sketch follows this list), and depth-wise initialization is used for the transformer layers.
- Latent Representation: A latent set of 3072 tokens with a dimension of 768 is used for the scene representation and feature processing.
- Computational Resources: Training requires significant resources, reported as 32 A100 GPUs with a total batch size of 256.
- Training Protocol: Uses mixed precision (BF16), FlashAttention-V2, gradient checkpointing for efficiency, warm-up, cosine learning rate decay, and gradient clipping.
- Canonical View: Selecting a middle frame (rather than the first) as the canonical view for pose prediction improved performance, likely due to lower initial pose variance.
- Curriculum Learning: A curriculum that gradually increases the distance range between sampled video frames helps stabilize training, especially for pose estimation.
- Inference: RayZer performs inference in a feed-forward manner, predicting cameras and the scene representation directly, enabling fast novel view synthesis without per-scene optimization.
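As a concrete illustration of the QK-Norm stabilization mentioned above, the sketch below normalizes queries and keys before the attention dot product and replaces the usual 1/sqrt(d) scaling with a learnable temperature. The module layout, the use of L2 normalization, and the initialization are assumptions for illustration, not the paper's exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class QKNormAttention(nn.Module):
    """Multi-head self-attention with normalized queries and keys (QK-Norm).

    Normalizing q and k bounds the attention logits, which helps stabilize
    training of deep transformers.
    """
    def __init__(self, dim, num_heads):
        super().__init__()
        self.num_heads = num_heads
        self.qkv = nn.Linear(dim, 3 * dim)
        self.proj = nn.Linear(dim, dim)
        self.scale = nn.Parameter(torch.tensor(10.0))     # learnable logit scale (assumed init)

    def forward(self, x):                                 # x: (B, N, dim)
        B, N, D = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # Split heads: (B, num_heads, N, D // num_heads).
        q, k, v = (t.view(B, N, self.num_heads, -1).transpose(1, 2) for t in (q, k, v))
        q, k = F.normalize(q, dim=-1), F.normalize(k, dim=-1)   # QK-Norm
        attn = (q @ k.transpose(-2, -1) * self.scale).softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, N, D)
        return self.proj(out)
```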
Performance and Applications:
- Novel View Synthesis: RayZer demonstrates novel view synthesis quality comparable to or even exceeding supervised "oracle" methods (like GS-LRM (Zhang et al., 30 Apr 2024) and LVSM (Jin et al., 22 Oct 2024)) on real-world datasets annotated with COLMAP poses, namely DL3DV-10K (Ling et al., 2024) and RealEstate10K (Zhou et al., 2018).
- Handling Noisy Data: RayZer's self-supervised nature makes it robust to potential inaccuracies in COLMAP annotations that can limit supervised methods.
- Scalability: Being a transformer-based model trained on unlabeled data, RayZer is well-positioned for scaling to larger datasets.
- Pose Awareness: While the learned pose space may not perfectly align with real-world SE(3), it is shown to be geometrically aware and interpolatable (on synthetic data), allowing for novel view synthesis along plausible trajectories.
- Potential Applications: Enabling 3D reconstruction and view synthesis from readily available unposed image collections (e.g., videos, online photo albums) without the need for explicit camera tracking or complex multi-view stereo pipelines. This could impact applications like immersive media creation, virtual tours, and large-scale 3D mapping from internet data.
Limitations:
- Despite strong overall performance, RayZer (and comparable methods) can still struggle with scenes containing intricate geometry, complex materials (like high specularity or transparency), and significant occlusions not present in the input views.
- The learned camera pose space, while 3D-aware for synthesis, does not perfectly recover real-world poses without additional probing or supervision, and the interplay between pose estimation and video frame-interpolation cues in the self-supervised signal requires further study.