- The paper introduces Matrix3D, an all-in-one generative model that integrates pose estimation, depth prediction, and novel view synthesis.
- It employs a multi-modal diffusion transformer with unified 2D representations, using techniques like Plücker ray maps and masked auto-encoder training.
- Experimental results show superior performance over baselines in relative rotation accuracy, camera center estimation, and image reconstruction tasks.
The paper introduces Matrix3D, an all-in-one generative model for photogrammetry tasks, addressing limitations of traditional pipelines that require dense image collections and involve multiple independently optimized stages. Matrix3D employs a multi-modal diffusion transformer (DiT) to perform pose estimation, depth prediction, and novel view synthesis.
A core component of Matrix3D is its ability to represent data from various modalities in unified 2D representations. Camera geometries are encoded as Plücker ray maps, while 3D structures are represented as 2.5D depth maps. The multi-modal DiT framework is trained using a masked auto-encoder (MAE) approach, where inputs are randomly masked, and the model predicts the missing information. This strategy allows the model to handle varying input sparsity levels and increases the training data volume by utilizing partially complete data samples, such as image-pose and image-depth pairs. The generated predictions can be used to initialize a 3D Gaussian Splatting (3DGS) optimization for the final output.
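To make the Plücker encoding concrete, below is a minimal sketch of how a camera could be rasterized into a 6-channel ray map, assuming a standard pinhole intrinsics matrix and a camera-to-world transform; the function name and conventions are illustrative rather than the paper's exact implementation.

```python
import numpy as np

def plucker_ray_map(K, cam_to_world, height, width):
    """Rasterize a camera into a 6-channel Plucker ray map (direction, moment)."""
    # Pixel centers in homogeneous image coordinates.
    u, v = np.meshgrid(np.arange(width) + 0.5, np.arange(height) + 0.5)
    pix = np.stack([u, v, np.ones_like(u)], axis=-1)            # (H, W, 3)

    # Back-project pixels to camera-space rays, then rotate into world space.
    dirs_cam = pix @ np.linalg.inv(K).T                          # (H, W, 3)
    R, t = cam_to_world[:3, :3], cam_to_world[:3, 3]
    dirs_world = dirs_cam @ R.T
    dirs_world /= np.linalg.norm(dirs_world, axis=-1, keepdims=True)

    # Plucker coordinates of each pixel ray: (direction, origin x direction).
    origins = np.broadcast_to(t, dirs_world.shape)
    moments = np.cross(origins, dirs_world)
    return np.concatenate([dirs_world, moments], axis=-1)        # (H, W, 6)
```

Because the ray map has the same spatial layout as the image, camera geometry can be processed by the same 2D tokenization as RGB and depth maps.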
The method section details the Matrix3D framework, emphasizing its unified probabilistic model, flexible I/O, and multi-task capability. The model incorporates a multi-view encoder and decoder. The encoder processes multi-view/modality conditions, denoted $x_c$, and embeds them into a shared latent space. The decoder processes noisy maps $x_{g,t}$ corresponding to different generation targets at diffusion timestep $t$, where $x_0$ denotes the ground truth. The diffusion model is trained using a v-prediction loss (a minimal code sketch follows the symbol list below):
$$\mathcal{L} = \mathbb{E}_{x_0,\,\epsilon,\,t,\,y}\Big[\big\| D\big(E(x_c),\, x_{g,t},\, t\big) - v \big\|^2\Big],$$

where the velocity target $v$ is defined as

$$v = \alpha_t \epsilon - \sigma_t x_0.$$
Where:
- $\mathcal{L}$ is the loss function
- $\mathbb{E}$ denotes the expectation
- $x_0$ is the ground-truth target
- $\epsilon$ is the Gaussian noise
- $t$ is the diffusion timestep
- $y$ is an arbitrary variable in the expectation
- $D$ is the decoder
- $E$ is the encoder
- $x_c$ denotes the multi-view/modality conditions
- $x_{g,t}$ is the desired generation corrupted by noise $\epsilon$ at timestep $t$
- $v$ is the velocity target
- $\alpha_t$ and $\sigma_t$ are noise-schedule coefficients at timestep $t$
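The v-prediction objective above can be sketched as follows, assuming a standard $\alpha_t$/$\sigma_t$ noise schedule; the `encoder` and `decoder` modules stand in for the multi-view encoder $E$ and decoder $D$ and are placeholders, not the paper's actual implementation.

```python
import torch

def v_prediction_loss(encoder, decoder, x0, x_cond, alpha_t, sigma_t, t):
    """One training step of the v-prediction objective (illustrative sketch)."""
    eps = torch.randn_like(x0)                   # Gaussian noise
    x_g_t = alpha_t * x0 + sigma_t * eps         # noised target x_{g,t}
    v_target = alpha_t * eps - sigma_t * x0      # velocity target v

    cond_tokens = encoder(x_cond)                # shared latent conditions E(x_c)
    v_pred = decoder(cond_tokens, x_g_t, t)      # D(E(x_c), x_{g,t}, t)
    return torch.mean((v_pred - v_target) ** 2)
```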
Modality-specific encoding methods are applied, using the VAE from SDXL to encode RGB images and representing camera poses as Plücker ray maps. Multi-view aligned depth is converted into disparities to ensure a compact data range. Positional encodings, including Rotary Positional Embedding (RoPE) and sinusoidal positional encoding, are used to preserve spatial relationships across viewpoints, patch token positions, and modalities.
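As an illustration of the depth preprocessing, the sketch below converts metric depth into a compactly ranged disparity; the per-sample normalization scheme is an assumption, not the paper's exact recipe.

```python
import torch

def depth_to_disparity(depth, eps=1e-6):
    """Convert metric depth (B, 1, H, W) to per-sample normalized disparity."""
    valid = depth > eps
    disp = torch.zeros_like(depth)
    disp[valid] = 1.0 / depth[valid]             # invalid pixels stay at zero
    # Per-sample max normalization keeps the data range compact (assumption).
    max_disp = disp.flatten(1).max(dim=1).values.clamp(min=eps)
    return disp / max_disp.view(-1, 1, 1, 1)
```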
The masked learning strategy extends beyond masking portions of a single image to masking entire images across multi-view, multi-modal settings. During training, task-specific assignments divide the training tasks into novel view synthesis, pose estimation, and depth prediction, alongside a fully random masking task. A multi-stage training strategy is adopted, progressing from 4-view models at 256 resolution to 8-view models at 512 resolution.
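A hypothetical sampler for this multi-view, multi-modal masking could look like the following; the per-task visibility patterns and the uniform task choice are assumptions made for illustration.

```python
import random

MODALITIES = ("rgb", "pose", "depth")

def sample_training_mask(num_views, task=None):
    """Sample a visibility mask over (view, modality) pairs.

    True means the map is given as a condition; False means it is masked
    and must be predicted.  The per-task patterns below are illustrative.
    """
    task = task or random.choice(["nvs", "pose", "depth", "random"])
    mask = {}
    for v in range(num_views):
        for m in MODALITIES:
            if task == "nvs":      # input image(s) + target poses -> target RGB
                visible = (m == "pose") or (m == "rgb" and v == 0)
            elif task == "pose":   # images of all views -> their poses
                visible = (m == "rgb")
            elif task == "depth":  # posed images -> depth
                visible = (m != "depth")
            else:                  # fully random masking
                visible = random.random() < 0.5
            mask[(v, m)] = visible
    return task, mask
```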
The model is trained on a mixture of six datasets: Objaverse, MVImgNet, CO3D-v2, RealEstate10k, Hypersim, and ARKitScenes, using the modalities available in each dataset. Scene normalization is performed according to dataset type and available modalities; camera normalization sets the first view's camera to the identity while preserving the relative transformations between views.
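The camera normalization step can be sketched as below, assuming camera-to-world matrices as input; relative transformations between views are preserved because every pose is left-multiplied by the same inverse reference pose.

```python
import numpy as np

def normalize_cameras(cams_to_world):
    """Re-express camera-to-world matrices relative to the first view.

    The first view becomes the identity camera; relative transformations
    between any pair of views are unchanged.
    """
    ref_world_to_cam = np.linalg.inv(cams_to_world[0])
    return [ref_world_to_cam @ c2w for c2w in cams_to_world]
```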
For downstream tasks, Matrix3D performs pose estimation, multi-view depth estimation, novel view synthesis, or combinations thereof by feeding the available maps as conditions and initializing the target maps with noise. For single- or few-shot image reconstruction, Matrix3D first completes the missing modalities and additional image viewpoints, then runs a 3DGS optimization tailored to mitigate multi-view inconsistency among the generated images.
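Conceptually, each downstream task amounts to choosing which (view, modality) maps are supplied as conditions and which are initialized with noise and denoised; the sketch below illustrates this bookkeeping with placeholder names and latent shapes.

```python
import torch

def build_inference_inputs(known, num_views, latent_shape=(8, 32, 32)):
    """Split (view, modality) slots into conditions and noise-initialized targets.

    `known` maps (view_index, modality) to available latent maps; every
    missing slot is filled with Gaussian noise and denoised by the model.
    """
    conditions, targets = {}, {}
    for v in range(num_views):
        for m in ("rgb", "pose", "depth"):
            if (v, m) in known:
                conditions[(v, m)] = known[(v, m)]
            else:
                targets[(v, m)] = torch.randn(latent_shape)
    return conditions, targets
```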
The experiments section presents results for pose estimation, novel view synthesis, and depth prediction. For pose estimation, Matrix3D is evaluated on the CO3D dataset and compared with COLMAP, PoseDiffusion, RelPose++, DUSt3R, and RayDiffusion. The model outperforms baselines in relative rotation accuracy and camera center accuracy.
In novel view synthesis, the model is benchmarked on the GSO dataset against Zero123, Zero123XL, SyncDreamer, Wonder3D, and InstantMesh, achieving competitive results in PSNR, SSIM, and LPIPS metrics.
For depth prediction, the model is evaluated on both monocular and multi-view depth prediction tasks. On the DTU dataset, Matrix3D outperforms Metric3D v2 and Depth Anything v2 in monocular depth prediction. For multi-view depth prediction, quantitative results on DTU are reported by back-projecting the predicted depth maps into point clouds and evaluating them with Chamfer distance.
The paper also evaluates 3D reconstruction from single images, comparing Matrix3D with ImageDream, One2345++, IM-3D, and CAT3D in terms of CLIP scores, where it achieves competitive results. The method is further evaluated on 3D reconstruction from sparse-view unposed images, integrating pose estimation and reconstruction into a single pipeline; quantitative results for 3-view reconstruction on the CO3D dataset show successful reconstruction from unposed images.
The paper concludes by highlighting Matrix3D's ability to accept flexible input combinations, which improves output quality when additional information, such as ground-truth depth, is provided for novel view synthesis and pose estimation tasks.