AnySplat: Pose-Free Neural Rendering

Updated 4 August 2025
  • AnySplat is a feed-forward neural rendering framework that predicts 3D Gaussian primitives and camera parameters for novel view synthesis.
  • It employs a transformer-based architecture that jointly estimates scene geometry and camera parameters in a single pass, combined with a differentiable voxelization module that reduces redundant Gaussians.
  • AnySplat delivers competitive quality and speed compared to pose-aware methods, making it highly applicable for VR/AR, robotics, and rapid scene digitization.

AnySplat is a feed-forward neural rendering framework for novel view synthesis from uncalibrated, multi-view image collections. Unlike traditional neural rendering pipelines, which require known camera poses and per-scene optimization, or recent feed-forward models, which become computationally inefficient with dense input views, AnySplat predicts a set of 3D Gaussian primitives encoding both geometry and appearance, together with camera intrinsics and extrinsics for each input image, in a single forward pass. This enables real-time, pose-free scene reconstruction and novel view synthesis at quality levels comparable to state-of-the-art pose-aware baselines (Jiang et al., 29 May 2025).

1. Model Architecture and Gaussian Scene Representation

AnySplat utilizes a transformer-based architecture for joint scene and camera estimation from multiple unposed images. Each image is tokenized by patching (e.g., using DINOv2 features), and per-frame tokens—including a learnable camera token and four register tokens—are appended. The multi-view tokens are processed by an alternating-attention transformer: frame-local attention is applied first, followed by global attention across views, to build a unified scene representation.

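A minimal PyTorch-style sketch of this alternating-attention pattern is shown below; the token dimension, head count, number of patch tokens, and use of nn.MultiheadAttention are illustrative assumptions rather than the paper's exact implementation.

```python
import torch
import torch.nn as nn

class AlternatingAttentionBlock(nn.Module):
    """One frame-local pass followed by one global pass over multi-view tokens.

    Token layout assumption: (batch, num_views, tokens_per_view, dim), where
    tokens_per_view = 1 camera token + 4 register tokens + patch tokens.
    """
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.local_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.global_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_local = nn.LayerNorm(dim)
        self.norm_global = nn.LayerNorm(dim)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        b, v, n, d = tokens.shape
        # Frame-local attention: each view attends only to its own tokens.
        x = tokens.reshape(b * v, n, d)
        h = self.norm_local(x)
        x = x + self.local_attn(h, h, h)[0]
        # Global attention: all tokens from all views attend to each other.
        x = x.reshape(b, v * n, d)
        h = self.norm_global(x)
        x = x + self.global_attn(h, h, h)[0]
        return x.reshape(b, v, n, d)

# Illustrative token assembly: DINOv2 patch tokens per frame plus one camera
# token and four register tokens (all dimensions here are placeholders).
B, V, N, D = 1, 8, 256, 768
patch_tokens = torch.randn(B, V, N, D)
camera_tokens = torch.zeros(B, V, 1, D)      # learnable in the real model
register_tokens = torch.zeros(B, V, 4, D)    # learnable in the real model
tokens = torch.cat([camera_tokens, register_tokens, patch_tokens], dim=2)

out = AlternatingAttentionBlock(dim=D)(tokens)   # (B, V, 5 + N, D)
camera_features = out[:, :, 0]                   # fed to the camera decoder
```
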
The network is structured as follows (a minimal sketch of the decoder heads appears after the list):

  • Camera Decoder (F_C): Processes camera tokens through self-attention layers and a linear projection, predicting camera parameters p_i \in \mathbb{R}^9 (intrinsics and extrinsics).
  • Depth Decoder (F_D): Predicts dense depth maps using a DPT-inspired decoder.
  • Gaussian Parameter Decoder (F_G): Outputs the parameters for each 3D Gaussian primitive.

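A minimal sketch of the camera and Gaussian heads is given below; the DPT-style depth decoder F_D is omitted, and the layer counts, SH degree, and exact 9-parameter camera layout are assumptions for illustration.

```python
import torch
import torch.nn as nn

class CameraDecoder(nn.Module):
    """Sketch of F_C: self-attention over per-frame camera tokens followed by
    a linear projection to a 9-dimensional camera code (intrinsics + extrinsics)."""
    def __init__(self, dim: int, depth: int = 2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.attn = nn.TransformerEncoder(layer, num_layers=depth)
        self.proj = nn.Linear(dim, 9)

    def forward(self, camera_tokens: torch.Tensor) -> torch.Tensor:
        # camera_tokens: (batch, num_views, dim) -> (batch, num_views, 9)
        return self.proj(self.attn(camera_tokens))

class GaussianHead(nn.Module):
    """Sketch of F_G: predicts per-pixel Gaussian attributes from dense features.
    Channel layout: 1 opacity + 4 quaternion + 3 scale + 3*(k+1)^2 SH coefficients."""
    def __init__(self, feat_dim: int, sh_degree: int = 1):
        super().__init__()
        out_ch = 1 + 4 + 3 + 3 * (sh_degree + 1) ** 2
        self.head = nn.Conv2d(feat_dim, out_ch, kernel_size=1)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch * num_views, feat_dim, H, W) -> per-pixel attributes
        return self.head(feats)
```
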
Each 3D Gaussian is parameterized by:

  • Center \mu \in \mathbb{R}^3
  • Opacity \sigma \in \mathbb{R}^+
  • Orientation quaternion r \in \mathbb{R}^4
  • Anisotropic scale s \in \mathbb{R}^3
  • Spherical harmonic coefficients c \in \mathbb{R}^{3 \times (k+1)^2} for view-dependent color

The predicted depth maps and camera poses allow back-projection of image pixels into 3D space, determining the Gaussian centers. The model thus outputs both a comprehensive 3D Gaussian Splat scene and all camera parameters in a single forward pass.

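This lifting of depth to Gaussian centers can be sketched as follows; the pinhole intrinsics convention, the 4x4 camera-to-world matrix, and the tensor shapes are assumptions for illustration.

```python
import torch

def backproject_to_centers(depth: torch.Tensor,
                           K: torch.Tensor,
                           cam_to_world: torch.Tensor) -> torch.Tensor:
    """Lift one view's predicted depth map to Gaussian centers in world space.

    depth:        (H, W) predicted depth
    K:            (3, 3) predicted camera intrinsics
    cam_to_world: (4, 4) predicted camera-to-world extrinsics
    returns:      (H * W, 3) candidate Gaussian centers mu
    """
    H, W = depth.shape
    ys, xs = torch.meshgrid(
        torch.arange(H, dtype=depth.dtype, device=depth.device),
        torch.arange(W, dtype=depth.dtype, device=depth.device),
        indexing="ij")
    # Homogeneous pixel coordinates (u, v, 1) mapped to camera rays via K^{-1}.
    pix = torch.stack([xs, ys, torch.ones_like(xs)], dim=-1)       # (H, W, 3)
    rays = pix @ torch.inverse(K).T                                 # (H, W, 3)
    pts_cam = rays * depth.unsqueeze(-1)                            # (H, W, 3)
    # Transform camera-space points into world space with the predicted pose.
    pts_h = torch.cat([pts_cam, torch.ones_like(depth)[..., None]], dim=-1)
    pts_world = pts_h.reshape(-1, 4) @ cam_to_world.T               # (H*W, 4)
    return pts_world[:, :3]
```
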
2. Novel View Synthesis and Pose-Free Training

AnySplat is expressly designed for uncalibrated inputs, eliminating the dependence on external structure-from-motion or COLMAP pipelines. It jointly infers camera poses and scene geometry, optimizing them together in an end-to-end self-supervised regime, using only image-level losses.

For novel view synthesis, the predicted 3D Gaussians are rendered using a differentiable Gaussian rasterizer, an approach inherited from earlier 3DGS methods. The loss function combines mean squared error (MSE) and perceptual losses on the RGB output with geometric consistency and depth losses.

Key differences from established volumetric neural rendering and SplatDiffusion frameworks include:

  • Complete elimination of per-scene optimization.
  • No requirement for camera annotations at any stage.
  • Unified feed-forward inference for both scene and camera estimation.

Compared to NeRF-like architectures, which require pose supervision and slow optimization-based reconstruction, AnySplat is several orders of magnitude faster, particularly as input view count increases.

3. Differentiable Voxelization and Computational Efficiency

To further enhance efficiency, AnySplat introduces a differentiable voxelization module. Each Gaussian center \mu_g is assigned to a voxel index V^g with voxel size \epsilon:

V^g = \left\lfloor \frac{\mu_g}{\epsilon} \right\rfloor

Within each voxel s, per-Gaussian attributes a_g (e.g., opacity, color) are aggregated using softmax weights derived from a confidence score C_g:

w_{g \to s} = \frac{\exp(C_g)}{\sum_{h: V^h = s} \exp(C_h)}

\bar{a}_s = \sum_{g: V^g = s} w_{g \to s}\, a_g

This technique reduces redundant primitives by 30–70% without compromising output quality, substantially lowering both computation and memory requirements.

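A compact PyTorch sketch of this voxel aggregation is shown below, using an index-based softmax over Gaussians that share a voxel; the attribute layout and the use of torch.unique / index_add_ are implementation assumptions.

```python
import torch

def voxelize_gaussians(centers: torch.Tensor,
                       attrs: torch.Tensor,
                       conf: torch.Tensor,
                       eps: float):
    """Softmax-weighted merging of Gaussians that fall into the same voxel.

    centers: (G, 3) Gaussian centers mu_g
    attrs:   (G, A) per-Gaussian attributes a_g (opacity, color, ...)
    conf:    (G,)   confidence scores C_g
    eps:     voxel edge length epsilon
    returns: (S, A) aggregated attributes and (S, 3) integer voxel coordinates
    """
    # V^g = floor(mu_g / eps): hard voxel assignment per Gaussian.
    vox = torch.floor(centers / eps).long()                            # (G, 3)
    uniq, inverse = torch.unique(vox, dim=0, return_inverse=True)      # voxel id per Gaussian
    S = uniq.shape[0]
    # w_{g->s} = exp(C_g) / sum_{h: V^h = s} exp(C_h), computed stably.
    exp_c = torch.exp(conf - conf.max())                               # (G,)
    denom = torch.zeros(S, dtype=exp_c.dtype, device=conf.device)
    denom.index_add_(0, inverse, exp_c)
    w = exp_c / denom[inverse]                                         # (G,)
    # \bar{a}_s = sum_{g: V^g = s} w_{g->s} a_g
    agg = torch.zeros(S, attrs.shape[1], dtype=attrs.dtype, device=attrs.device)
    agg.index_add_(0, inverse, w.unsqueeze(-1) * attrs)
    return agg, uniq
```

The voxel assignment itself is a hard floor, while gradients flow through the confidence-based weights and the aggregated attributes.
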
4. Performance Metrics and Qualitative Results

Comprehensive zero-shot evaluations on VRNeRF, Deep Blending, and CO3Dv2 demonstrate that AnySplat matches or exceeds the performance of pose-aware baselines in both sparse and dense view regimes, surpassing prior pose-free approaches.

Table: Key quantitative benchmarks (Jiang et al., 29 May 2025)

| Scenario | PSNR (dB) | Reconstruction Time | Pose Requirement |
|---|---|---|---|
| Sparse-view (3 views) | ~20.63 | Seconds | None (predicted) |
| Dense-view (32–64 views) | ~23.0 | 1.4–4.1 seconds | None (predicted) |
| Baseline (3DGS/Mip-Splatting) | ~22–24 (reference) | ~10 min per-scene optimization | Externally supplied |

In qualitative comparisons, AnySplat’s synthesized views exhibit sharper details and reduced misalignment or ghosting artifacts, particularly in regions of significant parallax or ambiguous geometry, largely due to its improved pose prediction and depth estimation.

Pose estimation performance is on par with dedicated predictors such as VGGT; in addition, AnySplat outputs a robust, multi-view-consistent scene structure, which improves rendered view fidelity.

5. Self-Supervised Losses and Training Regime

The total loss function incorporates multiple consistency and supervision terms:

  • RGB (Photometric) Loss: L_{rgb} on rendered versus ground-truth images.
  • Geometry Consistency: L_g = \frac{1}{N} \sum (D_i[M] - \hat{D}_i[M])^2, comparing the depth D_i predicted from the input views with the rendered depth \hat{D}_i, masked to high-confidence regions M.
  • Camera Distillation Loss: L_p = \frac{1}{N} \sum \|\tilde{p}_i - p_i\|_\epsilon, where \tilde{p}_i is a pseudo ground-truth pose from a pretrained VGGT network and p_i is the predicted camera.
  • Total Loss: L = L_{rgb} + \lambda_2 L_g + \lambda_3 L_p + \lambda_4 L_d, where L_d denotes the depth loss term.

All supervision is self-consistent and derived from input images, allowing AnySplat to operate with zero explicit pose labels during training or inference.

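A hedged sketch of how these terms could be combined in training code follows; the perceptual backbone, the smooth-L1 stand-in for the epsilon-norm, the placeholder L_d term, and all weight values are assumptions.

```python
import torch
import torch.nn.functional as F

def anysplat_total_loss(rendered_rgb, gt_rgb,
                        rendered_depth, pred_depth, conf_mask,
                        pred_cams, vggt_cams,
                        lam_g=0.1, lam_p=0.1, lam_d=0.1, perceptual=None):
    """Combine the self-supervised terms described above (weights are placeholders)."""
    # L_rgb: photometric loss on rendered vs. ground-truth images,
    # optionally augmented with a perceptual term (e.g. an LPIPS module).
    l_rgb = F.mse_loss(rendered_rgb, gt_rgb)
    if perceptual is not None:
        l_rgb = l_rgb + perceptual(rendered_rgb, gt_rgb).mean()
    # L_g: geometry consistency between predicted and rendered depth,
    # restricted to the high-confidence mask M.
    l_g = F.mse_loss(rendered_depth[conf_mask], pred_depth[conf_mask])
    # L_p: camera distillation against pseudo ground-truth poses from VGGT.
    l_p = F.smooth_l1_loss(pred_cams, vggt_cams)
    # L_d: additional depth term in the total loss; its exact form is not
    # specified in this summary, so it is left as a placeholder here.
    l_d = torch.zeros((), device=gt_rgb.device)
    return l_rgb + lam_g * l_g + lam_p * l_p + lam_d * l_d
```
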
6. Applications, Limitations, and Future Directions

AnySplat is readily applicable to:

  • Interactive VR/AR and live 3D scene reconstruction due to its real-time performance.
  • Online content creation and rapid scene digitization, leveraging only uncalibrated captures from commodity cameras.
  • Robotics and autonomous navigation where unknown and unconstrained views predominate.
  • Large-scale digitization projects in fields such as urban mapping and cultural heritage.

Current limitations include the handling of thin structures, specular surfaces, and featureless regions (e.g., sky), which remain challenging due to inherent ambiguities in the image-to-3D lifting process. Increasing training data diversity, enhancing robustness to repetitive textures, and further optimizing the primitive-resolution trade-off are identified as areas for future work. Adaptations allowing greater flexibility in patch size and even faster reduction of the Gaussian count are also proposed to further scale the approach.

7. Comparative Context and Significance

AnySplat advances the Gaussian Splatting paradigm by offering a genuinely real-time, pose-free solution that requires no external optimization or calibration. In contrast, teacher-guided diffusion approaches such as SplatDiffusion (Peng et al., 1 Dec 2024) employ generative denoising to refine teacher predictions under 2D supervision but continue to rely on external teacher models for input geometry and are not fully feed-forward or end-to-end pose-predictive.

By unifying scene geometry, appearance, and camera estimation in a single transformer-based feed-forward pass, and coupling this with differentiable voxelization, AnySplat establishes a new baseline for practical, high-quality 3D scene reconstruction from unconstrained multi-view capture. Its impact is particularly pronounced in scenarios where capture convenience, speed, and the absence of camera metadata are priority requirements.