InfiniDepth: Neural Implicit Depth Estimation

Updated 7 January 2026
  • InfiniDepth is a neural architecture that estimates monocular depth using continuous implicit fields, allowing for arbitrary-resolution predictions.
  • It employs a multi-scale local implicit decoder to hierarchically fuse features from different resolutions, achieving state-of-the-art accuracy on synthetic and real-world benchmarks.
  • The design facilitates high-quality geometric representations for novel-view synthesis and fine-detail recovery, outperforming traditional discrete grid approaches.

InfiniDepth is a neural architecture for monocular depth estimation that models depth as a continuous implicit field, enabling arbitrary-resolution and fine-grained depth prediction. By replacing traditional discrete image-grid prediction with neural implicit fields, InfiniDepth provides scalability to any output resolution, addresses fine-detail recovery, and facilitates high-quality geometric representations suitable for novel-view synthesis. The method introduces a multi-scale local implicit decoder to efficiently query depth at continuous coordinates and demonstrates state-of-the-art results across synthetic and real-world benchmarks with particular strength in high-frequency, detail-rich regions (Yu et al., 6 Jan 2026).

1. Mathematical Formulation: Neural Implicit Depth Function

InfiniDepth establishes depth estimation as regression over a continuous 2D domain. The general form of an implicit field is $\mathbf{y} = F_\theta(\mathbf{x})$, where $F_\theta$ denotes a multi-layer perceptron (MLP). For monocular depth estimation conditioned on an image $I$, the mapping becomes

$$d_I(x, y) = N_\theta\big(I, (x, y)\big), \quad (x, y) \in [0, W] \times [0, H]$$

Here, $N_\theta: \mathbb{R}^{H \times W \times 3} \times \mathbb{R}^2 \to \mathbb{R}$ maps an input RGB image and a continuous coordinate to a scalar depth value. The final depth prediction is expressed as

$$d_I(x, y) = \mathrm{MLP}(\mathbf{h}_L)$$

where $\mathbf{h}_L$ is a locally fused feature vector derived through hierarchical multi-scale aggregation.
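To make the formulation concrete, the short PyTorch sketch below implements the generic implicit-field form $\mathbf{y} = F_\theta(\mathbf{x})$ as a coordinate-conditioned MLP and shows how a depth query $d_I(x, y)$ reduces to evaluating that MLP on a locally fused feature for the query point. The class and helper names (`ImplicitField`, `query_depth`, `fuse_fn`) are illustrative assumptions, not the paper's released code.

```python
import torch
import torch.nn as nn

# Generic implicit field y = F_theta(x): an MLP evaluated at continuous inputs.
class ImplicitField(nn.Module):
    def __init__(self, in_dim: int, hidden: int = 256, out_dim: int = 1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, out_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)  # (..., out_dim)

# Depth as a conditional field d_I(x, y) = N_theta(I, (x, y)): the image is encoded
# once, and each continuous coordinate (x, y) is turned into a locally fused feature
# h_L (Section 2) before the MLP head produces a scalar depth.
def query_depth(head: ImplicitField, fuse_fn, image_feats, coords: torch.Tensor) -> torch.Tensor:
    h_L = fuse_fn(image_feats, coords)   # (N, C): fused local features per query point
    return head(h_L).squeeze(-1)         # (N,): depth at the continuous coordinates
```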

2. Multi-Scale Local Implicit Decoder Architecture

Input Encoding

  • Input: RGB image $I \in \mathbb{R}^{H \times W \times 3}$
  • Encoding: Processed using a Vision Transformer (ViT-Large, DINOv3)
  • Feature Extraction: From transformer layers 4, 11, and 23, token maps are projected to feature maps at resolutions

$$f^k \in \mathbb{R}^{h_k \times w_k \times C^k} \quad (k = 1, 2, 3)$$

with $(h_1, w_1) = 4 \times (H/16, W/16)$, $(h_2, w_2) = 2 \times (H/16, W/16)$, $(h_3, w_3) = 1 \times (H/16, W/16)$ and channels $C^1 = 256$, $C^2 = 512$, $C^3 = 1024$ (a feature-pyramid sketch follows below).
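A minimal sketch of this feature-pyramid construction is given below, assuming the backbone already exposes token maps from layers 4, 11, and 23 reshaped to an $H/16 \times W/16$ grid with ViT-Large width 1024. The 1×1 projections and bilinear upsampling used here are stand-ins for whatever reassembly the paper actually employs.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeaturePyramid(nn.Module):
    """Project three ViT token maps to f^k with channels (256, 512, 1024) and
    spatial sizes 4x, 2x, 1x the H/16 x W/16 token grid (names are assumptions)."""

    def __init__(self, vit_dim: int = 1024, channels=(256, 512, 1024), scales=(4, 2, 1)):
        super().__init__()
        self.scales = scales
        self.proj = nn.ModuleList(nn.Conv2d(vit_dim, c, kernel_size=1) for c in channels)

    def forward(self, token_maps):
        # token_maps: three tensors of shape (B, vit_dim, H/16, W/16),
        # taken from transformer layers 4, 11, and 23.
        feats = []
        for tokens, proj, s in zip(token_maps, self.proj, self.scales):
            f = proj(tokens)                              # channel projection to C^k
            if s != 1:
                f = F.interpolate(f, scale_factor=s,
                                  mode="bilinear", align_corners=False)
            feats.append(f)                               # f^k: (B, C^k, h_k, w_k)
        return feats
```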

Continuous Feature Query and Fusion

  • For a continuous coordinate $(x, y)$, corresponding coordinates at each scale are mapped as:

$$(x_k, y_k) = \left(x \cdot \frac{w_k}{W},\; y \cdot \frac{h_k}{H}\right)$$

  • Local $2 \times 2$ neighborhoods $\mathcal{N}_k$ are bilinearly interpolated to produce local descriptors $f^k_{(x, y)}$.
  • Hierarchical fusion proceeds as:

$$\mathbf{h}_{k+1} = \mathrm{FFN}_k\left( f^{k+1}_{(x, y)} + \mathbf{g}_k \odot \mathrm{Linear}(\mathbf{h}_k) \right)$$

  • $\mathrm{Linear}$: projects $\mathbf{h}_k$ to channel space $C^{k+1}$
  • $\mathbf{g}_k \in (0, 1)^{C^{k+1}}$: learnable channel-wise gate
  • $\mathrm{FFN}_k$: two-layer MLP block, expansion factor 4, ReLU activation, residual connection
  • The final fused feature $\mathbf{h}_L \in \mathbb{R}^{1024}$ is passed through a 3-layer MLP (input 1024, hidden 256, ReLU activations, terminal ELU) to yield the scalar depth value (a decoder sketch follows below).
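The bullet points above translate into a small decoder: a bilinear local query per scale, a gated projection-and-FFN fusion step, and the 3-layer MLP head. The PyTorch sketch below follows those shapes but is an assumed layout, not the official implementation; in particular, the gate parameterization (a sigmoid over a learned vector) is one plausible way to keep $\mathbf{g}_k$ in $(0, 1)^{C^{k+1}}$.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def sample_local_feature(fk: torch.Tensor, xy: torch.Tensor, H: int, W: int) -> torch.Tensor:
    # fk: (B, C_k, h_k, w_k); xy: (B, N, 2) continuous coords in [0, W] x [0, H].
    # Bilinear grid_sample interpolates the surrounding 2x2 neighborhood N_k.
    grid = torch.stack([xy[..., 0] / W * 2 - 1, xy[..., 1] / H * 2 - 1], dim=-1)
    out = F.grid_sample(fk, grid.unsqueeze(2), mode="bilinear", align_corners=False)
    return out.squeeze(-1).transpose(1, 2)                # (B, N, C_k)

class FusionStage(nn.Module):
    def __init__(self, c_in: int, c_out: int):
        super().__init__()
        self.linear = nn.Linear(c_in, c_out)              # project h_k to C^{k+1}
        self.gate = nn.Parameter(torch.zeros(c_out))      # g_k, squashed to (0, 1)
        self.ffn = nn.Sequential(                         # FFN_k, expansion factor 4
            nn.Linear(c_out, 4 * c_out), nn.ReLU(), nn.Linear(4 * c_out, c_out))

    def forward(self, f_next: torch.Tensor, h_k: torch.Tensor) -> torch.Tensor:
        z = f_next + torch.sigmoid(self.gate) * self.linear(h_k)
        return z + self.ffn(z)                            # residual connection

class LocalImplicitDecoder(nn.Module):
    def __init__(self, channels=(256, 512, 1024)):
        super().__init__()
        self.stages = nn.ModuleList(
            FusionStage(channels[k], channels[k + 1]) for k in range(len(channels) - 1))
        self.head = nn.Sequential(                        # 3-layer head, terminal ELU
            nn.Linear(channels[-1], 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, 1), nn.ELU())

    def forward(self, feats, xy, H, W):
        h = sample_local_feature(feats[0], xy, H, W)      # h_1 = f^1_(x, y)
        for k, stage in enumerate(self.stages):
            h = stage(sample_local_feature(feats[k + 1], xy, H, W), h)   # h_{k+1}
        return self.head(h).squeeze(-1)                   # (B, N) scalar depths
```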

3. Training Objective and Loss Functions

Supervision is performed on random continuous coordinate–depth tuples $\{(x_i, y_i, d_i)\}_{i=1}^{N}$ with the primary objective

$$\mathcal{L} = \frac{1}{N} \sum_{i=1}^{N} \left|\, \hat{d}_i - d_i \,\right|$$

where $\hat{d}_i = d_I(x_i, y_i)$. Ground-truth depths are normalized to log-space as

$$d_{\text{norm}} = \frac{d_{\log} - d_{\min}}{d_{\max} - d_{\min}}, \quad d_{\log} = \ln d$$

with $d_{\min}, d_{\max}$ the 2nd and 98th percentiles of $\ln d$ per image.
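A minimal sketch of this objective, assuming dense ground-truth depth maps from which random continuous coordinates are sampled. The percentile normalization and $L_1$ term follow the formulas above; `model` and `sample_gt_depth` in the usage comment are hypothetical placeholders.

```python
import torch

def normalize_log_depth(d: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Map a per-image depth map to [0, 1] in log space using 2nd/98th percentiles."""
    d_log = torch.log(d.clamp_min(eps))
    d_min = torch.quantile(d_log.flatten(), 0.02)
    d_max = torch.quantile(d_log.flatten(), 0.98)
    return (d_log - d_min) / (d_max - d_min + eps)

def l1_depth_loss(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """Primary objective: mean absolute error over sampled coordinate-depth tuples."""
    return (pred - target).abs().mean()

# Usage sketch: sample N random continuous coordinates per image, look up their
# normalized ground-truth depths, query the model at those coordinates, take L1.
# coords = torch.rand(N, 2) * torch.tensor([W, H])
# target = sample_gt_depth(normalize_log_depth(depth_map), coords)
# loss = l1_depth_loss(model(image, coords), target)
```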

For the monocular model, no additional gradient or smoothness regularization is applied. In the context of novel-view synthesis with the Gaussian-Splatting head, an additional $L_1$ RGB reconstruction loss and LPIPS perceptual loss are incorporated.

4. Inference and Arbitrary-Resolution Depth Querying

Depth inference does not require fixed-grid upsampling or convolution, enabling pointwise continuous prediction:

  1. The input image is encoded at native resolution to generate feature pyramids $\{f^k\}_{k=1}^{3}$.
  2. To produce a depth map at arbitrary resolution (e.g., 3840×2160), each output point $(x, y)$ is:
    • Mapped to each scale $(x_k, y_k)$
    • Used to bilinearly interpolate local features $f^k_{(x, y)}$
    • Processed through hierarchical fusion into $\mathbf{h}_L$
    • Passed to the MLP head for $\hat{d} = d_I(x, y)$ (a chunked query sketch follows this list)
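A chunked version of this query loop is sketched below; `encoder` and `decoder` stand in for the backbone and the multi-scale local implicit decoder, and the chunk size is an arbitrary choice to bound memory rather than a value from the paper.

```python
import torch

@torch.no_grad()
def render_depth_map(encoder, decoder, image, out_h: int, out_w: int, chunk: int = 262144):
    """Query the implicit field on an arbitrary output grid (e.g. 3840x2160)."""
    B, _, H, W = image.shape
    feats = encoder(image)                               # feature pyramid {f^k}
    # Pixel centers of the target grid, expressed in the source [0, W] x [0, H] domain.
    ys = (torch.arange(out_h, dtype=torch.float32, device=image.device) + 0.5) * H / out_h
    xs = (torch.arange(out_w, dtype=torch.float32, device=image.device) + 0.5) * W / out_w
    gy, gx = torch.meshgrid(ys, xs, indexing="ij")
    coords = torch.stack([gx, gy], dim=-1).reshape(1, -1, 2).expand(B, -1, -1)
    # Evaluate pointwise in chunks; no fixed-grid upsampling or convolution is involved.
    depths = []
    for i in range(0, coords.shape[1], chunk):
        depths.append(decoder(feats, coords[:, i:i + chunk], H, W))   # (B, <=chunk)
    return torch.cat(depths, dim=1).reshape(B, out_h, out_w)
```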

For novel-view synthesis:

  • The implicit field is sampled on a grid, and depths are backprojected to 3D.
  • Per-pixel surface area weights are computed:

$$w(x, y) = d_I(x, y)^2 \,\bigl|\mathbf{n}(x, y) \cdot \mathbf{v}(x, y)\bigr| + \varepsilon$$

where $\mathbf{n}(x, y)$ is the normal (via autograd) and $\mathbf{v}(x, y)$ the view vector.

  • Surface samples are drawn by inverse-CDF sampling, then re-queried to obtain a dense, uniform point cloud suitable for Gaussian splatting (sketched below).
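The sketch below illustrates the surface-weight computation and the inverse-CDF resampling step, assuming a rendered depth grid with per-pixel normals and view directions already available; the weight formula mirrors the expression above, and the helper names are assumptions.

```python
import torch

def surface_area_weights(depth, normals, view_dirs, eps: float = 1e-6):
    """w(x, y) = d_I(x, y)^2 * |n(x, y) . v(x, y)| + eps over a sampled grid.

    depth: (H, W); normals, view_dirs: (H, W, 3), assumed unit-length."""
    cos = (normals * view_dirs).sum(dim=-1).abs()
    return depth ** 2 * cos + eps                         # (H, W) weights

def resample_surface_points(coords, weights, num_samples: int):
    """Inverse-CDF draw of grid coordinates proportional to their surface weights.

    coords: (H*W, 2) continuous (x, y) positions; the returned coordinates can be
    re-queried through the implicit field to densify the point cloud."""
    probs = weights.flatten() / weights.sum()
    cdf = torch.cumsum(probs, dim=0)
    u = torch.rand(num_samples, device=cdf.device)
    idx = torch.searchsorted(cdf, u).clamp(max=cdf.numel() - 1)
    return coords[idx]
```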

5. The Synth4K Benchmark

Benchmarking is conducted on the Synth4K dataset, comprising 4K-resolution frames (3840×2160) from five AAA games with diverse scene geometry and appearance:

Subset      Game Title
Synth4K-1   Cyberpunk 2077
Synth4K-2   Marvel’s Spider-Man 2
Synth4K-3   Miles Morales
Synth4K-4   Dead Island
Synth4K-5   Watch Dogs

Each subset contains hundreds of frames with varied content (indoor/outdoor, lighting, geometry). A high-frequency mask for evaluating fine-detail performance is constructed by applying multi-scale Laplacian energy operators to the depth map, normalizing by the 98th percentile, and sharpening with an exponent $1/\tau$. The top-k highest-energy pixels define the mask regions for fine-detail evaluation.
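One plausible reading of this mask construction is sketched below; the Laplacian scales, the exponent $\tau$, and the top-k fraction are placeholders rather than the benchmark's actual settings.

```python
import torch
import torch.nn.functional as F

def high_frequency_mask(depth: torch.Tensor, scales=(1, 2, 4), tau: float = 0.5,
                        keep_fraction: float = 0.1) -> torch.Tensor:
    """Boolean mask of detail-rich pixels from multi-scale Laplacian energy.

    depth: (1, 1, H, W) ground-truth depth map."""
    lap = torch.tensor([[0., 1., 0.], [1., -4., 1.], [0., 1., 0.]],
                       dtype=depth.dtype, device=depth.device).view(1, 1, 3, 3)
    H, W = depth.shape[-2:]
    energy = torch.zeros_like(depth)
    for s in scales:
        d = F.avg_pool2d(depth, kernel_size=s) if s > 1 else depth
        e = F.conv2d(d, lap, padding=1).abs()             # Laplacian energy at scale s
        energy += F.interpolate(e, size=(H, W), mode="bilinear", align_corners=False)
    # Normalize by the 98th percentile, then sharpen with exponent 1/tau.
    p98 = torch.quantile(energy.flatten(), 0.98).clamp_min(1e-8)
    energy = (energy / p98).clamp(0, 1) ** (1.0 / tau)
    # Keep the top-k highest-energy pixels as the fine-detail evaluation region.
    k = max(1, int(keep_fraction * H * W))
    thresh = torch.topk(energy.flatten(), k).values.min()
    return energy >= thresh
```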

Metrics employed include:

  • Relative depth: $\delta_\tau = \%\{\max(\hat{d}/d^*,\, d^*/\hat{d}) < 1.25^\tau\}$, $\tau \in \{0.5, 1, 2\}$
  • Metric depth: $\delta_\epsilon$ for $\epsilon \in \{0.01, 0.02, 0.04\}$ (both metrics are sketched below)
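Both families of metrics reduce to counting pixels whose prediction-to-ground-truth ratio stays under a threshold; the sketch below assumes already-masked valid pixels, and the metric-depth variant's $1 + \epsilon$ threshold is one common interpretation of $\delta_\epsilon$, flagged here as an assumption.

```python
import torch

def delta_relative(pred: torch.Tensor, gt: torch.Tensor, tau: float = 1.0) -> float:
    """delta_tau: fraction of pixels with max(pred/gt, gt/pred) < 1.25 ** tau."""
    ratio = torch.maximum(pred / gt, gt / pred)
    return (ratio < 1.25 ** tau).float().mean().item()

def delta_metric(pred: torch.Tensor, gt: torch.Tensor, eps: float = 0.01) -> float:
    """delta_eps: fraction of pixels whose depth ratio stays within 1 + eps (assumed)."""
    ratio = torch.maximum(pred / gt, gt / pred)
    return (ratio < 1.0 + eps).float().mean().item()

# Example: delta_1 uses tau = 1 (threshold 1.25); delta_0.5 a threshold of ~1.118.
# scores = {t: delta_relative(pred, gt, t) for t in (0.5, 1.0, 2.0)}
```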

Zero-shot protocol is enforced: all models are tested without fine-tuning. Test images are input at 504×896; for 4K output, baselines are bilinearly upscaled, whereas InfiniDepth is queried directly at the target resolution.

6. Experimental Results and Significance

Zero-Shot Relative Depth Performance

  • On Synth4K, InfiniDepth achieves superior $\delta_{0.5}, \delta_1, \delta_2$ scores across all subsets ($\delta_1 \approx 89\%$; next best: 84–88%).
  • In high-frequency regions representing fine details, InfiniDepth attains $\delta_1 \approx 67.5\%$ versus 66.5% for the second-best method.

Real-World Benchmarks

  • On datasets such as KITTI, ETH3D, NYUv2, ScanNet, and DIODE, InfiniDepth ranks in the top three on all of them and is best on both ETH3D ($\delta_1 = 99.1\%$) and DIODE.

Metric Depth with Sparse LiDAR

  • On Synth4K with 1.5k sparse LiDAR points, InfiniDepth+Prompt (Ours-Metric) produces $\delta_{0.01} \approx 78\%$ (prior best: 65%); for fine details, $\delta_{0.01} \approx 33\%$ (prior best: 21%).
  • On real data: KITTI ($\delta_{0.01} = 63.9\%$ vs. 58.3% for PromptDA) and ETH3D (96.7% vs. 92.8%).

Single-View Novel-View Synthesis

With Infinite Depth Query and a Gaussian-Splatting head, InfiniDepth yields syntheses with fewer holes and artifacts under large viewpoint changes than prior pixel-aligned methods (e.g., ADGaussian), as shown in qualitative examples (Yu et al., 6 Jan 2026).

7. Context and Implications

InfiniDepth replaces conventional grid-based dense prediction with a neural continuous field, directly modeling $d_I(x, y) = N_\theta(I, (x, y))$ and leveraging multi-scale fusion for localized detail retrieval. This enables sub-pixel supervision and resolution-agnostic inference, with no explicit upsampling or convolution in the final stages. The approach attains state-of-the-art results both on synthetic 4K data and established real-world datasets, particularly excelling in geometric detail retrieval. A plausible implication is that this architectural paradigm can generalize to other per-pixel prediction tasks where spatial continuity and resolution flexibility are critical (Yu et al., 6 Jan 2026).
