InfiniDepth: Neural Implicit Depth Estimation

Updated 7 January 2026
  • InfiniDepth is a neural architecture that estimates monocular depth using continuous implicit fields, allowing for arbitrary-resolution predictions.
  • It employs a multi-scale local implicit decoder to hierarchically fuse features from different resolutions, achieving state-of-the-art accuracy on synthetic and real-world benchmarks.
  • The design facilitates high-quality geometric representations for novel-view synthesis and fine-detail recovery, outperforming traditional discrete grid approaches.

InfiniDepth is a neural architecture for monocular depth estimation that models depth as a continuous implicit field, enabling arbitrary-resolution and fine-grained depth prediction. By replacing traditional discrete image-grid prediction with neural implicit fields, InfiniDepth provides scalability to any output resolution, addresses fine-detail recovery, and facilitates high-quality geometric representations suitable for novel-view synthesis. The method introduces a multi-scale local implicit decoder to efficiently query depth at continuous coordinates and demonstrates state-of-the-art results across synthetic and real-world benchmarks with particular strength in high-frequency, detail-rich regions (Yu et al., 6 Jan 2026).

1. Mathematical Formulation: Neural Implicit Depth Function

InfiniDepth establishes depth estimation as regression over a continuous 2D domain. The general form of an implicit field is $\mathbf{y} = F_\theta(\mathbf{x})$, where $F_\theta$ denotes a multi-layer perceptron (MLP). For monocular depth estimation conditioned on an image $I$, the mapping becomes

$$d_I(x, y) = N_\theta\big(I, (x, y)\big), \quad (x, y) \in [0, W] \times [0, H]$$

Here, $N_\theta: \mathbb{R}^{H \times W \times 3} \times \mathbb{R}^2 \to \mathbb{R}$ maps an input RGB image and a continuous coordinate to a scalar depth value. The final depth prediction is expressed as

$$d_I(x, y) = \mathrm{MLP}(\mathbf{h}_L)$$

where $\mathbf{h}_L$ is a locally fused feature vector derived through hierarchical multi-scale aggregation.
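To make the formulation concrete, the short PyTorch sketch below implements the generic implicit-field form $\mathbf{y} = F_\theta(\mathbf{x})$ as a coordinate-conditioned MLP and shows how a depth query $d_I(x, y)$ reduces to evaluating that MLP on a locally fused feature for the query point. The class and helper names (`ImplicitField`, `query_depth`, `fuse_fn`) are illustrative assumptions, not the paper's released code.

```python
import torch
import torch.nn as nn

# Generic implicit field y = F_theta(x): an MLP evaluated at continuous inputs.
class ImplicitField(nn.Module):
    def __init__(self, in_dim: int, hidden: int = 256, out_dim: int = 1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, out_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)  # (..., out_dim)

# Depth as a conditional field d_I(x, y) = N_theta(I, (x, y)): the image is encoded
# once, and each continuous coordinate (x, y) is turned into a locally fused feature
# h_L (Section 2) before the MLP head produces a scalar depth.
def query_depth(head: ImplicitField, fuse_fn, image_feats, coords: torch.Tensor) -> torch.Tensor:
    h_L = fuse_fn(image_feats, coords)   # (N, C): fused local features per query point
    return head(h_L).squeeze(-1)         # (N,): depth at the continuous coordinates
```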

2. Multi-Scale Local Implicit Decoder Architecture

Input Encoding

  • Input: RGB image $I \in \mathbb{R}^{H \times W \times 3}$
  • Encoding: Processed using a Vision Transformer (ViT-Large, DINOv3)
  • Feature Extraction: From transformer layers 4, 11, and 23, token maps are projected to feature maps at resolutions

$$f^k \in \mathbb{R}^{h_k \times w_k \times C^k} \quad (k = 1, 2, 3)$$

with $(h_1, w_1) = 4 \times (H/16, W/16)$, $(h_2, w_2) = 2 \times (H/16, W/16)$, $(h_3, w_3) = 1 \times (H/16, W/16)$ and channels $C^1 = 256$, $C^2 = 512$, $C^3 = 1024$ (a feature-pyramid sketch follows below).
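A minimal sketch of this feature-pyramid construction is given below, assuming the backbone already exposes token maps from layers 4, 11, and 23 reshaped to an $H/16 \times W/16$ grid with ViT-Large width 1024. The 1×1 projections and bilinear upsampling used here are stand-ins for whatever reassembly the paper actually employs.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeaturePyramid(nn.Module):
    """Project three ViT token maps to f^k with channels (256, 512, 1024) and
    spatial sizes 4x, 2x, 1x the H/16 x W/16 token grid (names are assumptions)."""

    def __init__(self, vit_dim: int = 1024, channels=(256, 512, 1024), scales=(4, 2, 1)):
        super().__init__()
        self.scales = scales
        self.proj = nn.ModuleList(nn.Conv2d(vit_dim, c, kernel_size=1) for c in channels)

    def forward(self, token_maps):
        # token_maps: three tensors of shape (B, vit_dim, H/16, W/16),
        # taken from transformer layers 4, 11, and 23.
        feats = []
        for tokens, proj, s in zip(token_maps, self.proj, self.scales):
            f = proj(tokens)                              # channel projection to C^k
            if s != 1:
                f = F.interpolate(f, scale_factor=s,
                                  mode="bilinear", align_corners=False)
            feats.append(f)                               # f^k: (B, C^k, h_k, w_k)
        return feats
```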

Continuous Feature Query and Fusion

  • For a continuous coordinate $(x, y)$, corresponding coordinates at each scale are mapped as:

$$(x_k, y_k) = \left(x \cdot \frac{w_k}{W},\; y \cdot \frac{h_k}{H}\right)$$

  • Local $2 \times 2$ neighborhoods $\mathcal{N}_k$ are bilinearly interpolated to produce local descriptors $f^k_{(x, y)}$.
  • Hierarchical fusion proceeds as:

$$\mathbf{h}_{k+1} = \mathrm{FFN}_k\left( f^{k+1}_{(x, y)} + \mathbf{g}_k \odot \mathrm{Linear}(\mathbf{h}_k) \right)$$

  • $\mathrm{Linear}$: projects $\mathbf{h}_k$ to channel space $C^{k+1}$
  • $\mathbf{g}_k \in (0, 1)^{C^{k+1}}$: learnable channel-wise gate
  • $\mathrm{FFN}_k$: two-layer MLP block, expansion factor 4, ReLU activation, residual connection
  • The final fused feature $\mathbf{h}_L \in \mathbb{R}^{1024}$ is passed through a 3-layer MLP (input 1024, hidden 256, ReLU activations, terminal ELU) to yield the scalar depth value (a decoder sketch follows below).
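The bullet points above translate into a small decoder: a bilinear local query per scale, a gated projection-and-FFN fusion step, and the 3-layer MLP head. The PyTorch sketch below follows those shapes but is an assumed layout, not the official implementation; in particular, the gate parameterization (a sigmoid over a learned vector) is one plausible way to keep $\mathbf{g}_k$ in $(0, 1)^{C^{k+1}}$.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def sample_local_feature(fk: torch.Tensor, xy: torch.Tensor, H: int, W: int) -> torch.Tensor:
    # fk: (B, C_k, h_k, w_k); xy: (B, N, 2) continuous coords in [0, W] x [0, H].
    # Bilinear grid_sample interpolates the surrounding 2x2 neighborhood N_k.
    grid = torch.stack([xy[..., 0] / W * 2 - 1, xy[..., 1] / H * 2 - 1], dim=-1)
    out = F.grid_sample(fk, grid.unsqueeze(2), mode="bilinear", align_corners=False)
    return out.squeeze(-1).transpose(1, 2)                # (B, N, C_k)

class FusionStage(nn.Module):
    def __init__(self, c_in: int, c_out: int):
        super().__init__()
        self.linear = nn.Linear(c_in, c_out)              # project h_k to C^{k+1}
        self.gate = nn.Parameter(torch.zeros(c_out))      # g_k, squashed to (0, 1)
        self.ffn = nn.Sequential(                         # FFN_k, expansion factor 4
            nn.Linear(c_out, 4 * c_out), nn.ReLU(), nn.Linear(4 * c_out, c_out))

    def forward(self, f_next: torch.Tensor, h_k: torch.Tensor) -> torch.Tensor:
        z = f_next + torch.sigmoid(self.gate) * self.linear(h_k)
        return z + self.ffn(z)                            # residual connection

class LocalImplicitDecoder(nn.Module):
    def __init__(self, channels=(256, 512, 1024)):
        super().__init__()
        self.stages = nn.ModuleList(
            FusionStage(channels[k], channels[k + 1]) for k in range(len(channels) - 1))
        self.head = nn.Sequential(                        # 3-layer head, terminal ELU
            nn.Linear(channels[-1], 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, 1), nn.ELU())

    def forward(self, feats, xy, H, W):
        h = sample_local_feature(feats[0], xy, H, W)      # h_1 = f^1_(x, y)
        for k, stage in enumerate(self.stages):
            h = stage(sample_local_feature(feats[k + 1], xy, H, W), h)   # h_{k+1}
        return self.head(h).squeeze(-1)                   # (B, N) scalar depths
```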

3. Training Objective and Loss Functions

Supervision is performed on random continuous coordinate–depth tuples $\{(x_i, y_i, d_i)\}_{i=1}^{N}$ with the primary objective

$$\mathcal{L} = \frac{1}{N} \sum_{i=1}^{N} \left|\, \hat{d}_i - d_i \,\right|$$

where $\hat{d}_i = d_I(x_i, y_i)$. Ground-truth depths are normalized to log-space as

$$d_{\text{norm}} = \frac{d_{\log} - d_{\min}}{d_{\max} - d_{\min}}, \quad d_{\log} = \ln d$$

with $d_{\min}, d_{\max}$ the 2nd and 98th percentiles of $\ln d$ per image.
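A minimal sketch of this objective, assuming dense ground-truth depth maps from which random continuous coordinates are sampled. The percentile normalization and $L_1$ term follow the formulas above; `model` and `sample_gt_depth` in the usage comment are hypothetical placeholders.

```python
import torch

def normalize_log_depth(d: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Map a per-image depth map to [0, 1] in log space using 2nd/98th percentiles."""
    d_log = torch.log(d.clamp_min(eps))
    d_min = torch.quantile(d_log.flatten(), 0.02)
    d_max = torch.quantile(d_log.flatten(), 0.98)
    return (d_log - d_min) / (d_max - d_min + eps)

def l1_depth_loss(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """Primary objective: mean absolute error over sampled coordinate-depth tuples."""
    return (pred - target).abs().mean()

# Usage sketch: sample N random continuous coordinates per image, look up their
# normalized ground-truth depths, query the model at those coordinates, take L1.
# coords = torch.rand(N, 2) * torch.tensor([W, H])
# target = sample_gt_depth(normalize_log_depth(depth_map), coords)
# loss = l1_depth_loss(model(image, coords), target)
```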

For the monocular model, no additional gradient or smoothness regularization is applied. In the context of novel-view synthesis with the Gaussian-Splatting head, an additional $L_1$ RGB reconstruction loss and LPIPS perceptual loss are incorporated.

4. Inference and Arbitrary-Resolution Depth Querying

Depth inference does not require fixed-grid upsampling or convolution, enabling pointwise continuous prediction:

  1. The input image is encoded at native resolution to generate feature pyramids $\{f^k\}_{k=1}^{3}$.
  2. To produce a depth map at arbitrary resolution (e.g., 3840×2160), each output point $(x, y)$ is:
    • Mapped to each scale $(x_k, y_k)$
    • Used to bilinearly interpolate local features $f^k_{(x, y)}$
    • Processed through hierarchical fusion into $\mathbf{h}_L$
    • Passed to the MLP head for $\hat{d} = d_I(x, y)$ (a chunked query sketch follows this list)
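A chunked version of this query loop is sketched below; `encoder` and `decoder` stand in for the backbone and the multi-scale local implicit decoder, and the chunk size is an arbitrary choice to bound memory rather than a value from the paper.

```python
import torch

@torch.no_grad()
def render_depth_map(encoder, decoder, image, out_h: int, out_w: int, chunk: int = 262144):
    """Query the implicit field on an arbitrary output grid (e.g. 3840x2160)."""
    B, _, H, W = image.shape
    feats = encoder(image)                               # feature pyramid {f^k}
    # Pixel centers of the target grid, expressed in the source [0, W] x [0, H] domain.
    ys = (torch.arange(out_h, dtype=torch.float32, device=image.device) + 0.5) * H / out_h
    xs = (torch.arange(out_w, dtype=torch.float32, device=image.device) + 0.5) * W / out_w
    gy, gx = torch.meshgrid(ys, xs, indexing="ij")
    coords = torch.stack([gx, gy], dim=-1).reshape(1, -1, 2).expand(B, -1, -1)
    # Evaluate pointwise in chunks; no fixed-grid upsampling or convolution is involved.
    depths = []
    for i in range(0, coords.shape[1], chunk):
        depths.append(decoder(feats, coords[:, i:i + chunk], H, W))   # (B, <=chunk)
    return torch.cat(depths, dim=1).reshape(B, out_h, out_w)
```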

For novel-view synthesis:

  • The implicit field is sampled on a grid, and depths are backprojected to 3D.
  • Per-pixel surface area weights are computed:

$$w(x, y) = d_I(x, y)^2 \,\bigl|\mathbf{n}(x, y) \cdot \mathbf{v}(x, y)\bigr| + \varepsilon$$

where $\mathbf{n}(x, y)$ is the normal (via autograd) and $\mathbf{v}(x, y)$ the view vector.

  • Surface samples are drawn by inverse-CDF sampling, then re-queried to obtain a dense, uniform point cloud suitable for Gaussian splatting (sketched below).
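The sketch below illustrates the surface-weight computation and the inverse-CDF resampling step, assuming a rendered depth grid with per-pixel normals and view directions already available; the weight formula mirrors the expression above, and the helper names are assumptions.

```python
import torch

def surface_area_weights(depth, normals, view_dirs, eps: float = 1e-6):
    """w(x, y) = d_I(x, y)^2 * |n(x, y) . v(x, y)| + eps over a sampled grid.

    depth: (H, W); normals, view_dirs: (H, W, 3), assumed unit-length."""
    cos = (normals * view_dirs).sum(dim=-1).abs()
    return depth ** 2 * cos + eps                         # (H, W) weights

def resample_surface_points(coords, weights, num_samples: int):
    """Inverse-CDF draw of grid coordinates proportional to their surface weights.

    coords: (H*W, 2) continuous (x, y) positions; the returned coordinates can be
    re-queried through the implicit field to densify the point cloud."""
    probs = weights.flatten() / weights.sum()
    cdf = torch.cumsum(probs, dim=0)
    u = torch.rand(num_samples, device=cdf.device)
    idx = torch.searchsorted(cdf, u).clamp(max=cdf.numel() - 1)
    return coords[idx]
```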

5. The Synth4K Benchmark

Benchmarking is conducted on the Synth4K dataset, comprising 4K-resolution frames (3840×2160) from five AAA games with diverse scene geometry and appearance:

Subset      Game Title
Synth4K-1   Cyberpunk 2077
Synth4K-2   Marvel’s Spider-Man 2
Synth4K-3   Miles Morales
Synth4K-4   Dead Island
Synth4K-5   Watch Dogs

Each subset contains hundreds of frames with varied content (indoor/outdoor, lighting, geometry). A high-frequency mask for evaluating fine-detail performance is constructed by applying multi-scale Laplacian energy operators to the depth map, normalizing by the 98th percentile, and sharpening with an exponent $1/\tau$. The top-k highest-energy pixels define the mask regions for fine-detail evaluation.
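One plausible reading of this mask construction is sketched below; the Laplacian scales, the exponent $\tau$, and the top-k fraction are placeholders rather than the benchmark's actual settings.

```python
import torch
import torch.nn.functional as F

def high_frequency_mask(depth: torch.Tensor, scales=(1, 2, 4), tau: float = 0.5,
                        keep_fraction: float = 0.1) -> torch.Tensor:
    """Boolean mask of detail-rich pixels from multi-scale Laplacian energy.

    depth: (1, 1, H, W) ground-truth depth map."""
    lap = torch.tensor([[0., 1., 0.], [1., -4., 1.], [0., 1., 0.]],
                       dtype=depth.dtype, device=depth.device).view(1, 1, 3, 3)
    H, W = depth.shape[-2:]
    energy = torch.zeros_like(depth)
    for s in scales:
        d = F.avg_pool2d(depth, kernel_size=s) if s > 1 else depth
        e = F.conv2d(d, lap, padding=1).abs()             # Laplacian energy at scale s
        energy += F.interpolate(e, size=(H, W), mode="bilinear", align_corners=False)
    # Normalize by the 98th percentile, then sharpen with exponent 1/tau.
    p98 = torch.quantile(energy.flatten(), 0.98).clamp_min(1e-8)
    energy = (energy / p98).clamp(0, 1) ** (1.0 / tau)
    # Keep the top-k highest-energy pixels as the fine-detail evaluation region.
    k = max(1, int(keep_fraction * H * W))
    thresh = torch.topk(energy.flatten(), k).values.min()
    return energy >= thresh
```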

Metrics employed include:

  • Relative depth: $\delta_\tau = \%\{\max(\hat{d}/d^*,\, d^*/\hat{d}) < 1.25^\tau\}$, $\tau \in \{0.5, 1, 2\}$
  • Metric depth: $\delta_\epsilon$ for $\epsilon \in \{0.01, 0.02, 0.04\}$ (both metrics are sketched below)
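Both families of metrics reduce to counting pixels whose prediction-to-ground-truth ratio stays under a threshold; the sketch below assumes already-masked valid pixels, and the metric-depth variant's $1 + \epsilon$ threshold is one common interpretation of $\delta_\epsilon$, flagged here as an assumption.

```python
import torch

def delta_relative(pred: torch.Tensor, gt: torch.Tensor, tau: float = 1.0) -> float:
    """delta_tau: fraction of pixels with max(pred/gt, gt/pred) < 1.25 ** tau."""
    ratio = torch.maximum(pred / gt, gt / pred)
    return (ratio < 1.25 ** tau).float().mean().item()

def delta_metric(pred: torch.Tensor, gt: torch.Tensor, eps: float = 0.01) -> float:
    """delta_eps: fraction of pixels whose depth ratio stays within 1 + eps (assumed)."""
    ratio = torch.maximum(pred / gt, gt / pred)
    return (ratio < 1.0 + eps).float().mean().item()

# Example: delta_1 uses tau = 1 (threshold 1.25); delta_0.5 a threshold of ~1.118.
# scores = {t: delta_relative(pred, gt, t) for t in (0.5, 1.0, 2.0)}
```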

Zero-shot protocol is enforced: all models are tested without fine-tuning. Test images are input at 504×896; for 4K output, baselines are bilinearly upscaled, whereas InfiniDepth is queried directly at the target resolution.

6. Experimental Results and Significance

Zero-Shot Relative Depth Performance

  • On Synth4K, InfiniDepth achieves superior $\delta_{0.5}, \delta_1, \delta_2$ scores across all subsets ($\delta_1 \approx 89\%$; next best: 84–88%).
  • In high-frequency regions representing fine details, InfiniDepth attains $\delta_1 \approx 67.5\%$ versus 66.5% for the second-best method.

Real-World Benchmarks

  • On datasets such as KITTI, ETH3D, NYUv2, ScanNet, and DIODE, InfiniDepth ranks in the top three on all of them and is best on both ETH3D ($\delta_1 = 99.1\%$) and DIODE.

Metric Depth with Sparse LiDAR

  • On Synth4K with 1.5k sparse LiDAR points, InfiniDepth+Prompt (Ours-Metric) produces $\delta_{0.01} \approx 78\%$ (prior best: 65%); for fine details, $\delta_{0.01} \approx 33\%$ (prior best: 21%).
  • On real data: KITTI ($\delta_{0.01} = 63.9\%$ vs. 58.3% for PromptDA) and ETH3D (96.7% vs. 92.8%).

Single-View Novel-View Synthesis

With Infinite Depth Query and a Gaussian-Splatting head, InfiniDepth yields syntheses with fewer holes and artifacts under large viewpoint changes than prior pixel-aligned methods (e.g., ADGaussian), as shown in qualitative examples (Yu et al., 6 Jan 2026).

7. Context and Implications

InfiniDepth replaces conventional grid-based dense prediction with a neural continuous field, directly modeling $d_I(x, y) = N_\theta(I, (x, y))$ and leveraging multi-scale fusion for localized detail retrieval. This enables sub-pixel supervision and resolution-agnostic inference, with no explicit upsampling or convolution in the final stages. The approach attains state-of-the-art results both on synthetic 4K data and established real-world datasets, particularly excelling in geometric detail retrieval. A plausible implication is that this architectural paradigm can generalize to other per-pixel prediction tasks where spatial continuity and resolution flexibility are critical (Yu et al., 6 Jan 2026).
