InfiniDepth: Neural Implicit Depth Estimation

Updated 7 January 2026
  • InfiniDepth is a neural architecture that estimates monocular depth using continuous implicit fields, allowing for arbitrary-resolution predictions.
  • It employs a multi-scale local implicit decoder to hierarchically fuse features from different resolutions, achieving state-of-the-art accuracy on synthetic and real-world benchmarks.
  • The design facilitates high-quality geometric representations for novel-view synthesis and fine-detail recovery, outperforming traditional discrete grid approaches.

InfiniDepth is a neural architecture for monocular depth estimation that models depth as a continuous implicit field, enabling arbitrary-resolution and fine-grained depth prediction. By replacing traditional discrete image-grid prediction with neural implicit fields, InfiniDepth provides scalability to any output resolution, addresses fine-detail recovery, and facilitates high-quality geometric representations suitable for novel-view synthesis. The method introduces a multi-scale local implicit decoder to efficiently query depth at continuous coordinates and demonstrates state-of-the-art results across synthetic and real-world benchmarks with particular strength in high-frequency, detail-rich regions (Yu et al., 6 Jan 2026).

1. Mathematical Formulation: Neural Implicit Depth Function

InfiniDepth establishes depth estimation as regression over a continuous 2D domain. The general form of an implicit field is

$$\mathbf y = F_\theta(\mathbf x)$$

where $F_\theta$ denotes a multi-layer perceptron (MLP). For monocular depth estimation conditioned on an image $I$, the mapping becomes

$$d_I(x, y) = N_\theta(I, (x, y)),\quad (x, y) \in [0, W] \times [0, H]$$

Here, $N_\theta: \mathbb{R}^{H\times W\times 3} \times \mathbb{R}^2 \to \mathbb{R}$ maps an input RGB image and a continuous coordinate to a scalar depth value. The final depth prediction is expressed as

$$d_I(x,y) = \mathrm{MLP}(\mathbf h_L)$$

where $\mathbf h_L$ is a locally fused feature vector derived through hierarchical multi-scale aggregation.
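As an illustration of the last step, the following is a minimal sketch of an MLP head that maps a fused feature vector to one scalar depth value. It uses ReLU hidden activations and a terminal ELU, as described for the decoder in Section 2. The layer sizes (8, 4, 4, 1), the random initialization, and all function names are assumptions chosen for readability, not the published configuration.

```python
import math
import random

def relu(v):
    """Element-wise ReLU on a list of floats."""
    return [max(0.0, x) for x in v]

def elu(x, alpha=1.0):
    """ELU on a scalar: identity for x > 0, alpha * (exp(x) - 1) otherwise."""
    return x if x > 0 else alpha * (math.exp(x) - 1.0)

def linear(weights, bias, v):
    """Dense layer; `weights` holds one row per output unit."""
    return [sum(w * x for w, x in zip(row, v)) + b for row, b in zip(weights, bias)]

def init_linear(n_in, n_out, rng):
    """Random weights for illustration only (hypothetical initialization)."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

def depth_head(h, params):
    """Three linear layers: ReLU, ReLU, then ELU on the single output."""
    (w1, b1), (w2, b2), (w3, b3) = params
    z = relu(linear(w1, b1, h))
    z = relu(linear(w2, b2, z))
    (out,) = linear(w3, b3, z)
    return elu(out)

rng = random.Random(0)
# Toy sizes; the text reports input 1024 and hidden 256 for the real head.
params = [init_linear(8, 4, rng), init_linear(4, 4, rng), init_linear(4, 1, rng)]
h_L = [rng.uniform(-1.0, 1.0) for _ in range(8)]  # stand-in for the fused feature
depth_value = depth_head(h_L, params)
```

Because the head consumes one feature vector per query point, it can be evaluated at any set of coordinates without reference to a pixel grid.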

2. Multi-Scale Local Implicit Decoder Architecture

Input Encoding

  • Input: RGB image $I \in \mathbb{R}^{H\times W \times 3}$
  • Encoding: Processed using a Vision Transformer (ViT-Large, DINOv3)
  • Feature Extraction: From transformer layers 4, 11, and 23, token maps are projected to feature maps at resolutions

$$f^k \in \mathbb{R}^{h_k \times w_k \times C^k} \quad (k=1,2,3)$$

with $(h_1, w_1) = 4 \times H/16$, […], […] and channels […].

Continuous Feature Query and Fusion

  • For a continuous coordinate $(x, y)$, corresponding coordinates at each scale are mapped as:

[…]

  • Local neighborhoods […] ([…] area) are bilinearly interpolated to produce local descriptors […].
  • Hierarchical fusion proceeds as:

[…]

  • […]: Projects […] to channel space […]
  • […]: Learnable channel-wise gate ([…])
  • […]: Two-layer MLP block, expansion factor 4, ReLU activation, residual connection
  • The final fused feature $\mathbf h_L$ advances through a 3-layer MLP (input 1024, hidden 256, ReLU activations, terminal ELU) to yield the scalar depth value.
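The continuous feature query can be sketched as follows: an image coordinate is rescaled into the units of one feature map, and the four surrounding feature vectors are blended bilinearly. This is a minimal sketch under stated assumptions. The half-pixel offset in `to_scale_coords`, the border clamping, and both function names are illustrative choices, and the gated hierarchical fusion itself is not reproduced here.

```python
import math

def bilinear_sample(feature_map, u, v):
    """Sample a feature map at continuous (u, v), in feature-map pixel units.

    feature_map is a list of rows, each row a list of per-cell channel lists.
    u indexes columns, v indexes rows; both are clamped to the valid range.
    """
    h, w = len(feature_map), len(feature_map[0])
    u = min(max(u, 0.0), w - 1.0)
    v = min(max(v, 0.0), h - 1.0)
    x0, y0 = int(math.floor(u)), int(math.floor(v))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    ax, ay = u - x0, v - y0  # fractional offsets inside the cell
    channels = len(feature_map[0][0])
    return [
        (1 - ax) * (1 - ay) * feature_map[y0][x0][c]
        + ax * (1 - ay) * feature_map[y0][x1][c]
        + (1 - ax) * ay * feature_map[y1][x0][c]
        + ax * ay * feature_map[y1][x1][c]
        for c in range(channels)
    ]

def to_scale_coords(x, y, image_size, map_size):
    """Map (x, y) in [0, W] x [0, H] to feature-map units.

    Assumed convention: feature cell centres sit at half-integer positions.
    """
    W, H = image_size
    w_k, h_k = map_size
    return x * w_k / W - 0.5, y * h_k / H - 0.5

# Toy 2x2 feature map with a single channel.
toy_map = [[[0.0], [1.0]], [[2.0], [3.0]]]
u, v = to_scale_coords(2.0, 2.0, image_size=(4, 4), map_size=(2, 2))
centre_feature = bilinear_sample(toy_map, u, v)  # blends all four cells equally
```

Running the same query against each of the three feature maps would give one local descriptor per scale, which the decoder then fuses.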

3. Training Objective and Loss Functions

Supervision is performed on random continuous coordinate–depth tuples […], with primary objective […], where […]. Ground-truth depths are normalized to log-space as […], with […] the 2nd and 98th percentiles of […] per image.
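A per-image log-space normalization based on the 2nd and 98th percentiles could look like the sketch below. The exact formula is an assumption: here the percentiles are taken on the raw depths, their logarithms define the lower bound and the span, and values outside the percentile range are left unclipped. The helper names and the linear-interpolation percentile are also illustrative.

```python
import math

def percentile(values, q):
    """Percentile with linear interpolation between ranks, q in [0, 100]."""
    ordered = sorted(values)
    if len(ordered) == 1:
        return ordered[0]
    pos = (len(ordered) - 1) * q / 100.0
    lo = int(math.floor(pos))
    hi = min(lo + 1, len(ordered) - 1)
    frac = pos - lo
    return ordered[lo] * (1 - frac) + ordered[hi] * frac

def normalize_log_depth(depths, low_q=2.0, high_q=98.0):
    """Map positive depths to log space, scaled by per-image percentiles.

    Assumed form: (log d - log p_low) / (log p_high - log p_low).
    """
    log_lo = math.log(percentile(depths, low_q))
    log_hi = math.log(percentile(depths, high_q))
    span = max(log_hi - log_lo, 1e-8)  # guard against constant-depth images
    return [(math.log(d) - log_lo) / span for d in depths]

depths = [1.0, 2.0, 4.0, 8.0, 16.0]  # toy ground-truth depths for one image
normalized = normalize_log_depth(depths)
```

Using percentiles rather than the minimum and maximum keeps a few extreme depth values from dominating the normalization range.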

For the monocular model, no additional gradient or smoothness regularization is applied. In the context of novel-view synthesis with the Gaussian-Splatting head, an additional […] RGB reconstruction loss and an LPIPS perceptual loss are incorporated.

4. Inference and Arbitrary-Resolution Depth Querying

Depth inference does not require fixed-grid upsampling or convolution, enabling pointwise continuous prediction:

  1. The input image is encoded at native resolution to generate the feature pyramid $f^k$ ($k = 1, 2, 3$).
  2. To produce a depth map at arbitrary resolution (e.g., 3840×2160), each output point $(x, y)$ is:
    • Mapped to the corresponding coordinates at each scale
    • Used to bilinearly interpolate local features
    • Processed through hierarchical fusion into $\mathbf h_L$
    • Passed to the MLP head for $d_I(x, y)$
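The querying procedure can be sketched independently of the network: given any callable that returns depth at a continuous coordinate, an output grid of arbitrary size is filled by evaluating that callable at each output pixel centre. In the sketch below, `toy_field` is a hypothetical stand-in for the learned field, and the pixel-centre convention is an assumption.

```python
def query_grid(field, image_size, out_size):
    """Evaluate a continuous depth field on an out_w x out_h grid.

    field: callable (x, y) -> depth, with (x, y) in [0, W] x [0, H].
    Each output pixel is queried at its centre, expressed in input coordinates.
    """
    W, H = image_size
    out_w, out_h = out_size
    depth_map = []
    for row in range(out_h):
        y = (row + 0.5) * H / out_h
        depth_map.append([field((col + 0.5) * W / out_w, y) for col in range(out_w)])
    return depth_map

def toy_field(x, y):
    """Hypothetical stand-in for the learned field: a tilted plane."""
    return 1.0 + 0.01 * x + 0.02 * y

# The same field is queried at two output resolutions without any upsampling.
low = query_grid(toy_field, image_size=(896, 504), out_size=(4, 2))
high = query_grid(toy_field, image_size=(896, 504), out_size=(16, 8))
```

The output resolution only changes which coordinates are queried. The encoder features are computed once and reused for every query.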

For novel-view synthesis:

  • The implicit field is sampled on a grid, and depths are backprojected to 3D.
  • Per-pixel surface area weights are computed:

[…]

where […] is the normal (via autograd) and […] the view vector.

  • Surface samples are drawn by inverse-CDF, then re-queried for a dense, uniform point cloud suitable for Gaussian splatting.
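Inverse-CDF sampling from per-pixel weights can be sketched as below: the weights are accumulated into a cumulative sum, and each uniform random draw is mapped back to a pixel index by binary search. The weights in the example are arbitrary placeholders, not the surface-area weights of the method, and the function name is hypothetical.

```python
import bisect
import itertools
import random

def sample_by_inverse_cdf(weights, num_samples, rng):
    """Draw indices with probability proportional to non-negative weights."""
    cdf = list(itertools.accumulate(weights))
    total = cdf[-1]
    if total <= 0:
        raise ValueError("weights must have a positive sum")
    samples = []
    for _ in range(num_samples):
        u = rng.random() * total
        # First index whose cumulative weight exceeds u; zero-weight
        # entries are skipped because their cumulative value never exceeds u.
        idx = bisect.bisect_right(cdf, u)
        samples.append(min(idx, len(weights) - 1))
    return samples

rng = random.Random(0)
weights = [0.0, 1.0, 0.0, 3.0]  # placeholder per-pixel weights
indices = sample_by_inverse_cdf(weights, 2000, rng)
counts = [indices.count(i) for i in range(len(weights))]
```

Pixels with larger weights are drawn more often, so the sampled positions, once re-queried through the continuous field, cover the surface more evenly than a regular image grid would.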

5. The Synth4K Benchmark

Benchmarking is conducted on the Synth4K dataset, comprising 4K-resolution frames (3840×2160) from five AAA games with diverse scene geometry and appearance:

Subset      Game Title
Synth4K-1   Cyberpunk 2077
Synth4K-2   Marvel’s Spider-Man 2
Synth4K-3   Miles Morales
Synth4K-4   Dead Island
Synth4K-5   Watch Dogs

Each subset contains hundreds of frames with varied content (indoor/outdoor, lighting, geometry). A high-frequency mask for evaluating fine-detail performance is constructed by applying multi-scale Laplacian energy operators on the depth map, normalizing via the 98th percentile, and sharpening with an exponent […]. Top-k energy pixels define mask regions for detailed evaluation.
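The mask construction can be sketched in a few steps: sum absolute Laplacian responses over several neighbour distances, divide by the 98th percentile, raise to an exponent, and keep the highest-energy pixels. The 4-neighbour stencil, the choice of distances `(1, 2)`, the exponent `gamma=2.0`, the kept fraction, and the tie handling are all assumptions made for illustration.

```python
def laplacian_energy(depth, step):
    """Absolute 4-neighbour Laplacian, neighbours `step` pixels away (edges clamped)."""
    h, w = len(depth), len(depth[0])
    energy = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            up = depth[max(y - step, 0)][x]
            down = depth[min(y + step, h - 1)][x]
            left = depth[y][max(x - step, 0)]
            right = depth[y][min(x + step, w - 1)]
            energy[y][x] = abs(up + down + left + right - 4.0 * depth[y][x])
    return energy

def high_frequency_mask(depth, steps=(1, 2), gamma=2.0, keep_fraction=0.1):
    """Boolean mask of the highest-energy pixels (ties at the cut-off are kept)."""
    h, w = len(depth), len(depth[0])
    total = [[0.0] * w for _ in range(h)]
    for step in steps:
        e = laplacian_energy(depth, step)
        for y in range(h):
            for x in range(w):
                total[y][x] += e[y][x]
    flat = sorted(v for row in total for v in row)
    p98 = flat[int(0.98 * (len(flat) - 1))]  # nearest-rank 98th percentile
    scale = p98 if p98 > 0 else 1.0
    sharp = [[min(v / scale, 1.0) ** gamma for v in row] for row in total]
    k = max(1, int(keep_fraction * h * w))
    threshold = sorted((v for row in sharp for v in row), reverse=True)[k - 1]
    return [[v >= threshold and v > 0 for v in row] for row in sharp]

# Toy depth map with a vertical depth discontinuity between columns 2 and 3.
depth = [[1.0, 1.0, 1.0, 5.0, 5.0, 5.0] for _ in range(6)]
mask = high_frequency_mask(depth)
```

On this toy input the mask selects only the two columns adjacent to the depth discontinuity, which is the kind of region the fine-detail evaluation targets.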

Metrics employed include:

  • Relative depth: […], […]
  • Metric depth: […] for […]
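The exact metric definitions are not spelled out here. As a reference point, the sketch below implements two measures that are commonly reported for depth estimation: threshold accuracy (the fraction of pixels whose prediction is within a ratio threshold of the ground truth, often written δ with threshold 1.25) and absolute relative error. Treating these as the metrics used in this benchmark is an assumption, and any scale or shift alignment that a relative-depth protocol would apply before scoring is omitted.

```python
def delta_accuracy(pred, gt, threshold=1.25):
    """Fraction of valid pixels with max(pred/gt, gt/pred) < threshold."""
    pairs = [(p, g) for p, g in zip(pred, gt) if p > 0 and g > 0]
    if not pairs:
        raise ValueError("no valid depth pairs")
    hits = sum(1 for p, g in pairs if max(p / g, g / p) < threshold)
    return hits / len(pairs)

def abs_rel(pred, gt):
    """Mean of |pred - gt| / gt over pixels with positive ground truth."""
    pairs = [(p, g) for p, g in zip(pred, gt) if g > 0]
    if not pairs:
        raise ValueError("no valid depth pairs")
    return sum(abs(p - g) / g for p, g in pairs) / len(pairs)

# Toy flattened depth maps (hypothetical values).
pred = [1.0, 2.0, 3.0, 10.0]
gt = [1.0, 2.2, 4.0, 10.0]
acc = delta_accuracy(pred, gt)  # three of four pixels fall within the threshold
err = abs_rel(pred, gt)
```

Restricting both functions to the pixels selected by a high-frequency mask would give the fine-detail variant of each score.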

A zero-shot protocol is enforced: all models are tested without fine-tuning. Test images are input at 504×896; for 4K output, baseline predictions are bilinearly upscaled, whereas InfiniDepth is queried directly at the target resolution.

6. Experimental Results and Significance

Zero-Shot Relative Depth Performance

  • On Synth4K, InfiniDepth achieves superior […] scores across all subsets ([…]; next best: 84–88%).
  • In high-frequency regions representing fine details, InfiniDepth attains […] versus the second-best at 66.5%.

Real-World Benchmarks

  • On datasets such as KITTI, ETH3D, NYUv2, ScanNet, and DIODE, InfiniDepth ranks in the top three on all of them and is best on both ETH3D ([…]) and DIODE.

Metric Depth with Sparse LiDAR

  • Synth4K with 1.5k sparse LiDAR: InfiniDepth+Prompt (Ours-Metric) produces […] (prior best: 65%). For fine details, […] (prior best: 21%).
  • Real data: KITTI ([…] = 63.9% vs 58.3% for PromptDA), ETH3D (96.7% vs 92.8%).

Single-View Novel-View Synthesis

With Infinite Depth Query and a Gaussian-Splatting head, InfiniDepth yields syntheses with fewer holes and artifacts under large viewpoint changes than prior pixel-aligned methods (e.g., ADGaussian), as shown in qualitative examples (Yu et al., 6 Jan 2026).

7. Context and Implications

InfiniDepth replaces conventional grid-based dense prediction with a neural continuous field, directly modeling $d_I(x, y)$ and leveraging multi-scale fusion for localized detail retrieval. This enables sub-pixel supervision and resolution-agnostic inference, with no explicit upsampling or convolution in the final stages. The approach attains state-of-the-art results both on synthetic 4K data and established real-world datasets, particularly excelling in geometric detail retrieval. A plausible implication is that this architectural paradigm can generalize to other per-pixel prediction tasks where spatial continuity and resolution flexibility are critical (Yu et al., 6 Jan 2026).

