InfiniDepth: Neural Implicit Depth Estimation
- InfiniDepth is a neural architecture that estimates monocular depth using continuous implicit fields, allowing for arbitrary-resolution predictions.
- It employs a multi-scale local implicit decoder to hierarchically fuse features from different resolutions, achieving state-of-the-art accuracy on synthetic and real-world benchmarks.
- The design facilitates high-quality geometric representations for novel-view synthesis and fine-detail recovery, outperforming traditional discrete grid approaches.
InfiniDepth is a neural architecture for monocular depth estimation that models depth as a continuous implicit field, enabling arbitrary-resolution and fine-grained depth prediction. By replacing traditional discrete image-grid prediction with neural implicit fields, InfiniDepth provides scalability to any output resolution, addresses fine-detail recovery, and facilitates high-quality geometric representations suitable for novel-view synthesis. The method introduces a multi-scale local implicit decoder to efficiently query depth at continuous coordinates and demonstrates state-of-the-art results across synthetic and real-world benchmarks with particular strength in high-frequency, detail-rich regions (Yu et al., 6 Jan 2026).
1. Mathematical Formulation: Neural Implicit Depth Function
InfiniDepth establishes depth estimation as regression over a continuous 2D domain. The general form of an implicit field is $s = f_\theta(\mathbf{x})$, where $f_\theta$ denotes a multi-layer perceptron (MLP). For monocular depth estimation conditioned on an image $I$, the mapping becomes $d = f_\theta\big(I, (u, v)\big)$: an input RGB image $I$ and a continuous coordinate $(u, v) \in [0, 1]^2$ are mapped to a scalar depth value. The final depth prediction is expressed as $\hat{d}(u, v) = \mathrm{MLP}\big(z(u, v)\big)$, where $z(u, v)$ is a locally fused feature vector derived through hierarchical multi-scale aggregation.
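A minimal PyTorch sketch of this head is given below; the class name `ImplicitDepthField` is illustrative, while the layer sizes follow the 3-layer MLP described in Section 2 (input 1024, hidden 256, ReLU activations, terminal ELU).

```python
import torch
import torch.nn as nn

class ImplicitDepthField(nn.Module):
    """f_theta: fused local feature z(u, v) -> scalar depth d(u, v)."""

    def __init__(self, feature_dim: int = 1024, hidden_dim: int = 256):
        super().__init__()
        # 3-layer MLP head: ReLU activations, terminal ELU (as described in Section 2)
        self.mlp = nn.Sequential(
            nn.Linear(feature_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, 1), nn.ELU(),
        )

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        # z: (N, feature_dim) fused features for N continuous query points (u, v)
        return self.mlp(z).squeeze(-1)  # (N,) scalar depth values
```

In use, $z(u, v)$ comes from the multi-scale query-and-fuse step of Section 2, so depth can be evaluated at any continuous coordinate rather than on a fixed grid.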
2. Multi-Scale Local Implicit Decoder Architecture
Input Encoding
- Input: RGB image
- Encoding: Processed using a Vision Transformer (ViT-Large, DINOv3)
- Feature Extraction: From transformer layers 4, 11, and 23, token maps are projected to feature maps $F_1, F_2, F_3$ at three spatial resolutions with channel dimensions $C_1, C_2, C_3$ (a sketch of this extraction follows the list).
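The intermediate-layer extraction can be sketched as follows, assuming a DINO/timm-style ViT backbone that exposes its transformer blocks as `backbone.blocks`; the projection dimensions, hook mechanics, and per-scale resizing are illustrative placeholders rather than the paper's implementation.

```python
import torch
import torch.nn as nn

def extract_multiscale_features(backbone: nn.Module, image: torch.Tensor,
                                layer_ids=(4, 11, 23), out_dims=(256, 256, 256)):
    """Grab token maps from selected transformer blocks via forward hooks and
    project them to per-scale feature maps F1, F2, F3 (dims are placeholders)."""
    captured, hooks = {}, []
    for i in layer_ids:
        # assumes the ViT exposes its transformer blocks as `backbone.blocks`
        hooks.append(backbone.blocks[i].register_forward_hook(
            lambda _m, _in, out, i=i: captured.__setitem__(i, out)))
    with torch.no_grad():
        backbone(image)
    for h in hooks:
        h.remove()

    feats = []
    for i, dim in zip(layer_ids, out_dims):
        tokens = captured[i][:, 1:, :]            # drop the CLS token
        b, n, c = tokens.shape
        side = int(n ** 0.5)                      # assumes a square patch grid
        fmap = tokens.transpose(1, 2).reshape(b, c, side, side)
        proj = nn.Conv2d(c, dim, kernel_size=1)   # 1x1 projection (learned in practice)
        feats.append(proj(fmap))
    # per-scale up/down-sampling to the target resolutions is omitted here
    return feats                                  # [F1, F2, F3]
```

In practice the projections are trained modules and the three maps are resized to their respective resolutions; the sketch only shows where the features come from.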
Continuous Feature Query and Fusion
- For a continuous query coordinate $(u, v)$, corresponding coordinates at each scale $s$ are mapped as $(u_s, v_s) = (u \cdot W_s,\ v \cdot H_s)$.
- Local neighborhoods around $(u_s, v_s)$ in each $F_s$ are bilinearly interpolated to produce local descriptors $f_s$.
- Hierarchical fusion proceeds coarse-to-fine as $z_s = B_s\big(z_{s-1} + g_s \odot P_s(f_s)\big)$, where:
- $P_s$: projects $f_s$ to the shared channel space
- $g_s$: learnable channel-wise gate (applied via element-wise multiplication $\odot$)
- $B_s$: two-layer MLP block, expansion factor 4, ReLU activation, residual connection
- The final fused feature $z$ is passed through a 3-layer MLP head (input 1024, hidden 256, ReLU activations, terminal ELU) to yield the scalar depth value; a code sketch of the query-and-fuse step is given below.
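The following sketch shows how the continuous query and one gated fusion step could look in PyTorch; `query_features` and `ScaleFusion` are illustrative names, and the exact fusion recurrence is an assumption built from the components listed above, with `F.grid_sample` supplying the bilinear interpolation at continuous coordinates.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def query_features(feature_map: torch.Tensor, coords: torch.Tensor) -> torch.Tensor:
    """Bilinearly sample a (B, C, H, W) feature map at continuous (x, y) coords in [0, 1]^2."""
    grid = coords * 2.0 - 1.0                          # grid_sample expects [-1, 1]
    grid = grid.view(coords.shape[0], -1, 1, 2)        # (B, N, 1, 2)
    out = F.grid_sample(feature_map, grid, mode="bilinear", align_corners=False)
    return out.squeeze(-1).transpose(1, 2)             # (B, N, C) local descriptors f_s

class ScaleFusion(nn.Module):
    """One hierarchical fusion step: projection, channel-wise gate, residual MLP (x4)."""

    def __init__(self, in_dim: int, fused_dim: int):
        super().__init__()
        self.proj = nn.Linear(in_dim, fused_dim)             # P_s: project to shared space
        self.gate = nn.Parameter(torch.ones(fused_dim))      # g_s: learnable channel gate
        self.block = nn.Sequential(                          # B_s: 2-layer MLP, expansion 4
            nn.Linear(fused_dim, 4 * fused_dim), nn.ReLU(),
            nn.Linear(4 * fused_dim, fused_dim),
        )

    def forward(self, prev: torch.Tensor, f_s: torch.Tensor) -> torch.Tensor:
        z = prev + self.gate * self.proj(f_s)                # gated injection of this scale
        return z + self.block(z)                             # residual connection
```

A full decoder would call `query_features` on each of $F_1, F_2, F_3$, fold the resulting descriptors coarse-to-fine through `ScaleFusion`, and hand the final $z$ to the 3-layer MLP head.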
3. Training Objective and Loss Functions
Supervision is performed on random continuous coordinate–depth tuples $\{(u_i, v_i, \bar{d}_i)\}_{i=1}^{N}$, with primary objective $\mathcal{L} = \frac{1}{N}\sum_{i=1}^{N} \ell\big(\hat{d}_i, \bar{d}_i\big)$, where $\hat{d}_i = f_\theta\big(I, (u_i, v_i)\big)$. Ground-truth depths are normalized to log-space as $\bar{d} = \frac{\log d - \log d_{2}}{\log d_{98} - \log d_{2}}$, with $d_{2}$ and $d_{98}$ the 2nd and 98th percentiles of depth per image.
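A short sketch of this supervision under the stated normalization is given below; the use of an L1 penalty for $\ell$ is an assumption, since the text only specifies pointwise supervision at sampled coordinates.

```python
import torch

def normalize_log_depth(depth: torch.Tensor) -> torch.Tensor:
    """Map a per-image depth map to [0, 1] in log space via the 2nd/98th percentiles."""
    log_d = torch.log(depth.clamp(min=1e-6))
    d_lo, d_hi = torch.quantile(log_d, 0.02), torch.quantile(log_d, 0.98)
    return (log_d - d_lo) / (d_hi - d_lo + 1e-8)

def pointwise_depth_loss(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """Pointwise penalty on normalized log depths at randomly sampled continuous
    coordinates; L1 is an assumed choice, not stated in the text."""
    return (pred - target).abs().mean()
```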
For the monocular model, no additional gradient or smoothness regularization is applied. In the context of novel-view synthesis with the Gaussian-Splatting head, an additional RGB reconstruction loss and LPIPS perceptual loss are incorporated.
4. Inference and Arbitrary-Resolution Depth Querying
Depth inference does not require fixed-grid upsampling or convolution, enabling pointwise continuous prediction:
- The input image is encoded once at native resolution to generate the feature pyramid $\{F_1, F_2, F_3\}$.
- To produce a depth map at arbitrary resolution (e.g., 3840×2160), each output point $(u, v)$ is (see the sketch after this list):
- Mapped to per-scale coordinates $(u_s, v_s)$
- Used to bilinearly interpolate local features $f_s$
- Processed through hierarchical fusion into $z$
- Passed to the MLP head for the depth value $\hat{d}(u, v)$
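Under these steps, arbitrary-resolution inference can be sketched as a chunked query over a dense coordinate grid; `encoder` and `decoder` below are placeholders for the feature pyramid and query-and-fuse decoder from the earlier sketches, and the chunking is only a memory convenience.

```python
import torch

def predict_depth_map(encoder, decoder, image: torch.Tensor,
                      out_h: int = 2160, out_w: int = 3840,
                      chunk: int = 262144) -> torch.Tensor:
    """Query the implicit field at every pixel center of a target-resolution grid.
    `encoder(image)` -> multi-scale features; `decoder(feats, coords)` -> depths."""
    feats = encoder(image)                                     # computed once per image
    ys = (torch.arange(out_h) + 0.5) / out_h                   # pixel centers in [0, 1]
    xs = (torch.arange(out_w) + 0.5) / out_w
    gx, gy = torch.meshgrid(xs, ys, indexing="xy")             # each (out_h, out_w)
    coords = torch.stack([gx, gy], dim=-1).reshape(1, -1, 2)   # (1, H*W, 2), (x, y) order
    depths = []
    for start in range(0, coords.shape[1], chunk):             # chunked to bound memory
        depths.append(decoder(feats, coords[:, start:start + chunk]))
    return torch.cat(depths, dim=1).reshape(out_h, out_w)
```

Because depth is queried pointwise at the target resolution, no upsampling of a fixed-size prediction is involved.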
For novel-view synthesis:
- The implicit field is sampled on a grid, and depths are backprojected to 3D.
- Per-pixel surface-area weights are computed as $w \propto \dfrac{d^2}{|\mathbf{n} \cdot \mathbf{v}|}$, where $\mathbf{n}$ is the surface normal (obtained via autograd) and $\mathbf{v}$ the view direction (see the sketch after this list).
- Surface samples are drawn by inverse-CDF, then re-queried for a dense, uniform point cloud suitable for Gaussian splatting.
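A sketch of this sampling step is given below, assuming PyTorch and a pinhole intrinsics matrix `K`; the $d^2 / |\mathbf{n} \cdot \mathbf{v}|$ weight is the standard pixel-footprint area expression and is offered as one reading of the formula above, with `torch.multinomial` standing in for inverse-CDF sampling over the discrete weight distribution.

```python
import torch
import torch.nn.functional as F

def backproject(depth: torch.Tensor, K: torch.Tensor) -> torch.Tensor:
    """Lift an (H, W) depth map to camera-space 3D points using intrinsics K (3x3)."""
    h, w = depth.shape
    v, u = torch.meshgrid(torch.arange(h, dtype=depth.dtype),
                          torch.arange(w, dtype=depth.dtype), indexing="ij")
    pix = torch.stack([u + 0.5, v + 0.5, torch.ones_like(u)], dim=-1)   # (H, W, 3)
    rays = pix @ torch.linalg.inv(K).T                                  # unit-z camera rays
    return rays * depth.unsqueeze(-1)                                   # (H, W, 3) points

def sample_by_area(points: torch.Tensor, normals: torch.Tensor,
                   depth: torch.Tensor, n_samples: int) -> torch.Tensor:
    """Draw surface samples with probability proportional to the per-pixel
    area weight d^2 / |n . v| (weighted sampling via torch.multinomial)."""
    views = F.normalize(points.reshape(-1, 3), dim=-1)        # camera-to-point directions
    n = F.normalize(normals.reshape(-1, 3), dim=-1)
    cos = (n * views).sum(-1).abs().clamp(min=1e-3)
    weights = depth.reshape(-1) ** 2 / cos
    idx = torch.multinomial(weights / weights.sum(), n_samples, replacement=True)
    return points.reshape(-1, 3)[idx]                          # (n_samples, 3) point cloud
```

The sampled locations would then be re-queried to form the dense, uniform point cloud handed to the Gaussian-splatting head.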
5. The Synth4K Benchmark
Benchmarking is conducted on the Synth4K dataset, comprising 4K-resolution frames (3840×2160) from five AAA games with diverse scene geometry and appearance:
| Subset | Game Title |
|---|---|
| Synth4K-1 | Cyberpunk 2077 |
| Synth4K-2 | Marvel’s Spider-Man 2 |
| Synth4K-3 | Miles Morales |
| Synth4K-4 | Dead Island |
| Synth4K-5 | Watch Dogs |
Each subset contains hundreds of frames with varied content (indoor/outdoor, lighting, geometry). A high-frequency mask for evaluating fine-detail performance is constructed by applying multi-scale Laplacian energy operators to the depth map, normalizing by the 98th percentile, and sharpening with an exponent; the top-$k$ highest-energy pixels then define the mask regions used for fine-detail evaluation.
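A hedged sketch of this mask construction follows; the Laplacian kernel, scale set, sharpening exponent, and top-k fraction are illustrative placeholders, since the text above does not give their exact values.

```python
import torch
import torch.nn.functional as F

def high_freq_mask(depth: torch.Tensor, scales=(1, 2, 4),
                   exponent: float = 2.0, topk_frac: float = 0.1) -> torch.Tensor:
    """Multi-scale Laplacian energy -> 98th-percentile normalization -> exponent
    sharpening -> top-k selection. All constants here are illustrative."""
    lap = torch.tensor([[0., 1., 0.], [1., -4., 1.], [0., 1., 0.]]).view(1, 1, 3, 3)
    d = depth.float().view(1, 1, *depth.shape)
    energy = torch.zeros_like(d)
    for s in scales:
        ds = F.avg_pool2d(d, s) if s > 1 else d                       # coarser copy
        e = F.conv2d(ds, lap, padding=1).abs()                        # Laplacian energy
        energy = energy + F.interpolate(e, size=depth.shape,
                                        mode="bilinear", align_corners=False)
    energy = energy / (torch.quantile(energy, 0.98) + 1e-8)           # percentile norm
    energy = energy.clamp(max=1.0) ** exponent                        # sharpen
    k = max(1, int(topk_frac * energy.numel()))
    thresh = torch.topk(energy.flatten(), k).values.min()
    return (energy >= thresh).squeeze()                               # boolean (H, W) mask
```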
Metrics employed include:
- Relative depth: threshold accuracy $\delta_1$ and absolute relative error (AbsRel)
- Metric depth: threshold accuracy on absolute-scale predictions (see the sketch after this list)
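Minimal reference implementations of the two standard quantities named above are sketched below; the benchmark's exact alignment step for relative depth (e.g., median or least-squares scaling) is not specified in the text and is omitted.

```python
import torch

def delta1(pred: torch.Tensor, gt: torch.Tensor, thresh: float = 1.25) -> torch.Tensor:
    """Fraction of valid pixels with max(pred/gt, gt/pred) below the threshold."""
    valid = gt > 0
    ratio = torch.maximum(pred[valid] / gt[valid], gt[valid] / pred[valid])
    return (ratio < thresh).float().mean()

def abs_rel(pred: torch.Tensor, gt: torch.Tensor) -> torch.Tensor:
    """Mean absolute relative error |pred - gt| / gt over valid pixels."""
    valid = gt > 0
    return ((pred[valid] - gt[valid]).abs() / gt[valid]).mean()
```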
Zero-shot protocol is enforced: all models are tested without fine-tuning. Test images are input at 504×896; for 4K output, baselines are bilinearly upscaled, whereas InfiniDepth is queried directly at the target resolution.
6. Experimental Results and Significance
Zero-Shot Relative Depth Performance
- On Synth4K, InfiniDepth achieves the best relative-depth scores across all subsets (next best: 84–88%).
- In high-frequency regions representing fine details, InfiniDepth attains the highest accuracy, versus 66.5% for the second-best method.
Real-World Benchmarks
- On real-world datasets such as KITTI, ETH3D, NYUv2, ScanNet, and DIODE, InfiniDepth ranks in the top 3 on all of them and is best on both ETH3D and DIODE.
Metric Depth with Sparse LiDAR
- Synth4K with 1.5k sparse LiDAR points: InfiniDepth+Prompt (Ours-Metric) produces the best accuracy (prior best: 65%); in fine-detail regions it also leads (prior best: 21%).
- Real data: KITTI (63.9% vs 58.3% for PromptDA), ETH3D (96.7% vs 92.8%).
Single-View Novel-View Synthesis
With Infinite Depth Query and a Gaussian-Splatting head, InfiniDepth yields syntheses with fewer holes and artifacts under large viewpoint changes than prior pixel-aligned methods (e.g., ADGaussian), as shown in qualitative examples (Yu et al., 6 Jan 2026).
7. Context and Implications
InfiniDepth replaces conventional grid-based dense prediction with a neural continuous field, directly modeling depth as a continuous function of image coordinates and leveraging multi-scale fusion for localized detail retrieval. This enables sub-pixel supervision and resolution-agnostic inference, with no explicit upsampling or convolution in the final stages. The approach attains state-of-the-art results both on synthetic 4K data and established real-world datasets, particularly excelling in geometric detail recovery. A plausible implication is that this architectural paradigm can generalize to other per-pixel prediction tasks where spatial continuity and resolution flexibility are critical (Yu et al., 6 Jan 2026).