InfiniDepth: Neural Implicit Depth Estimation
- InfiniDepth is a neural architecture that estimates monocular depth using continuous implicit fields, allowing for arbitrary-resolution predictions.
- It employs a multi-scale local implicit decoder to hierarchically fuse features from different resolutions, achieving state-of-the-art accuracy on synthetic and real-world benchmarks.
- The design facilitates high-quality geometric representations for novel-view synthesis and fine-detail recovery, outperforming traditional discrete grid approaches.
InfiniDepth is a neural architecture for monocular depth estimation that models depth as a continuous implicit field, enabling arbitrary-resolution and fine-grained depth prediction. By replacing traditional discrete image-grid prediction with neural implicit fields, InfiniDepth provides scalability to any output resolution, addresses fine-detail recovery, and facilitates high-quality geometric representations suitable for novel-view synthesis. The method introduces a multi-scale local implicit decoder to efficiently query depth at continuous coordinates and demonstrates state-of-the-art results across synthetic and real-world benchmarks with particular strength in high-frequency, detail-rich regions (Yu et al., 6 Jan 2026).
1. Mathematical Formulation: Neural Implicit Depth Function
InfiniDepth establishes depth estimation as regression over a continuous 2D domain. The general form of an implicit field is $v = f_\theta(\mathbf{x})$, where $f_\theta$ denotes a multi-layer perceptron (MLP) that maps a continuous coordinate $\mathbf{x}$ to a value $v$. For monocular depth estimation conditioned on an image $I$, the mapping becomes $d = f_\theta(I, (x, y))$. Here, $f_\theta$ maps an input RGB image $I$ and a continuous coordinate $(x, y)$ to a scalar depth value $d$. The final depth prediction is expressed as $d(x, y) = \mathrm{MLP}(h(x, y))$, where $h(x, y)$ is a locally fused feature vector derived through hierarchical multi-scale aggregation.
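The idea of an image-conditioned implicit field can be illustrated with a few lines of NumPy. This is a minimal sketch with untrained random weights, not the published model: the helper names (`init_mlp`, `implicit_depth`), the toy layer sizes, and the choice to concatenate the coordinate with the feature vector are all assumptions made for illustration.

```python
import numpy as np

def init_mlp(sizes, seed=0):
    """Random, untrained weights for a small MLP (illustration only)."""
    rng = np.random.default_rng(seed)
    return [(rng.normal(0.0, 1.0 / np.sqrt(m), size=(m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def implicit_depth(params, features, coords):
    """Evaluate a scalar field at N continuous coordinates.

    features: (N, C) image-conditioned feature per query (hypothetical input).
    coords:   (N, 2) continuous (x, y) locations in [0, 1].
    """
    h = np.concatenate([features, coords], axis=-1)
    for i, (w, b) in enumerate(params):
        h = h @ w + b
        if i < len(params) - 1:
            h = np.maximum(h, 0.0)  # ReLU on hidden layers only
    return h[:, 0]

params = init_mlp([8 + 2, 16, 1])
# Any real-valued location is a valid query; there is no pixel grid.
coords = np.array([[0.25, 0.5], [0.2501, 0.5], [0.9, 0.1]])
features = np.ones((3, 8))
depth = implicit_depth(params, features, coords)
print(depth.shape)  # (3,)
```

The point of the sketch is only that the output is defined per query coordinate, so the number and placement of queries are decoupled from the input image grid.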
2. Multi-Scale Local Implicit Decoder Architecture
Input Encoding
- Input: RGB image
- Encoding: Processed using a Vision Transformer (ViT-Large, DINOv3)
- Feature Extraction: Token maps from transformer layers 4, 11, and 23 are projected to feature maps at three different resolutions, forming a multi-scale feature pyramid.
Continuous Feature Query and Fusion
- For a continuous query coordinate $(x, y)$, the corresponding coordinate at each scale is obtained by mapping it into that scale's feature-map grid.
- The local neighborhood around each mapped coordinate is bilinearly interpolated to produce one local descriptor per scale.
- Hierarchical fusion combines the per-scale descriptors step by step using three components:
  - A projection into the target channel space
  - A learnable channel-wise gate
  - A two-layer MLP block with expansion factor 4, ReLU activation, and a residual connection
- The final fused feature passes through a 3-layer MLP (input 1024, hidden 256, ReLU activations, terminal ELU) to yield the scalar depth value.
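The query-and-fuse procedure described in this list can be sketched as follows. This is a hedged illustration rather than the authors' implementation: the exact fusion equation is not reproduced in this text, so the form used in `fuse` (project, apply a sigmoid gate, add, then a residual two-layer MLP), the coarse-to-fine order, the align-corners coordinate convention, and the toy channel widths are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def bilinear_sample(fmap, xy):
    """Sample a (H, W, C) feature map at N continuous points xy in [0, 1]^2."""
    h, w, _ = fmap.shape
    # Rescale normalised coordinates to this map's grid (assumed convention).
    x = np.clip(xy[:, 0] * (w - 1), 0, w - 1)
    y = np.clip(xy[:, 1] * (h - 1), 0, h - 1)
    x0 = np.floor(x).astype(int)
    y0 = np.floor(y).astype(int)
    x1 = np.minimum(x0 + 1, w - 1)
    y1 = np.minimum(y0 + 1, h - 1)
    wx = (x - x0)[:, None]
    wy = (y - y0)[:, None]
    top = fmap[y0, x0] * (1 - wx) + fmap[y0, x1] * wx
    bot = fmap[y1, x0] * (1 - wx) + fmap[y1, x1] * wx
    return top * (1 - wy) + bot * wy

def make_fusion_params(c_in, c_out, scale=0.1):
    """Hypothetical parameter set for one fusion step."""
    return {
        "w_proj": rng.normal(0, scale, (c_in, c_out)),
        "gate": np.zeros(c_out),                       # learnable in practice
        "w1": rng.normal(0, scale, (c_out, 4 * c_out)),  # expansion factor 4
        "w2": rng.normal(0, scale, (4 * c_out, c_out)),
    }

def fuse(h_prev, f_next, p):
    """One assumed fusion step: project, gate, add, then residual MLP."""
    proj = h_prev @ p["w_proj"]               # to the next scale's channels
    gate = 1.0 / (1.0 + np.exp(-p["gate"]))   # channel-wise gate in (0, 1)
    z = f_next + gate * proj
    hidden = np.maximum(z @ p["w1"], 0.0)     # ReLU
    return z + hidden @ p["w2"]               # residual connection

def query_fused_feature(pyramid, params, xy):
    """Sample every scale at the same continuous points and fuse in order."""
    h = bilinear_sample(pyramid[0], xy)
    for fmap, p in zip(pyramid[1:], params):
        h = fuse(h, bilinear_sample(fmap, xy), p)
    return h

# Toy three-level pyramid; real widths and resolutions differ.
pyramid = [rng.normal(size=(8, 8, 16)),
           rng.normal(size=(16, 16, 32)),
           rng.normal(size=(32, 32, 64))]
params = [make_fusion_params(16, 32), make_fusion_params(32, 64)]
xy = rng.uniform(size=(5, 2))
fused = query_fused_feature(pyramid, params, xy)
print(fused.shape)  # (5, 64)
```

In this sketch the fused vector would then be passed to the depth head; because every step operates per query point, no upsampling layer is needed.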
3. Training Objective and Loss Functions
Supervision is performed on randomly sampled continuous coordinate–depth tuples, with the primary objective comparing the predicted depth at each sampled coordinate against its ground-truth value. Ground-truth depths are normalized in log-space per image, using the 2nd and 98th percentiles of the image's depth values as reference bounds.
For the monocular model, no additional gradient or smoothness regularization is applied. In the context of novel-view synthesis with the Gaussian-Splatting head, an additional RGB reconstruction loss and an LPIPS perceptual loss are incorporated.
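A minimal sketch of the data side of this training setup is given below. The exact normalization formula is not reproduced in this text, so the affine rescaling in `normalize_log_depth` (2nd percentile to 0, 98th percentile to 1) is an assumption, as are the function names and the nearest-pixel target lookup in `sample_supervision`.

```python
import numpy as np

def normalize_log_depth(depth, lo_pct=2.0, hi_pct=98.0, eps=1e-6):
    """Map depth to log-space so the 2nd/98th percentiles land on 0 and 1.

    The affine form is an assumed reading of the percentile normalisation.
    """
    log_d = np.log(np.maximum(depth, eps))
    lo, hi = np.percentile(log_d, [lo_pct, hi_pct])
    return (log_d - lo) / max(hi - lo, eps)

def sample_supervision(depth_norm, n, rng):
    """Draw n continuous coordinates in [0, 1]^2 with nearest-pixel targets."""
    h, w = depth_norm.shape
    xy = rng.uniform(size=(n, 2))
    cols = np.minimum((xy[:, 0] * w).astype(int), w - 1)
    rows = np.minimum((xy[:, 1] * h).astype(int), h - 1)
    return xy, depth_norm[rows, cols]

rng = np.random.default_rng(0)
depth = np.linspace(1.0, 100.0, 100 * 100).reshape(100, 100)  # toy depth map
depth_norm = normalize_log_depth(depth)
xy, targets = sample_supervision(depth_norm, 4096, rng)
```

Because the coordinates are drawn from a continuous range rather than from pixel indices, each training step can supervise locations between pixel centres, which is what makes sub-pixel supervision possible.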
4. Inference and Arbitrary-Resolution Depth Querying
Depth inference does not require fixed-grid upsampling or convolution, enabling pointwise continuous prediction:
- The input image is encoded at native resolution to generate the multi-scale feature pyramid.
- To produce a depth map at arbitrary resolution (e.g., 3840×2160), each output point is:
  - Mapped to each scale of the pyramid
  - Used to bilinearly interpolate local features
  - Processed through hierarchical fusion into a single fused feature
  - Passed to the MLP head to obtain its depth value
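The resolution-agnostic querying above can be sketched as a grid of continuous coordinates evaluated in chunks. The function names, the pixel-centre convention, and `toy_field` (a stand-in for the trained encoder, fusion, and head) are assumptions for illustration.

```python
import numpy as np

def make_query_grid(out_h, out_w):
    """Return (out_h * out_w, 2) pixel-centre coordinates in [0, 1]^2."""
    xs = (np.arange(out_w) + 0.5) / out_w
    ys = (np.arange(out_h) + 0.5) / out_h
    gx, gy = np.meshgrid(xs, ys)  # both have shape (out_h, out_w)
    return np.stack([gx.ravel(), gy.ravel()], axis=-1)

def render_depth(query_fn, out_h, out_w, chunk=65536):
    """Evaluate a pointwise depth function on the grid, chunk by chunk."""
    xy = make_query_grid(out_h, out_w)
    parts = [query_fn(xy[i:i + chunk]) for i in range(0, len(xy), chunk)]
    return np.concatenate(parts).reshape(out_h, out_w)

def toy_field(xy):
    """Stand-in for the model: maps (N, 2) coordinates to (N,) depths."""
    return 1.0 + xy[:, 0] + 2.0 * xy[:, 1]

low = render_depth(toy_field, 4, 8)
high = render_depth(toy_field, 216, 384)  # same field, denser sampling
print(low.shape, high.shape)  # (4, 8) (216, 384)
```

Changing the output resolution only changes how many coordinates are generated; the field itself is untouched, and chunking keeps memory bounded for large targets such as 3840×2160.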
For novel-view synthesis:
- The implicit field is sampled on a grid, and depths are backprojected to 3D.
- Per-pixel surface area weights are computed from the surface normal (obtained via autograd) and the view vector.
- Surface samples are drawn by inverse-CDF, then re-queried for a dense, uniform point cloud suitable for Gaussian splatting.
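The inverse-CDF sampling step can be sketched as below. Only the sampling mechanics are shown; the weight formula itself is not reproduced here, so the `weights` array is a toy input, and the uniform jitter used to turn pixel indices into continuous query coordinates is an assumption.

```python
import numpy as np

def sample_by_inverse_cdf(weights, n, rng):
    """Draw n flat pixel indices with probability proportional to weights."""
    w = np.asarray(weights, dtype=np.float64).ravel()
    cdf = np.cumsum(w)
    cdf /= cdf[-1]
    u = rng.uniform(size=n)
    # side="right" skips zero-weight pixels, whose CDF step has zero height.
    idx = np.searchsorted(cdf, u, side="right")
    return np.minimum(idx, w.size - 1)

def indices_to_continuous(idx, h, w, rng):
    """Turn flat indices into jittered continuous (x, y) in [0, 1)^2."""
    rows, cols = np.divmod(idx, w)
    jitter = rng.uniform(size=(idx.size, 2))
    x = (cols + jitter[:, 0]) / w
    y = (rows + jitter[:, 1]) / h
    return np.stack([x, y], axis=-1)

rng = np.random.default_rng(0)
weights = np.zeros((4, 4))
weights[1, 2] = 1.0   # flat index 6
weights[3, 0] = 3.0   # flat index 12, sampled about three times as often
idx = sample_by_inverse_cdf(weights, 1000, rng)
xy = indices_to_continuous(idx, 4, 4, rng)
```

The resulting coordinates could then be passed back to the implicit field, so that regions with larger surface-area weight receive proportionally more points.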
5. The Synth4K Benchmark
Benchmarking is conducted on the Synth4K dataset, comprising 4K-resolution frames (3840×2160) from five AAA games with diverse scene geometry and appearance:
| Subset | Game Title |
|---|---|
| Synth4K-1 | Cyberpunk 2077 |
| Synth4K-2 | Marvel’s Spider-Man 2 |
| Synth4K-3 | Miles Morales |
| Synth4K-4 | Dead Island |
| Synth4K-5 | Watch Dogs |
Each subset contains hundreds of frames with varied content (indoor/outdoor, lighting, geometry). A high-frequency mask for evaluating fine-detail performance is constructed by applying multi-scale Laplacian energy operators to the depth map, normalizing by the 98th percentile, and sharpening the result with an exponent. The top-k highest-energy pixels define the mask regions used for detailed evaluation.
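One way such a mask could be built is sketched below. The specific operator (a 5-point Laplacian with dilated neighbour offsets), the scales, the exponent `gamma`, and the retained fraction `top_frac` are hypothetical choices, since the benchmark's exact settings are not given in this text.

```python
import numpy as np

def laplacian_energy(depth, step):
    """Absolute 5-point Laplacian with neighbours `step` pixels away."""
    p = np.pad(depth, step, mode="edge")
    centre = p[step:-step, step:-step]
    lap = (p[:-2 * step, step:-step] + p[2 * step:, step:-step]
           + p[step:-step, :-2 * step] + p[step:-step, 2 * step:]
           - 4.0 * centre)
    return np.abs(lap)

def high_frequency_mask(depth, steps=(1, 2, 4), gamma=2.0, top_frac=0.1):
    """Boolean mask of the highest-energy pixels (all settings assumed)."""
    energy = sum(laplacian_energy(depth, s) for s in steps)
    scale = np.percentile(energy, 98)
    energy = (energy / max(scale, 1e-12)) ** gamma   # normalise, then sharpen
    k = max(1, int(top_frac * energy.size))
    thresh = np.partition(energy.ravel(), -k)[-k]    # k-th largest value
    return energy >= thresh

# Toy depth map with a single vertical depth discontinuity.
depth = np.ones((32, 32))
depth[:, 16:] = 5.0
mask = high_frequency_mask(depth)
```

On this toy input the mask concentrates around the depth edge and leaves flat regions unselected, which is the behaviour a fine-detail evaluation region needs. Ties at the threshold are all kept, so the mask can contain slightly more than k pixels.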
Separate metrics are employed for relative depth and for metric depth.
A zero-shot protocol is enforced: all models are tested without fine-tuning. Test images are input at 504×896; for 4K output, baseline predictions are bilinearly upscaled, whereas InfiniDepth is queried directly at the target resolution.
6. Experimental Results and Significance
Zero-Shot Relative Depth Performance
- On Synth4K, InfiniDepth achieves the best scores across all subsets (next best: 84–88%).
- In high-frequency regions representing fine details, InfiniDepth outperforms the second-best method, which reaches 66.5%.
Real-World Benchmarks
- On KITTI, ETH3D, NYUv2, ScanNet, and DIODE, InfiniDepth ranks in the top three on every dataset and is best on both ETH3D and DIODE.
Metric Depth with Sparse LiDAR
- Synth4K with 1.5k sparse LiDAR: InfiniDepth+Prompt (Ours-Metric) surpasses the prior best of 65%; in fine-detail regions, it surpasses the prior best of 21%.
- Real data: KITTI (63.9% vs 58.3% for PromptDA), ETH3D (96.7% vs 92.8%).
Single-View Novel-View Synthesis
With Infinite Depth Query and a Gaussian-Splatting head, InfiniDepth yields syntheses with fewer holes and artifacts under large viewpoint changes than prior pixel-aligned methods (e.g., ADGaussian), as shown in qualitative examples (Yu et al., 6 Jan 2026).
7. Context and Implications
InfiniDepth replaces conventional grid-based dense prediction with a neural continuous field, directly modeling depth as a function of continuous image coordinates and leveraging multi-scale fusion for localized detail retrieval. This enables sub-pixel supervision and resolution-agnostic inference, with no explicit upsampling or convolution in the final stages. The approach attains state-of-the-art results on both synthetic 4K data and established real-world datasets, and is particularly strong at recovering geometric detail. A plausible implication is that this architectural paradigm can generalize to other per-pixel prediction tasks where spatial continuity and resolution flexibility are critical (Yu et al., 6 Jan 2026).