SphereFusion: 360° Depth Estimation
- SphereFusion is an end-to-end framework that combines dual representations—equirectangular images and spherical meshes—to enhance 360° depth estimation.
- The architecture employs a GateFuse module to integrate features adaptively, reducing distortion and preserving fine details.
- Real-time performance is achieved through multi-scale supervision and mesh operation caching, enabling ~60 FPS on high-end GPUs.
SphereFusion is an end-to-end framework for 360° panorama depth estimation that integrates both equirectangular and spherical mesh representations, leveraging a gated fusion mechanism to combine their respective strengths. It is designed to overcome the distortions and detail loss associated with conventional projection, mesh-only, or hybrid patch methods by performing feature fusion directly in the spherical domain. SphereFusion achieves real-time performance with competitive accuracy on standard datasets through architectural innovations such as mesh operation caching and multi-scale supervision (Yan et al., 9 Feb 2025).
1. Methodology Overview
SphereFusion operates on a single 360° RGB panorama, typically in equirectangular format with resolution 512×1024. Its architecture consists of the following sequential stages:
- Dual Representation: The input panorama is simultaneously processed in two domains. In the equirectangular domain, the image is passed through a ResNet-50 2D Convolutional Neural Network (CNN) encoder. In the spherical domain, the panorama is textured onto an icosahedral mesh refined by loop subdivision to a chosen level, associating each mesh face with its corresponding viewing direction in the image.
- Feature Extraction: The 2D-CNN encoder $E_{2D}$ produces five multi-scale feature maps $\{F_i^{2D}\}_{i=1}^{5}$ with progressive channel depths (64 to 2048). Concurrently, a lightweight ResNet-18–style mesh-CNN encoder $E_{S}$ extracts spherical features $\{F_i^{S}\}_{i=1}^{5}$ at the corresponding mesh resolutions.
- Projection and Gated Fusion: Each 2D feature map $F_i^{2D}$ is resampled onto the mesh using the E2S operator, yielding $F_i^{2D \to S}$. At every scale, a GateFuse module fuses $F_i^{2D \to S}$ and $F_i^{S}$, producing combined features $F_i^{\mathrm{fuse}}$.
- Mesh Decoder: A U-Net–style mesh-CNN with skip connections progressively upsamples the fused features to the original mesh resolution via mesh unpooling. The final depth predictions are optionally reprojected to equirectangular format via the S2E operator for evaluation.
SphereFusion employs a multi-scale BerHu regression loss, and at inference, a mesh adjacency caching strategy accelerates processing to approximately 17 ms per 512×1024 panorama (or ≈60 FPS on a single RTX 3090 GPU) (Yan et al., 9 Feb 2025).
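The following is a minimal PyTorch-style structural sketch of this pipeline, assuming injected sub-modules; the class and argument names are illustrative and not identifiers from the authors' code.

```python
import torch.nn as nn

class SphereFusionPipeline(nn.Module):
    """Structural sketch of the dual-branch pipeline described above."""
    def __init__(self, enc2d, mesh_enc, e2s, gate_fuses, mesh_dec):
        super().__init__()
        self.enc2d = enc2d                           # ResNet-50 equirectangular encoder (5 scales)
        self.mesh_enc = mesh_enc                     # ResNet-18-style mesh encoder (5 scales)
        self.e2s = e2s                               # equirect -> sphere resampling per scale
        self.gate_fuses = nn.ModuleList(gate_fuses)  # one GateFuse module per scale
        self.mesh_dec = mesh_dec                     # U-Net-style mesh decoder with unpooling

    def forward(self, pano):
        feats_2d = self.enc2d(pano)                  # five equirectangular feature maps
        feats_sph = self.mesh_enc(pano)              # five spherical (per-face) feature maps
        fused = [gate(self.e2s(f2d, level), f_s)     # project to the mesh, then gate-fuse
                 for level, (gate, f2d, f_s)
                 in enumerate(zip(self.gate_fuses, feats_2d, feats_sph))]
        depths = self.mesh_dec(fused)                # multi-scale mesh depth predictions
        return depths                                # optionally S2E-reprojected for evaluation
```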
2. Mathematical Formulations and Core Operations
Several key mathematical mechanisms underpin SphereFusion:
- Projections:
- Equirectangular to Sphere: For a pixel $(u, v)$ in a $W \times H$ equirectangular image, the longitude and latitude are $\theta = \left(\frac{2u}{W} - 1\right)\pi$ and $\phi = \left(\frac{v}{H} - \frac{1}{2}\right)\pi$, giving the unit viewing direction $(x, y, z) = (\cos\phi\cos\theta,\ \cos\phi\sin\theta,\ \sin\phi)$.
- Mesh Face Center to Spherical Coordinates: A face with vertices $v_1, v_2, v_3$ is assigned the direction of its normalized centroid, $\hat{c} = \frac{v_1 + v_2 + v_3}{\lVert v_1 + v_2 + v_3 \rVert}$.
- E2S Sampling (projecting the equirectangular image onto the mesh): each face-center direction $(x, y, z)$ is sampled at pixel
$\begin{cases} u = \left(1 + \frac{\operatorname{atan2}(y, x)}{\pi}\right)\frac{W}{2} \\ v = \left(0.5 + \frac{\operatorname{atan2}\left(z, \sqrt{x^2+y^2}\right)}{\pi}\right)H \end{cases}$
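A small NumPy sketch of these mappings, round-tripping a pixel through the sphere and back via the E2S formula; the axis convention used here is an assumption for illustration.

```python
import numpy as np

def pixel_to_dir(u, v, W=1024, H=512):
    """Equirectangular pixel (u, v) -> unit direction (x, y, z)."""
    theta = (2.0 * u / W - 1.0) * np.pi      # longitude in [-pi, pi]
    phi = (v / H - 0.5) * np.pi              # latitude in [-pi/2, pi/2]
    return np.array([np.cos(phi) * np.cos(theta),
                     np.cos(phi) * np.sin(theta),
                     np.sin(phi)])

def dir_to_pixel(d, W=1024, H=512):
    """Unit direction -> equirectangular pixel (the E2S sampling formula)."""
    x, y, z = d
    u = (1.0 + np.arctan2(y, x) / np.pi) * W / 2.0
    v = (0.5 + np.arctan2(z, np.hypot(x, y)) / np.pi) * H
    return u, v

# Round trip: a pixel should map back to (approximately) itself.
print(dir_to_pixel(pixel_to_dir(300.0, 200.0)))   # ~ (300.0, 200.0)
```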
- Mesh Operations:
- Mesh Convolution: For each triangle with center feature $f_c$ and neighbor features $f_{n_1}, f_{n_2}, f_{n_3}$ from its three face-adjacent faces, the convolution computes $f'_c = W_0 f_c + \sum_{i=1}^{3} W_i f_{n_i}$.
- Mesh Pooling / Unpooling: Mesh pooling aggregates the features of grouped triangles; mesh unpooling applies loop subdivision for upsampling.
- Adjacency Caching: By caching Face-Adjacent-Face (FAF) connectivity across all mesh subdivision levels during initialization, adjacency queries at runtime become precomputed table lookups, eliminating the repeated neighborhood searches that mesh convolution and unpooling would otherwise perform at every layer (a sketch follows below).
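A minimal sketch of FAF caching and a mesh convolution that consumes the cached table, assuming per-face features stored as an $[F, C]$ tensor; the data layout is an assumption rather than the authors' implementation.

```python
import torch
import torch.nn as nn

def build_faf_cache(faces):
    """Precompute, once per subdivision level, the 3 face-adjacent faces of
    every triangle (faces: LongTensor [F, 3] of vertex indices)."""
    edge_to_faces = {}
    for f, (a, b, c) in enumerate(faces.tolist()):
        for e in [(a, b), (b, c), (c, a)]:
            edge_to_faces.setdefault(frozenset(e), []).append(f)
    nbrs = [[] for _ in range(len(faces))]
    for fs in edge_to_faces.values():
        if len(fs) == 2:                       # closed mesh: every edge shared by 2 faces
            nbrs[fs[0]].append(fs[1])
            nbrs[fs[1]].append(fs[0])
    return torch.tensor(nbrs)                  # [F, 3] neighbor indices, reused each forward pass

class MeshConv(nn.Module):
    """f'_c = W0 f_c + sum_i Wi f_{n_i}, realized as one linear layer over
    the concatenated center + neighbor features, using the cached FAF table."""
    def __init__(self, c_in, c_out):
        super().__init__()
        self.lin = nn.Linear(4 * c_in, c_out)

    def forward(self, feats, faf):              # feats: [F, C], faf: [F, 3]
        neigh = feats[faf]                       # gather neighbor features: [F, 3, C]
        return self.lin(torch.cat([feats, neigh.flatten(1)], dim=1))
```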
- GateFuse Module: At each scale, the spherical features $F_i^{S}$ and the projected equirectangular features $F_i^{2D \to S}$ are fused via learned reset and update gates (a GRU-style sketch is given after this item).
The gating mechanism adaptively balances the contributions of mesh-based and image-based features.
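The source summary does not reproduce the exact gate equations; the sketch below is one plausible, GRU-style reading of learned reset and update gates over the two feature streams, and the true parameterization in SphereFusion may differ.

```python
import torch
import torch.nn as nn

class GateFuseSketch(nn.Module):
    """GRU-style fusion of mesh features f_s and projected image features f_e."""
    def __init__(self, c):
        super().__init__()
        self.update = nn.Linear(2 * c, c)   # z: how much projected-image signal to admit
        self.reset = nn.Linear(2 * c, c)    # r: how much mesh signal feeds the candidate
        self.cand = nn.Linear(2 * c, c)

    def forward(self, f_e, f_s):            # both [num_faces, C]
        zr_in = torch.cat([f_e, f_s], dim=-1)
        z = torch.sigmoid(self.update(zr_in))
        r = torch.sigmoid(self.reset(zr_in))
        h = torch.tanh(self.cand(torch.cat([f_e, r * f_s], dim=-1)))
        return (1 - z) * f_s + z * h         # adaptive blend of the two streams
```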
- Loss Function: The BerHu (reverse Huber) regression loss at the finest scale is
$\mathcal{L}_{\mathrm{BerHu}}(d, \hat{d}) = \begin{cases} |d - \hat{d}|, & |d - \hat{d}| \le c \\ \frac{(d - \hat{d})^2 + c^2}{2c}, & |d - \hat{d}| > c \end{cases}$
with threshold $c$, typically set as a fixed fraction of the largest absolute residual in the batch. Multi-scale supervision applies the loss across the decoder scales:
$\mathcal{L} = \sum_{s} \lambda_s \, \mathcal{L}_{\mathrm{BerHu}}^{(s)}$
where $\lambda_s$ are scale weights.
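A compact PyTorch sketch of the BerHu term and its multi-scale sum; setting the threshold to a fraction of the maximum residual is a common convention and an assumption here, not a value quoted from the paper.

```python
import torch

def berhu(pred, gt, ratio=0.2):
    """Reverse-Huber (BerHu) loss; c = ratio * max|pred - gt| (assumed convention)."""
    diff = (pred - gt).abs()
    c = (ratio * diff.max()).clamp(min=1e-8)
    l2 = (diff ** 2 + c ** 2) / (2 * c)
    return torch.where(diff <= c, diff, l2).mean()

def multiscale_berhu(preds, gts, weights):
    """Weighted sum of per-scale BerHu terms, lambda_s as in the text."""
    return sum(w * berhu(p, g) for p, g, w in zip(preds, gts, weights))
```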
3. Architectural Configuration
- Encoders: The image branch leverages a ResNet-50 backbone, producing multi-scale feature maps at channel sizes 64, 256, 512, 1024, and 2048. A lightweight decode head standardizes channel numbers to match those of the mesh branch: 64, 64, 128, 256, and 512.
- Mesh Branch: A ResNet-18–style mesh CNN processes the icosahedral mesh through five subdivision levels, yielding features at the corresponding mesh resolutions (a level-$l$ subdivision of the icosahedron contains $20 \cdot 4^{l}$ triangular faces).
- Feature Fusion & Decoder: After GateFuse, the fused spherical feature and the mesh encoder’s native feature (both channels) are concatenated into $2C$-channel inputs for each mesh-unpooling stage in the decoder. Multi-scale depth outputs at each level support deep supervision.
- Parameter and FLOP Count: Total parameter count ≈38M; operations per panorama (512×1024) ≈120 GFLOPs. Mesh adjacency caching reduces per-layer overhead by over 50%.
| Branch/Module | Parameters (M) | Function |
|---|---|---|
| Image CNN/RN50 | 25 | Equirectangular feature encoding |
| Mesh CNN/RN18 | 11 | Spherical mesh feature encoding |
| Fusion+Decoder | 2 | Fusion/mesh-decoder, upsampling |
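For reference, the figures above can be gathered into a single hypothetical configuration dictionary; the keys are illustrative, not the authors' configuration names.

```python
def faces_at_level(l: int) -> int:
    """Triangle count of an icosahedron after l loop subdivisions."""
    return 20 * 4 ** l

SPHEREFUSION_CFG = {
    "input_resolution": (512, 1024),               # equirectangular H x W
    "image_encoder": {                             # ResNet-50 branch
        "channels": [64, 256, 512, 1024, 2048],
        "decode_head_channels": [64, 64, 128, 256, 512],
    },
    "mesh_encoder": {                              # ResNet-18-style mesh branch
        "levels": 5,                               # icosahedral subdivision levels
        "channels": [64, 64, 128, 256, 512],
    },
    "decoder": {"skip_connections": True, "multi_scale_outputs": True},
    "params_million": {"image_cnn": 25, "mesh_cnn": 11, "fusion_decoder": 2},
}
```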
4. Complexity Analysis and Inference Runtime
SphereFusion exhibits competitive efficiency relative to baseline approaches. The use of mesh operation caching is critical in enabling high-throughput inference, resulting in approximately 17 ms per panorama image (512×1024), which corresponds to ~60 FPS on a high-end GPU. In comparative terms:
- SphereDepth (mesh-only) requires ~61 ms per image.
- BiFuse/UniFuse hybrid patch methods span 25–80 ms.
- PanoFormer approaches 125 ms per image.
The principal efficiency gain arises from the reuse of mesh adjacency lists, minimizing redundant computation during mesh convolutions and unpooling. The combined parameter and FLOP footprint further positions SphereFusion as a scalable approach for large-scale or real-time applications.
5. Benchmark Evaluation
SphereFusion demonstrates state-of-the-art or near–state-of-the-art quantitative and qualitative performance on major 360° depth datasets:
| Dataset | MRE | MAE | RMSE | RMSE(log) | δ₁ | Inference Time (s/im) |
|---|---|---|---|---|---|---|
| Stanford2D3D | 0.0899 | 0.1654 | 0.3194 | 0.0611 | 0.9257 | 0.017 |
| Matterport3D | 0.1145 | 0.2852 | 0.4885 | 0.0733 | — | 0.017 |
| 360D | 0.0417 | 0.0894 | 0.1813 | 0.0286 | — | 0.015 |
SphereFusion’s accuracy metrics (Mean Relative Error (MRE), Mean Absolute Error (MAE), RMSE, RMSE(log), and δₙ accuracy thresholds) are at or near the leading edge on each dataset. In qualitative comparisons:
- Equirectangular or cubemap-only baselines suffer from distortion and discontinuity, particularly near the poles.
- Mesh-only models (e.g., SphereDepth) retain global room geometry but degrade at object boundaries.
- SphereFusion’s GateFuse module combines the distortion-free layout of mesh representation with high-frequency texture detail from image-based convolutions, producing finer boundaries and fewer artifacts at patch borders.
- Point-cloud reconstructions (e.g., via Meshlab) present fewer holes and crisper geometry than tangent-patch or mesh-only fusion strategies (Yan et al., 9 Feb 2025).
6. Relevance, Implications, and Context
SphereFusion unifies practices previously pursued independently—mesh-based and equirectangular-based depth estimation—by explicit fusion in the spherical domain. The approach systematically addresses the prominent challenges in panoramic depth estimation: projection-induced distortion, seam discontinuity, and detail loss. Its efficiency is further enhanced by algorithmic innovations in mesh processing, supporting real-time deployment in robotics and navigation scenarios.
A plausible implication is that similar fusion strategies—melding complementary geometric and image representations with learnable gating—can generalize to related inverse problems on many non-Euclidean signal domains.
7. Related Work and Distinctions
SphereFusion builds upon and surpasses mesh-only (SphereDepth), patch-based hybrid (BiFuse/UniFuse), and transformer-based (PanoFormer) baselines through its combination of:
- Direct gating-based feature integration in the spherical domain,
- Multi-scale mesh and image encoding with supervision,
- Systematic mesh operation acceleration via caching.
Its performance suggests a new paradigm for 360° understanding, shifting focus from exclusively mesh or convolutional perspectives to hybrid architectures with domain-aware fusion (Yan et al., 9 Feb 2025).