Cross360: 360° Depth Estimation via Cross-Attention
- Cross360 is a cross-attention-based deep learning architecture that fuses global equirectangular (ERP) context with distortion-free tangent patch features for 360° depth estimation.
- It addresses challenges like spherical distortion and spatial discontinuity using cross-projection feature alignment and progressive multi-scale feature aggregation.
- By leveraging Cross Projection Feature Alignment (CPFA) and Progressive Feature Aggregation with Attention (PFAA), Cross360 achieves state-of-the-art performance on benchmarks such as Matterport3D, Structured3D, and 3D60 with significant error reductions.
Cross360 refers to a cross-attention-based deep learning architecture for monocular 360° depth estimation that addresses the fundamental challenges of preserving global continuity and mitigating distortion in spherical images. By leveraging cross-projection feature alignment and progressive feature aggregation, Cross360 achieves state-of-the-art results on both real and synthetic 360° vision benchmarks, significantly outperforming previous methods that fuse local and global representations across projections (Huang et al., 24 Jan 2026).
1. Motivation and Problem Formulation
Monocular 360° depth estimation requires predicting per-pixel depth from a single panoramic image, typically in equirectangular projection (ERP). Existing representations for panoramic images either suffer from global spatial discontinuity (e.g., cube maps, tangent/local patches) or significant spherical distortion, especially near the poles (ERP). Methods that fuse ERP with local projections, such as tangent patches, typically use local patch features with limited global awareness and encounter feature misalignment at patch boundaries. Cross360 addresses these limitations by integrating global ERP context and local distortion-free patch detail through cross-attention-based feature alignment and multi-scale progressive feature fusion (Huang et al., 24 Jan 2026).
2. Network Architecture and Input Representations
Cross360 employs an encoder–decoder network with dual input branches:
- Equirectangular Image (ERP) Branch: Processes a 3×H×W ERP image using (a) a convolutional block at full resolution for fine detail and (b) a ResNet-34 backbone (downsampled) to produce hierarchical feature maps at S=5 scales.
- Tangent Patch (TP) Branch: Samples patches from the input ERP via gnomonic projection (fixed 72° FoV, non-uniform along latitude), each patch mapped to a tangent plane. Local features are computed via a shared embedding layer followed by multi-head self-attention, producing a distortion-free feature map for each patch.
Features from both branches are fused at each decoder scale via the Cross Projection Feature Alignment (CPFA) module. ERP captures global 360° context but is distorted, while TP offers local distortion-free features; Cross360 ensures both representations inform the depth estimation at all spatial locations (Huang et al., 24 Jan 2026).
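The gnomonic sampling used by the TP branch can be sketched as below. This is an illustrative numpy implementation of the standard forward and inverse gnomonic projection, not code from the Cross360 repository; the patch center, FoV, and resolution arguments are assumptions for demonstration.

```python
import numpy as np

def gnomonic_project(lat, lon, lat0, lon0):
    """Forward gnomonic projection of sphere points (lat, lon) onto the
    plane tangent to the sphere at (lat0, lon0). Angles in radians."""
    cos_c = (np.sin(lat0) * np.sin(lat)
             + np.cos(lat0) * np.cos(lat) * np.cos(lon - lon0))
    x = np.cos(lat) * np.sin(lon - lon0) / cos_c
    y = (np.cos(lat0) * np.sin(lat)
         - np.sin(lat0) * np.cos(lat) * np.cos(lon - lon0)) / cos_c
    return x, y

def erp_to_patch_grid(lat0, lon0, fov_deg=72.0, res=16):
    """Inverse gnomonic projection: maps a res x res tangent-plane grid to
    (lat, lon) pairs, which are then used to bilinearly sample the ERP."""
    half = np.tan(np.deg2rad(fov_deg / 2.0))
    xs = np.linspace(-half, half, res)
    x, y = np.meshgrid(xs, xs)
    rho = np.hypot(x, y)
    c = np.arctan(rho)                      # angular distance from center
    rho = np.where(rho == 0, 1e-12, rho)    # avoid 0/0 at the patch center
    lat = np.arcsin(np.cos(c) * np.sin(lat0)
                    + y * np.sin(c) * np.cos(lat0) / rho)
    lon = lon0 + np.arctan2(x * np.sin(c),
                            rho * np.cos(lat0) * np.cos(c)
                            - y * np.sin(lat0) * np.sin(c))
    return lat, lon
```

Sampling an ERP image through `erp_to_patch_grid` and splatting features back through `gnomonic_project` gives the ERP2TP/TP2ERP mapping described in the next section.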
3. Cross Projection Feature Alignment (CPFA) and Progressive Feature Aggregation
Cross Projection Feature Alignment (CPFA)
At every decoder scale s:
- ERP2TP and TP2ERP: TP patches are generated by projecting ERP pixels to tangent planes using gnomonic projection (with explicit geometric mapping between ERP and TP planes); after local feature processing, patch features are splatted back to ERP via inverse mapping and bilinear interpolation.
- Cross-Attention Alignment: Let F_ERP^s and F_TP^s denote the ERP and TP features at scale s. Cross-attention is computed by forming queries from TP features and keys/values from ERP features: Attn(Q, K, V) = softmax(QKᵀ/√d_k)·V, with Q projected from F_TP^s and K, V projected from F_ERP^s.
Aligned features are re-projected, aggregated over all tangent patches, and fused with ERP and skip-connection features.
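The cross-attention step can be sketched in numpy as follows. The token shapes, head count (single head here), and the random projection matrices are illustrative only; in the real model the projections are learned.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # numerically stable softmax
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(tp_feats, erp_feats, d_k=32, rng=None):
    """Queries from tangent-patch tokens, keys/values from ERP tokens.
    tp_feats: (N_tp, C), erp_feats: (N_erp, C). Returns (N_tp, d_k)."""
    rng = np.random.default_rng(0) if rng is None else rng
    C = tp_feats.shape[1]
    # Learned linear projections in the real model; random for illustration.
    W_q = rng.standard_normal((C, d_k)) / np.sqrt(C)
    W_k = rng.standard_normal((C, d_k)) / np.sqrt(C)
    W_v = rng.standard_normal((C, d_k)) / np.sqrt(C)
    Q, K, V = tp_feats @ W_q, erp_feats @ W_k, erp_feats @ W_v
    A = softmax(Q @ K.T / np.sqrt(d_k))       # (N_tp, N_erp) attention map
    return A @ V                              # ERP context aligned to TP tokens
```

Each TP token thus attends over the full ERP feature map, which is how global 360° context reaches the distortion-free patch features.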
Progressive Feature Aggregation with Attention (PFAA)
PFAA fuses multi-scale decoder outputs through channel attention and stage-wise aggregation: at each scale s, channel-attention-reweighted features are combined with the upsampled aggregate from the coarser scale.
This design enables refinement of depth predictions holistically across multiple scales (Huang et al., 24 Jan 2026).
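One way to read the PFAA design is a squeeze-and-excitation-style channel reweighting followed by stage-wise addition; the sketch below is an interpretation of the described mechanism under that assumption, not the paper's exact module.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def channel_attention(feat, W1, W2):
    """Squeeze-and-excitation-style channel reweighting.
    feat: (C, H, W); W1: (C//r, C); W2: (C, C//r)."""
    pooled = feat.mean(axis=(1, 2))               # squeeze: (C,) global pool
    gate = sigmoid(W2 @ np.maximum(W1 @ pooled, 0))  # per-channel gates in (0,1)
    return feat * gate[:, None, None]

def pfaa_step(feat_s, agg_coarser, W1, W2):
    """One PFAA stage: reweight scale-s features by channel attention, then
    add the (already upsampled) aggregate from the coarser scale."""
    return channel_attention(feat_s, W1, W2) + agg_coarser
```

Iterating `pfaa_step` from the coarsest to the finest decoder scale yields the progressively aggregated features from which depth is predicted.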
4. Loss Functions, Training Regimen, and Implementation
Training optimizes multi-scale depth and gradient losses with the following objectives:
- Depth Loss: Multi-scale mean-squared error or reverse Huber (BerHu) loss (dataset-dependent). The BerHu loss on a residual x = d − d̂ is B(x) = |x| for |x| ≤ c and (x² + c²)/(2c) otherwise, with the threshold c set adaptively per batch.
- Gradient Loss: Penalizes differences in depth gradients to encourage spatial smoothness: L_grad = mean(|∂_x d − ∂_x d̂| + |∂_y d − ∂_y d̂|).
- Total Loss: a weighted sum, L_total = L_depth + λ·L_grad.
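The loss terms above can be sketched in numpy as follows. The adaptive BerHu threshold uses the common convention c = 0.2·max|residual|, and the weighting λ is an illustrative assumption, not the paper's value.

```python
import numpy as np

def berhu_loss(pred, gt):
    """Reverse Huber (BerHu): L1 below threshold c, scaled L2 above it."""
    x = np.abs(pred - gt)
    c = max(0.2 * x.max(), 1e-8)                  # adaptive threshold per batch
    return np.where(x <= c, x, (x**2 + c**2) / (2 * c)).mean()

def gradient_loss(pred, gt):
    """Mean absolute difference of horizontal and vertical depth gradients."""
    dx = np.abs(np.diff(pred, axis=1) - np.diff(gt, axis=1)).mean()
    dy = np.abs(np.diff(pred, axis=0) - np.diff(gt, axis=0)).mean()
    return dx + dy

def total_loss(pred, gt, lam=0.5):                # lam is illustrative
    return berhu_loss(pred, gt) + lam * gradient_loss(pred, gt)
```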
Training details include an ImageNet-pretrained ResNet-34 backbone, the Adam optimizer, a five-scale feature hierarchy, and tangent patches at resolutions of 8×8, 16×16, and 24×24 (Huang et al., 24 Jan 2026). Code and pretrained models are available at https://github.com/huangkun101230/Cross360.
5. Quantitative and Qualitative Performance
Cross360 demonstrates improved performance across standard 360° monocular depth estimation benchmarks—Matterport3D (M3D), Structured3D, 3D60—using Abs Rel, Sq Rel, RMSE, and δ accuracy (e.g., δ < 1.25) as metrics. Representative results (Abs Rel, lower is better):
| Dataset | Cross360 Abs Rel | Prev. Best (Method) | Error Reduction |
|---|---|---|---|
| Matterport3D | 0.0955 | 0.1039 (SGFormer) | 8% |
| Structured3D | 0.0361 | 0.0520 (GLPanoDepth) | 31% |
| 3D60 | 0.0526 | 0.0617 (PanoFormer) | 15% |
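The standard metrics used in the table above can be computed as in the following sketch (depth clamping and dataset-specific valid-pixel rules are omitted for brevity):

```python
import numpy as np

def depth_metrics(pred, gt):
    """Standard monocular depth metrics over valid pixels (gt > 0)."""
    mask = gt > 0
    p, g = pred[mask], gt[mask]
    abs_rel = np.mean(np.abs(p - g) / g)          # mean relative L1 error
    sq_rel = np.mean((p - g) ** 2 / g)            # mean relative squared error
    rmse = np.sqrt(np.mean((p - g) ** 2))
    ratio = np.maximum(p / g, g / p)
    delta1 = np.mean(ratio < 1.25)                # fraction within 25% of gt
    return {"abs_rel": abs_rel, "sq_rel": sq_rel,
            "rmse": rmse, "delta1": delta1}
```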
Qualitatively, Cross360 recovers thin structures (e.g., handrails, window frames), maintains room layout coherence, and reduces pole artifacts compared to alternative fusion methods. Limitations are observed in long-range depth estimation (beyond roughly 8 m) due to fewer training pixels at such depths (Huang et al., 24 Jan 2026).
Ablation studies show the importance of both CPFA (≈19% Abs Rel reduction) and PFAA (≈12% reduction) relative to baselines. Patch count controls a speed–accuracy trade-off: fewer patches (10, 120° FoV) increase throughput (∼15 FPS) at reduced accuracy, while more patches (46, 60° FoV) give higher accuracy (∼6 FPS); an intermediate setting with 72° FoV patches, as in the main model, is used as the most balanced configuration.
6. Comparison with Related Approaches
Cross360 is situated among a family of cross-projection fusion methods for omnidirectional depth estimation, including BiFuse, UniFuse, OmniFusion, HRDFuse, and Elite360D. Key comparative points:
- Cross360 vs. BiFuse/UniFuse: Cross360 employs cross-attention for fine-grained feature alignment between distortion-free local patches (TP) and globally continuous (but distorted) ERP, whereas BiFuse/UniFuse utilize geometric re-projection and rely on local ERP pixel-patch correspondences, which may lack global awareness in patch features (Ai et al., 2024).
- Cross360 vs. Elite360D: Elite360D replaces tangent patches with ICOSAP (icosahedron-subdivision point sets), exploits both semantic- and distance-aware dual attention, and achieves similar or improved accuracy with less computational cost (≈1M parameter overhead), but does not incorporate geometric re-projection. Results suggest both CPFA-based and ICOSAP-based fusion effectively handle the distortion–continuity duality in 360° images, but through distinct strategies (Ai et al., 2024).
7. Limitations, Extensions, and Reproducibility
Cross360 retains certain constraints common to projection-fusion methods. At very long depth ranges, estimation quality decreases due to sparse per-pixel supervision. ERP-Tangent fusion requires tuning patch count and field-of-view according to image resolution and pole coverage. The method is robust to missing ERP regions (e.g., incomplete vertical FoV), for which patch count is adaptively reduced.
Code and pretrained models for Cross360 are public, facilitating reproducibility and comparison. For future research, combining CPFA with more expressive attention mechanisms (e.g., as in Elite360D’s B2F module or semantic graph construction) is a plausible direction. Expanding datasets and evaluation metrics for full-sphere panoramas will further benefit the assessment of such architectures (Huang et al., 24 Jan 2026, Ai et al., 2024).