
DP-ER: Dual-Projection Fusion for 360° & LiDAR

Updated 10 November 2025
  • DP-ER is a fusion paradigm that unifies multiple geometric projections to overcome the limitations of single-stream models in capturing both local detail and global context.
  • It leverages complementary strengths by aligning learned features from projections such as equirectangular and cubemap for 360° imagery, or spherical and BEV for LiDAR data.
  • Empirical results demonstrate significant improvements in depth estimation and semantic segmentation benchmarks, while maintaining computational efficiency.

Dual-Projection Fusion (DP-ER) is a paradigm for integrating complementary representations from two geometric projections of a signal—such as equirectangular and cubemap or spherical mesh projections for 360° imagery, or spherical and bird’s-eye view projections for LiDAR data—in a single neural model. The objective is to leverage the distinct advantages of each projection—local detail, geometric fidelity, global context, or lack of distortion—by fusing their learned features or predictions, thereby overcoming the limitations inherent to any single projection stream.

1. Motivation and Theoretical Foundations

The theoretical impetus for Dual-Projection Fusion lies in the trade-off between local high-frequency detail and global geometric consistency. Equirectangular projection (ERP) of panoramic imagery is convenient for 2D CNNs but suffers severe distortion near the poles; cubemap or icosahedral mesh projections alleviate this distortion at the cost of continuity (face seams) or locality. For 3D point clouds, spherical projection provides angular structure, while Cartesian bird’s-eye view (BEV) projections preserve planar relationships and height information.

Single-projection models have demonstrated systematic weaknesses: ERP-based approaches miss large-FoV context due to limited receptive field and pole distortion; mesh-based or BEV models often lack spatially localized features. DP-ER unifies these complementary perspectives, yielding richer feature representations and more accurate dense predictions, as shown by consistent statistical improvements in depth estimation and semantic segmentation benchmarks (Ai et al., 25 Mar 2024, Alnaggar et al., 2020, Yan et al., 9 Feb 2025, Wang et al., 2022).

2. Network Architectures for Dual-Projection Fusion

DP-ER instantiations employ two parallel branches corresponding to the chosen projections. Typical pipelines include:

  • 360° Depth Estimation:
    • ERP/2D Branch: Processes equirectangular panoramas via standard CNN or Transformer backbones (e.g., ResNet-34, ResNet-50), yielding spatial feature maps $F^E \in \mathbb{R}^{h \times w \times C}$.
    • Secondary Branch: Processes an alternative projection, such as:
      • Cubemap: Six $w \times w$ face images processed by a CNN, with features aligned via the Cube-to-Equirectangular (C2E) mapping.
      • Icosahedral Mesh (ICOSAP) or Spherical Mesh: A point set (icosahedron face centers) or subdivided mesh processed by mesh-based encoders (e.g., ResNet-18 with mesh convolutions, or a Point Transformer), yielding features $F^I \in \mathbb{R}^{N \times C}$.
  • 3D LiDAR Semantic Segmentation:
    • Spherical Branch: Projects the point cloud onto an $H \times W$ range image (channels: $x, y, z, r$, remission), processed by a lightweight CNN (e.g., MobileNetV2).
    • BEV Branch: Projects the point cloud onto a 2D ground-plane grid (channels: $x, y, z$, remission), processed by a small U-Net.

Fused features are then decoded by shared or specialized decoders (e.g., UNet-style, mesh-UNet) to output the desired prediction (depth map, per-point labels); a generic skeleton of this two-branch layout is sketched after the table below.

Projection 1       | Projection 2     | Encoder Type(s)              | Typical Decoder
ERP (360° image)   | Cubemap / ICOSAP | ResNet / PointNet / MeshConv | UNet / mesh-UNet
Spherical image    | BEV grid         | MobileNet / U-Net            | U-Net
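
To make the two-branch layout concrete, the following is a minimal PyTorch sketch of a generic DP-ER pipeline. The class name, the `align_b_to_a` resampling hook, and the choice to pass encoders, fusion, and decoder as constructor arguments are illustrative assumptions rather than the architecture of any specific paper cited above.

```python
import torch.nn as nn

class DualProjectionNet(nn.Module):
    """Generic DP-ER skeleton: two projection-specific encoders, a geometric
    alignment step, a fusion module, and a shared dense-prediction decoder."""

    def __init__(self, enc_a, enc_b, align_b_to_a, fuse, decoder):
        super().__init__()
        self.enc_a = enc_a                # e.g. ERP CNN or spherical range-image CNN
        self.enc_b = enc_b                # e.g. cubemap CNN, ICOSAP mesh encoder, or BEV U-Net
        self.align_b_to_a = align_b_to_a  # geometric resampling (e.g. C2E) onto branch A's grid
        self.fuse = fuse                  # gating / attention / concatenation module
        self.decoder = decoder            # e.g. UNet-style decoder

    def forward(self, x_a, x_b):
        f_a = self.enc_a(x_a)                     # features in projection A
        f_b = self.align_b_to_a(self.enc_b(x_b))  # branch-B features re-projected onto A
        return self.decoder(self.fuse(f_a, f_b))  # dense prediction (depth map / labels)
```

In practice the fusion is typically applied at several encoder depths rather than once at the bottleneck, as described in the next section.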

3. Fusion Mechanisms: Mathematical Formulations and Attention

A defining attribute of DP-ER is the cross-projection fusion module, which aligns and merges features from the two branches, often at multiple network depths.

Fusion strategies include:

  • Post-hoc Soft Voting (LiDAR): After running both branches to per-point predictions, per-point softmax scores from each projection are summed to yield final class assignment:

$$
S^\text{final}_i = S^\text{sph}_i + S^\text{bev}_i, \qquad \hat{y}_i = \arg\max_c\, S^\text{final}_i[c]
$$

where $S^\text{sph}_i$ and $S^\text{bev}_i$ are the softmax scores at the pixel onto which point $i$ projects in each view (Alnaggar et al., 2020); this voting is sketched in code after this list.

  • Gated or Attention-based Feature Fusion (Depth Estimation):
    • Bi-projection Bi-attention Fusion (B2F): Combines semantic-aware and distance-aware attention (the semantic-aware branch is sketched in code after this list):

      $$
      \begin{aligned}
      Q^S_{i,j} &= F^E_{i,j} W^S_Q, \quad K^S = F^I W^S_K, \quad V^S = F^I W^S_V \\
      A^S_{i,j} &= \mathrm{softmax}\!\left(\frac{Q^S_{i,j}\,(K^S)^\top}{\sqrt{d}}\right) \\
      F^{SA}_{i,j} &= A^S_{i,j}\, V^S \\
      &\quad \text{(distance-aware attention computed analogously, yielding } F^{DA}\text{)} \\
      F^{GL} &= g^{SA} \odot F^{SA} + g^{DA} \odot F^{DA}
      \end{aligned}
      $$

      with gating terms $g^{SA}$, $g^{DA}$ derived from the fused features (Ai et al., 25 Mar 2024).
    • GateFuse: At each mesh triangle (SphereFusion), concatenate the mesh and re-projected CNN features and compute gates $r, z$ via $1 \times 1$ convolutions followed by a sigmoid (sketched in code after this list):

      $$
      F^\text{fused}_i = r \odot F^\text{sph}_i + z \odot \widehat{F}^\text{equi}_i
      $$

      (Yan et al., 9 Feb 2025).
    • Residual Correction Fusion: Concatenate aligned feature tensors, update each branch with residuals predicted from both branches, and fuse the features into a UNet decoder:

      $$
      f_{\rm equi}' = f_{\rm equi} + \Delta_{\rm equi}, \qquad f_{\rm cube}' = \text{E2C}\big(\text{C2E}(f_{\rm cube}) + \Delta_{\rm cube}\big)
      $$

      $$
      f_{\rm fuse} = H_f\big([\,f_{\rm equi},\ \text{C2E}(f_{\rm cube})\,]\big)
      $$

      (Wang et al., 2022).
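
As an illustration of the attention-based variant, below is a minimal PyTorch sketch of the semantic-aware branch of B2F-style fusion, in which flattened ERP features attend over ICOSAP point features. The single-head formulation and tensor layout are simplifying assumptions; the distance-aware branch and the gated combination into $F^{GL}$ are omitted.

```python
import torch
import torch.nn as nn

class SemanticCrossAttention(nn.Module):
    """ERP-to-ICOSAP cross-attention (semantic-aware branch), single head."""

    def __init__(self, dim: int):
        super().__init__()
        self.w_q = nn.Linear(dim, dim)   # plays the role of W_Q^S (applied to ERP features)
        self.w_k = nn.Linear(dim, dim)   # W_K^S (applied to ICOSAP features)
        self.w_v = nn.Linear(dim, dim)   # W_V^S (applied to ICOSAP features)
        self.scale = dim ** -0.5

    def forward(self, f_erp: torch.Tensor, f_ico: torch.Tensor) -> torch.Tensor:
        # f_erp: (B, H*W, C) flattened ERP features; f_ico: (B, N, C) point features
        q = self.w_q(f_erp)                                               # (B, H*W, C)
        k = self.w_k(f_ico)                                               # (B, N, C)
        v = self.w_v(f_ico)                                               # (B, N, C)
        attn = torch.softmax(q @ k.transpose(1, 2) * self.scale, dim=-1)  # (B, H*W, N)
        return attn @ v                                                   # F^SA: (B, H*W, C)
```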

These mechanisms adaptively weight projection streams at each spatial location or mesh triangle, guided by semantic similarity, spatial proximity, or learned global context.
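
The soft-voting and GateFuse strategies above are simpler still; a minimal sketch follows, assuming equal channel widths and a 2D feature layout for the gating (on mesh triangles, the $1 \times 1$ convolution reduces to a per-element linear gate).

```python
import torch
import torch.nn as nn

def soft_vote(scores_sph: torch.Tensor, scores_bev: torch.Tensor) -> torch.Tensor:
    """Post-hoc soft voting: sum per-point softmax scores from the spherical
    and BEV branches and take the arg-max class. scores_*: (N_points, n_classes)."""
    return (scores_sph + scores_bev).argmax(dim=-1)

class GateFusion(nn.Module):
    """GateFuse-style gating: gates r and z are produced from the concatenated
    features by 1x1 convolutions + sigmoid, then used in a gated sum."""

    def __init__(self, channels: int):
        super().__init__()
        self.gate_r = nn.Sequential(nn.Conv2d(2 * channels, channels, 1), nn.Sigmoid())
        self.gate_z = nn.Sequential(nn.Conv2d(2 * channels, channels, 1), nn.Sigmoid())

    def forward(self, f_sph: torch.Tensor, f_equi: torch.Tensor) -> torch.Tensor:
        # f_sph, f_equi: (B, C, H, W) features from the mesh branch and the
        # re-projected 2D branch, already geometrically aligned
        cat = torch.cat([f_sph, f_equi], dim=1)   # (B, 2C, H, W)
        r, z = self.gate_r(cat), self.gate_z(cat)
        return r * f_sph + z * f_equi             # F_fused = r * F_sph + z * F_equi
```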

4. Projection Transformations and Geometric Alignment

Essential to DP-ER is precise geometric mapping between projections for both data and intermediate feature tensors:

  • For 2D-to-3D Alignment (ERP/icosahedral mesh):

    • ERP pixel $(i, j)$ is mapped to spherical coordinates

      $$
      \theta = 2\pi \frac{j}{W} - \pi, \qquad \phi = \frac{\pi}{2} - \pi \frac{i}{H}
      $$

      and then to Cartesian coordinates $(x, y, z) = (\cos\phi\cos\theta,\ \cos\phi\sin\theta,\ \sin\phi)$.
    • ICOSAP or mesh nodes are represented by their center coordinates $(x_n, y_n, z_n)$.
    • Features or RGB values are interpolated between the ERP grid and the mesh via bilinear / nearest-neighbor sampling (Ai et al., 25 Mar 2024, Yan et al., 9 Feb 2025).

  • For Cubemap–Equirectangular Transform:

    • Standard sphere–face mappings (C2E, E2C) ensure that feature locations correspond across the two projections (Wang et al., 2022).
  • LiDAR Point Cloud Projections:
    • Spherical View: $(x, y, z) \to (\theta, \phi, r)$, discretized to an $H \times W$ range image.
    • BEV: $x, y$ sorted into grid cells; the highest $z$ per cell is kept.

Rigorous projection ensures geometric consistency between the two streams and enables location-aware fusion of features.
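
The mappings above can be expressed compactly in code. The sketch below assumes NumPy arrays; the LiDAR vertical field of view and image resolution are sensor-dependent assumptions, and the azimuth discretization follows the common range-image convention rather than any single cited paper.

```python
import numpy as np

def erp_pixel_to_sphere(i, j, H, W):
    """Map ERP pixel indices (i, j) to spherical angles and unit-sphere
    Cartesian coordinates, following the formulas above."""
    theta = 2.0 * np.pi * j / W - np.pi        # longitude in [-pi, pi)
    phi = np.pi / 2.0 - np.pi * i / H          # latitude in (-pi/2, pi/2]
    xyz = np.stack([np.cos(phi) * np.cos(theta),
                    np.cos(phi) * np.sin(theta),
                    np.sin(phi)], axis=-1)
    return theta, phi, xyz

def lidar_spherical_projection(points, H=64, W=2048, fov_up_deg=3.0, fov_down_deg=-25.0):
    """Project an (N, 3) LiDAR point cloud to range-image pixel indices (u, v)
    and per-point range r; points outside the vertical FoV are clamped."""
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    r = np.linalg.norm(points, axis=1)
    yaw = np.arctan2(y, x)                                          # azimuth in [-pi, pi]
    pitch = np.arcsin(np.clip(z / np.maximum(r, 1e-8), -1.0, 1.0))  # elevation
    fov_up, fov_down = np.radians(fov_up_deg), np.radians(fov_down_deg)
    u = np.clip(0.5 * (1.0 - yaw / np.pi) * W, 0, W - 1).astype(np.int64)
    v = np.clip((1.0 - (pitch - fov_down) / (fov_up - fov_down)) * H, 0, H - 1).astype(np.int64)
    return u, v, r
```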

5. Losses, Training Schemes, and Regularization

DP-ER methods are trained with projection-appropriate losses and domain-specific regularization:

  • 360° Depth Estimation:
    • Supervised: Reverse Huber (berHu) loss over pixels or mesh triangles (a code sketch follows this list).
    • Self-supervised (BiFuse++): Contrast-Aware Photometric Loss (CAPL) weights photometric error by local image contrast and occlusion mask,

    $$
    \mathcal{L}_\text{CAPL}^s = \sum_{p} X_s(p)\, \sigma(I_{t,s}(p))\, \delta(I_{t,s}(p))
    $$

    Regularized by occlusion mask entropy and depth smoothness (Wang et al., 2022).

  • LiDAR Segmentation:

    • Branch losses: Focal + Lovasz-Softmax (spherical), cross-entropy + Lovasz-Softmax (BEV).
    • Fusion: Explicit label fusion based on softmax score addition (Alnaggar et al., 2020).
  • Training Protocols: Stochastic gradient descent or Adam, batch sizes dictated by projection/branch, cosine annealing or cyclical learning rates. Data augmentation typically includes rotations, flips, scaling, and domain-specific distortions in both projection streams.
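
As a reference point for the supervised objective, here is a minimal PyTorch sketch of the reverse Huber (berHu) loss mentioned above; the threshold $c = 0.2 \max |d|$ is a conventional choice and the `valid_mask` argument is a hypothetical convenience, neither taken from a specific DP-ER paper.

```python
import torch

def berhu_loss(pred: torch.Tensor, target: torch.Tensor, valid_mask=None) -> torch.Tensor:
    """Reverse Huber (berHu): L1 below the threshold c, scaled L2 above it."""
    diff = (pred - target).abs()
    if valid_mask is not None:                            # ignore pixels without ground truth
        diff = diff[valid_mask]
    c = torch.clamp(0.2 * diff.max().detach(), min=1e-6)  # adaptive threshold
    quadratic = (diff ** 2 + c ** 2) / (2.0 * c)
    return torch.where(diff <= c, diff, quadratic).mean()
```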

6. Empirical Results and Ablations

Dual-Projection Fusion achieves superior performance on established benchmarks across modalities:

  • LiDAR Semantic Segmentation (SemanticKITTI): DP-ER (as "MPF") attains mean IoU of 55.5 with 3.18 M parameters at 20.6 scans/sec, outperforming RangeNet++ (52.2 mIoU, 50 M params, 12.8 scans/sec) and PolarNet (54.3 mIoU, 13.6 M, 6.7 scans/sec). Fusion increases mIoU by 5–14 points over single-projection models; robustness persists for both near and far points (>30 m) (Alnaggar et al., 2020).
  • Panoramic Depth Estimation:
    • BiFuse++: RMSE drops from 0.5421 (equi-only) to 0.4321 (DP-ER) for self-supervised PanoSUNCG; from 0.7643 (OmniDepth, equi-only) to 0.5190 (DP-ER) on Matterport3D. Under camera-rotation, DP-ER attains higher invariance (Wang et al., 2022).
    • Elite360D (Bi-attention DP-ER): Absolute Relative Error drops from 0.1255 to 0.1115, RMSE from 0.5173 to 0.4875, with only 1 M extra parameters beyond ERP baseline; achieves up to 35.1% relative RMSE reduction on Structure3D (Ai et al., 25 Mar 2024).
    • SphereFusion: Achieves 0.0899 MRE and 0.1654 MAE on Stanford2D3D, with 17 ms per 512×1024 panorama, outperforming most state-of-the-art in both speed and RMSE (Yan et al., 9 Feb 2025).

Ablations confirm both branches contribute non-redundant information: removing either stream in SphereFusion increases error (MRE: 0.0461 for 2D-only, 0.0572 for mesh-only, vs. 0.0417 for DP-ER). Gated or bi-attention fusion modules outperform shallow or unidirectional fusion.

7. Implementation Complexity, Efficiency, and Extensions

DP-ER models are efficient relative to prior cross-projection schemes. For example, Elite360D’s ERP+ICOSAP DP-ER architecture (ResNet-34 backbone, 25.5 M params, 65.3 G FLOPs) requires only ~1 M extra parameters over the ERP-only baseline, compared to 52 M in BiFuse and 50 M in UniFuse with higher computational cost (Ai et al., 25 Mar 2024). In SphereFusion, real-time inference at 17 ms is supported by caching mesh adjacency structures across the multi-scale encoder/decoder, reducing overhead for mesh operations (Yan et al., 9 Feb 2025).

DP-ER is projection-agnostic: architectural and fusion modules can accommodate new surface samplings, non-uniform meshes, or further projective views, as suggested in future work on end-to-end multi-view fusion (Alnaggar et al., 2020).


Dual-Projection Fusion constitutes a family of techniques uniting local and global context through geometrically dual projections, modular cross-projection attention or gating mechanisms, and explicit geometric alignment. It consistently achieves higher fidelity predictions with lower computational cost on both scalable dense prediction and pointwise semantic segmentation tasks, and is extensible to further modalities and fusion strategies.
