BinEgo-360 Challenge: Advanced 360º Vision
- The BinEgo-360 Challenge is a comprehensive framework advancing egocentric 360º vision by integrating dense depth estimation, temporal action localization, multi-view reconstruction, and BEV mapping.
- It leverages techniques such as geometric feature fusion, transformer-based aggregation, and multi-task learning to address challenges such as spherical distortion and the constraints of rapid real-world deployment.
- Benchmark results demonstrate state-of-the-art performance on metrics (e.g., Abs Rel, mAP, IoU) across diverse datasets, ensuring robustness in real-world applications.
The BinEgo-360 Challenge is a competitive and methodological framework designed to advance algorithmic solutions for robust and efficient understanding of 360-degree egocentric visual data. It targets key problems in monocular dense depth estimation, temporal action localization, 360º object reconstruction, and related 3D scene representation tasks, often under constraints of limited views and rapid real-world deployment. The challenge emphasizes transparent benchmarking using equirectangular (ERP) or spherical camera data, with an explicit focus on passive, single-sensor processing and generalization beyond category labels.
1. Dense 360º Depth Estimation: The OmniFusion Paradigm
Accurate monocular depth estimation from 360º images is challenged by severe spherical distortions and the discrepancy between predictions from local perspective views and the global equirectangular domain. The OmniFusion pipeline provides a reference architecture for addressing this, decomposing a single ERP RGB image into overlapping tangent-plane perspective patches via closed-form gnomonic projection. Each patch is processed by a standard CNN encoder-decoder to produce local depth and confidence maps, which are reprojected and merged into a unified ERP depth map via confidence-weighted averaging.
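As an illustration of the patch-decomposition step, the following minimal NumPy/SciPy sketch samples one tangent-plane perspective patch from an ERP image via the closed-form inverse gnomonic projection; the field of view, patch size, and the simplified seam/pole handling are assumptions rather than the exact OmniFusion configuration.

```python
import numpy as np
from scipy.ndimage import map_coordinates

def tangent_patch(erp, lon0, lat0, fov_deg=80.0, size=256):
    """Sample one gnomonic (perspective) patch of `size`x`size` pixels and `fov_deg`
    field of view, centred at tangent point (lon0, lat0) in radians, from an ERP
    image of shape (H, W, 3). Seam wrap-around is ignored for simplicity."""
    h, w = erp.shape[:2]
    half = np.tan(np.radians(fov_deg) / 2.0)          # tangent-plane half extent
    xs = np.linspace(-half, half, size)
    x, y = np.meshgrid(xs, -xs)                       # rows top->bottom, tangent y points up
    rho = np.sqrt(x**2 + y**2) + 1e-12
    c = np.arctan(rho)                                # angular distance from tangent point
    # Inverse gnomonic projection: tangent-plane (x, y) -> sphere (lat, lon)
    lat = np.arcsin(np.clip(np.cos(c) * np.sin(lat0)
                            + y * np.sin(c) * np.cos(lat0) / rho, -1.0, 1.0))
    lon = lon0 + np.arctan2(x * np.sin(c),
                            rho * np.cos(lat0) * np.cos(c) - y * np.sin(lat0) * np.sin(c))
    # Map spherical coordinates to ERP pixel coordinates and sample bilinearly
    u = (lon + np.pi) % (2 * np.pi) / (2 * np.pi) * (w - 1)
    v = (np.pi / 2 - lat) / np.pi * (h - 1)
    return np.stack([map_coordinates(erp[..., ch], [v, u], order=1) for ch in range(3)], -1)
```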
OmniFusion incorporates geometry-aware feature fusion by embedding each pixel’s true 3D spherical location and tangent-center attributes via an MLP applied to $(\phi, \theta, r, \phi_c, \theta_c)$, where $(\phi, \theta)$ are longitude/latitude, $r$ is the radius (initialized to 1), and $(\phi_c, \theta_c)$ are the patch center angles. This geometric vector is summed with the 2D image feature channels at each early encoder stage, enabling the model to correct patch-wise discrepancies due to varying viewpoint geometry.
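A minimal PyTorch sketch of this fusion step is shown below; the two-layer MLP and its widths are illustrative assumptions, but the pattern (per-pixel geometric vector mapped to the feature dimension and summed with the image features) follows the description above.

```python
import torch
import torch.nn as nn

class GeoEmbedding(nn.Module):
    """Maps per-pixel geometry (phi, theta, r, phi_c, theta_c) to the feature dimension
    with a small MLP and adds it to the 2D image features (layer widths are illustrative)."""
    def __init__(self, feat_channels=64):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(5, 64), nn.ReLU(), nn.Linear(64, feat_channels))

    def forward(self, img_feat, geo):
        # img_feat: (B, C, H, W); geo: (B, H, W, 5)
        geo_feat = self.mlp(geo).permute(0, 3, 1, 2)  # (B, C, H, W)
        return img_feat + geo_feat                    # element-wise sum with image features

# Example: fuse 64-channel encoder features with the per-pixel geometric vector
feat = torch.randn(2, 64, 128, 128)
geo = torch.randn(2, 128, 128, 5)
fused = GeoEmbedding(64)(feat, geo)
```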
A self-attention transformer operating on patch-wise tokens then enforces global, geometry-consistent feature integration, further promoting consistency across overlapping patch predictions. Iterative depth refinement is implemented: after an initial ERP depth output, the geometric embedding is reparameterized with the estimated depth (replacing the unit radius $r$ with the predicted depth $\hat{d}$) and the pipeline is rerun, yielding a refined depth map. This two-pass strategy yields an additional performance gain.
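The global aggregation stage can be sketched with a standard transformer encoder over per-patch tokens, as below; the number of patches, token dimension, heads, and layers are illustrative assumptions rather than the reported architecture.

```python
import torch
import torch.nn as nn

n_patches, dim = 18, 256                        # e.g. 18 tangent patches, 256-d pooled tokens
tokens = torch.randn(1, n_patches, dim)         # one token per patch (pooled encoder features)

layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
patch_transformer = nn.TransformerEncoder(layer, num_layers=2)

fused_tokens = patch_transformer(tokens)        # self-attention mixes context across patches
```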
Quantitatively, OmniFusion achieves state-of-the-art results on the Stanford2D3D, Matterport3D, and 360D datasets, e.g., Abs Rel 0.0950 and RMSE 0.3474 on Stanford2D3D, surpassing BiFuse and UniFuse in all core metrics. Ablation studies confirm the critical impact of geometry fusion, transformer aggregation, and the iterative scheme for optimal monocular ERP depth prediction (Li et al., 2022).
2. Temporal Action Localization with Multi-task TSM Extensions
Temporal action localization (TAL) in egocentric video is addressed in the BinEgo-360 Challenge via an extended Temporal Shift Module (TSM), adapted for interval-wise, multi-class classification with an explicit background model. TSM’s efficient temporal modeling shifts a fraction of the feature channels forward/backward in time: $\tilde{X}_{t,c} = X_{t+1,c}$ for the first $\alpha C$ channels, $\tilde{X}_{t,c} = X_{t-1,c}$ for the next $\alpha C$ channels, and $\tilde{X}_{t,c} = X_{t,c}$ otherwise, with temporal indices clamped to $[1, T]$ and a typical shift fraction of $\alpha = 1/8$ per direction (as in the original TSM).
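A compact PyTorch sketch of the clamped temporal shift is given below; `shift_div` controls the shifted channel fraction, and the clamping at clip boundaries mirrors the formulation above (the exact fraction used in the challenge entry is not specified here).

```python
import torch

def temporal_shift(x, shift_div=8):
    """Shift 1/shift_div of the channels to take features from t+1 and another
    1/shift_div from t-1, clamping the temporal index at the clip boundaries.
    x: (N, T, C, H, W) clip features."""
    n, t, c, h, w = x.shape
    fold = c // shift_div
    idx = torch.arange(t, device=x.device)
    fwd = torch.clamp(idx + 1, max=t - 1)        # index t+1, clamped at the last frame
    bwd = torch.clamp(idx - 1, min=0)            # index t-1, clamped at the first frame
    out = x.clone()
    out[:, :, :fold] = x[:, fwd, :fold]
    out[:, :, fold:2 * fold] = x[:, bwd, fold:2 * fold]
    return out
```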
Videos are divided into fixed-length, non-overlapping intervals; features from each interval are pooled and classified among the action categories plus an explicit background class. At training time, intervals with ≥50% overlap with ground-truth actions are assigned the corresponding label, and the rest are assigned to background. Post-processing merges adjacent intervals with the same predicted class, scored by the maximum softmax probability.
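The ≥50%-overlap labeling rule can be expressed directly, as in the following sketch; the interval length, class-id convention, and background id are assumptions.

```python
def assign_interval_labels(num_intervals, interval_len, gt_segments, background_id=0):
    """Label each fixed-length interval with the class of the ground-truth action it
    overlaps by >= 50% of the interval length; otherwise assign the background class.
    gt_segments: list of (start_sec, end_sec, class_id)."""
    labels = []
    for i in range(num_intervals):
        start, end = i * interval_len, (i + 1) * interval_len
        best_label, best_overlap = background_id, 0.0
        for gs, ge, cls in gt_segments:
            overlap = max(0.0, min(end, ge) - max(start, gs))
            if overlap > best_overlap:
                best_overlap, best_label = overlap, cls
        labels.append(best_label if best_overlap / interval_len >= 0.5 else background_id)
    return labels
```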
Scene classification and TAL are trained jointly with separate heads sharing the TSM backbone; the losses are combined as a weighted sum $\mathcal{L} = \mathcal{L}_{\text{TAL}} + \lambda\,\mathcal{L}_{\text{scene}}$. A weighted ensemble aggregates predictions from multiple model variants, averaging confidences proportionally to each model’s validation mAP and fusing temporally overlapping proposals.
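A minimal sketch of the mAP-weighted ensemble, assuming each model contributes a per-interval softmax matrix and is weighted by its validation mAP:

```python
import numpy as np

def ensemble_predictions(prob_list, val_maps):
    """Average per-interval class confidences across models, weighting each model
    proportionally to its validation mAP.
    prob_list: list of (n_intervals, n_classes) softmax arrays; val_maps: list of floats."""
    w = np.asarray(val_maps, dtype=np.float64)
    w = w / w.sum()
    probs = np.stack(prob_list, axis=0)          # (n_models, n_intervals, n_classes)
    return np.tensordot(w, probs, axes=1)        # (n_intervals, n_classes)
```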
This system secured first place in both the initial and extended rounds, achieving 0.5631 private mAP (vs 0.4593), with ablations showing that multi-task learning outperforms single-task training (+3–4% mAP) and that ensembling provides a further ≈1.3% mAP. An optimal interval length was identified that balances temporal localization precision against per-interval classification stability (Duong et al., 12 Dec 2025).
3. Neural 360º Reconstruction from Sparse Views: ZeroRF
Sparse multi-view 360º object reconstruction is exemplified by methods such as ZeroRF, which forgoes pretraining and dense data in favor of per-scene optimization with fast convergence. ZeroRF represents the scene as a low-rank tensor-factorized grid (TensoRF-VM) and wraps all tensor components in an untrained deep image prior generator fed with fixed Gaussian noise. The system is trained with a pure rendering loss, $\mathcal{L} = \sum_{\mathbf{r}} \| \hat{C}(\mathbf{r}) - C(\mathbf{r}) \|_2^2$, under standard AdamW optimization, where $\hat{C}(\mathbf{r})$ is the NeRF-style rendered color along ray $\mathbf{r}$ and $C(\mathbf{r})$ the corresponding ground-truth pixel color.
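To illustrate the idea of a deep-image-prior-wrapped low-rank factorization trained with a pure reconstruction loss, the following toy PyTorch sketch fits a 2D image (a stand-in for the volumetric rendering supervision) with noise-fed, untrained generators producing CP-style factors; the 2D simplification, ranks, and layer sizes are assumptions and not the ZeroRF implementation.

```python
import torch
import torch.nn as nn

H, W, R = 64, 64, 8                      # toy "scene" resolution and factorization rank
target = torch.rand(3, H, W)             # stand-in for the rendering supervision signal

class NoiseToFactor(nn.Module):
    """Untrained generator (deep image prior) mapping fixed Gaussian noise to 1D factors."""
    def __init__(self, length, rank):
        super().__init__()
        self.register_buffer("noise", torch.randn(1, 16, length))   # fixed, not optimized
        self.net = nn.Sequential(
            nn.Conv1d(16, 32, 3, padding=1), nn.ReLU(),
            nn.Conv1d(32, rank * 3, 3, padding=1),
        )
    def forward(self):
        return self.net(self.noise).view(3, -1, self.noise.shape[-1])   # (3, rank, length)

gen_u, gen_v = NoiseToFactor(H, R), NoiseToFactor(W, R)
opt = torch.optim.AdamW(list(gen_u.parameters()) + list(gen_v.parameters()), lr=1e-3)

for step in range(200):
    u, v = gen_u(), gen_v()                          # (3, R, H), (3, R, W)
    recon = torch.einsum("crh,crw->chw", u, v)       # low-rank outer-product reconstruction
    loss = (recon - target).pow(2).mean()            # pure L2 reconstruction ("rendering") loss
    opt.zero_grad(); loss.backward(); opt.step()
```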
With only 4–6 calibrated images, ZeroRF can reconstruct a 360° neural radiance field in ∼2–5 minutes, matching or exceeding RegNeRF, DietNeRF, InfoNeRF, FreeNeRF, and FlipNeRF in PSNR, SSIM, and LPIPS, while operating orders of magnitude faster. For BinEgo-360, ZeroRF’s per-scene setup (camera sampling, factorization, training, and inference) aligns with warehouse-scale demands, providing real-time mesh extraction for manipulation planning without large-category pretraining biases (Shi et al., 2023).
4. Sequential Floor-Plan Estimation with Monocular 360º Inputs
The 360-DFPE pipeline exemplifies scalable floor-plan estimation from time-ordered streams of monocular 360º images. The system uses loosely-coupled modules: monocular visual SLAM (for camera poses, up to an unknown global scale) and a monocular 360º room-layout predictor (generating ceiling/floor boundaries in ERP space). Layout boundaries are registered to world coordinates by estimating a single global scale factor that minimizes the 2D entropy of the stacked projected point densities, $s^{*} = \arg\min_{s} H(\rho_{s})$, where $\rho_{s}$ is the density map obtained at scale $s$ and $H(\cdot)$ denotes Shannon entropy.
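A sketch of the entropy-minimizing scale search is shown below; `project_at_scale` is a hypothetical callable standing in for the registration of layout boundaries at a candidate scale, and the grid resolution is an assumption.

```python
import numpy as np

def density_entropy(points_xy, cell=0.1):
    """Shannon entropy of the 2D occupancy density of (N, 2) points at a fixed grid resolution."""
    idx = np.floor(points_xy / cell).astype(int)
    _, counts = np.unique(idx, axis=0, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log(p)).sum())

def estimate_global_scale(project_at_scale, candidates):
    """Grid-search the candidate scale whose stacked projected boundary density has
    minimum entropy. `project_at_scale(s)` is a hypothetical callable returning the
    (N, 2) world-frame boundary points registered with scale s."""
    scores = [density_entropy(project_at_scale(s)) for s in candidates]
    return candidates[int(np.argmin(scores))]
```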
Room identification and tracking uses occupancy-based voting over camera densities, dynamically instantiating or updating room maps via thresholding. Final room shapes are refined using an iterative shortest-path algorithm (iSPA), alternating between coarse global path estimation and local fine-tuning, producing sharp Manhattan- or arbitrary-shaped geometries with high corner precision at competitive run-time.
360-DFPE outperforms active-sensor methods on MP3D-FPE in room IoU and corner recall (e.g., 72%/78.2% at an IoU threshold of 0.5), with a run-time of 34 s per room. Recommended adaptations for BinEgo-360 include dynamic scale estimation, integration with semantic and inertial cues, and parallelized iSPA refinement for online deployment (Solarte et al., 2021).
5. 3D Object Lifting and Novel View Synthesis from Egocentric or Sparse Views
Single-image or limited multi-view 360º object reconstruction can also be addressed via diffusion- and NeRF-based frameworks such as NeuralLift-360. This pipeline accepts an in-the-wild RGB input, removes background, estimates a monocular depth map, and fits a hash-grid NeRF via a photometric loss on the reference view. A text-conditional diffusion model (e.g., Stable Diffusion v1.4) provides a generative prior by encouraging rendered novel views to match the distribution induced by the reference image and text embedding, guided by a combination of ELBO and CLIP-based losses.
A depth-ranking loss ensures that the NeRF geometry is consistent with the relative depth of the monocular estimate, and additional regularizations enforce plausible surface normals and smoothness. Empirically, NeuralLift-360 achieves the lowest average CLIP distance on 8-object 360º benchmarks, surpassing DSNeRF, DietNeRF, and SinNeRF. Limiting factors remain in final resolution and handling of multi-object occlusion.
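A common way to implement such a depth-ranking term is a pairwise hinge loss over randomly sampled pixel pairs, sketched below; the pair count, margin, and sampling scheme are assumptions rather than the exact NeuralLift-360 formulation.

```python
import torch

def depth_ranking_loss(pred_depth, ref_depth, num_pairs=4096, margin=1e-4):
    """Pairwise hinge loss: for random pixel pairs, penalize the rendered NeRF depth
    whenever its ordering disagrees with the monocular reference depth ordering."""
    flat_pred, flat_ref = pred_depth.reshape(-1), ref_depth.reshape(-1)
    i = torch.randint(0, flat_pred.numel(), (num_pairs,))
    j = torch.randint(0, flat_pred.numel(), (num_pairs,))
    sign = torch.sign(flat_ref[i] - flat_ref[j])    # reference ordering (+1 / -1 / 0)
    diff = flat_pred[i] - flat_pred[j]
    return torch.clamp(margin - sign * diff, min=0).mean()
```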
To adapt NeuralLift-360 for BinEgo-360, multi-frame photometric objectives and temporal smoothness penalties can be introduced, swapping the base NeRF for a streaming variant and augmenting training with synthetic egocentric trajectories (Xu et al., 2022).
6. Bird's-Eye-View Mapping from Single 360º Cameras
In contextually rich environments such as autonomous driving, the BinEgo-360 challenge benefits from datasets and benchmarks such as Dur360BEV. Utilizing a single Ricoh Theta S dual-fisheye camera aligned with high-resolution 128-channel LiDAR and RTK GNSS/INS, the Dur360BEV benchmark establishes spherical-image-to-BEV mapping via a staged projection pipeline.
Feature extraction begins with a backbone CNN (ResNet-101 or EfficientNet-b4), followed by a two-stage Feature Pulling process. Spherical 3D points are projected into the camera frame using polynomial-calibrated equations, and features are sampled by bilinear interpolation, assembled into a BEV grid. Coarse-to-fine sampling explicitly targets anchor pillars for improved segmentation, with a U-Net decoder producing the top-down logits.
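The feature-pulling step can be sketched as projecting BEV anchor points onto the spherical feature map and bilinearly sampling; the sketch below uses an ideal equirectangular projection in place of the polynomial-calibrated dual-fisheye model, and the camera-axis convention is an assumption.

```python
import math
import torch
import torch.nn.functional as F

def pull_bev_features(feat_erp, bev_points_cam):
    """Project 3D BEV anchor points onto a spherical (equirectangular) feature map and
    bilinearly sample a feature vector for each point.
    feat_erp:       (1, C, Hf, Wf) feature map in ERP layout
    bev_points_cam: (N, 3) points in the camera frame, assumed x-forward, y-left, z-up
    returns:        (N, C) pulled features"""
    x, y, z = bev_points_cam.unbind(-1)
    lon = torch.atan2(y, x)                               # [-pi, pi]
    lat = torch.atan2(z, torch.sqrt(x * x + y * y))       # [-pi/2, pi/2]
    u = lon / math.pi                                     # width coordinate in [-1, 1]
    v = -lat / (math.pi / 2)                              # height coordinate in [-1, 1]
    grid = torch.stack([u, v], dim=-1).view(1, 1, -1, 2)  # (1, 1, N, 2)
    sampled = F.grid_sample(feat_erp, grid, mode="bilinear", align_corners=True)
    return sampled.view(feat_erp.shape[1], -1).t()        # (N, C)
```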
To address the severe class imbalance in BEV segmentation, a focal loss is used, $\mathcal{L}_{\text{focal}} = -\alpha_t (1 - p_t)^{\gamma} \log(p_t)$, with the optimal focusing parameter $\gamma$ found in the range 1–2, sharpening object boundaries and yielding up to a 1.6% IoU gain. Dense Grid + ResNet-101 maximizes raw IoU, while Coarse/Fine + EfficientNet-b4 reaches nearly the same accuracy at a substantially lower parameter cost, achieving an IoU of 32.6% on the vehicle class. The methodology is directly applicable to monocular 360º BEV mapping tasks in the BinEgo-360 context, with potential extensions in temporal fusion, dynamic occlusion handling, and class-specific loss balancing (E et al., 2 Mar 2025).
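A standard focal-loss implementation for per-cell BEV segmentation, matching the formulation above (the α balancing term and its default value are assumptions):

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """Binary focal loss over BEV cells: down-weights easy background cells so that the
    rare foreground (e.g., vehicle) cells dominate the gradient."""
    p = torch.sigmoid(logits)
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t = targets * p + (1 - targets) * (1 - p)
    alpha_t = targets * alpha + (1 - targets) * (1 - alpha)
    return (alpha_t * (1 - p_t) ** gamma * ce).mean()
```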
Summary Table: Principal Methods for Key BinEgo-360 Tracks
| Task | Representative Method | Reference (arXiv ID) |
|---|---|---|
| Depth Estimation | OmniFusion | (Li et al., 2022) |
| Temporal Action Localization | Multi-task TSM | (Duong et al., 12 Dec 2025) |
| Sparse 360º Reconstruction | ZeroRF | (Shi et al., 2023) |
| Floor-plan Estimation | 360-DFPE | (Solarte et al., 2021) |
| Single-view 3D Lifting | NeuralLift-360 | (Xu et al., 2022) |
| BEV Mapping | Dur360BEV Spherical-BEV | (E et al., 2 Mar 2025) |
Each pipeline integrates specific geometric, temporal, or multi-modal strategies tailored to the challenges of omnidirectional egocentric vision, with a consistent focus on accuracy, efficiency, and minimal reliance on category-level pretraining. The convergence of dense geometry, sequence modeling, and multi-view neural representation defines the evolving frontier of the BinEgo-360 Challenge.