Bi-Grid Voxelization for Panoramic 3D Occupancy
- Bi-Grid Voxelization (BGV) is a dual-grid 3D modeling technique that uses parallel Cartesian and cylindrical-polar voxel grids to capture detailed near-field and uniform panoramic features.
- The method mitigates quantization biases by fusing multi-scale geometric features and aligning voxel grids to balance local precision with global angular continuity.
- BGV integrates into the OneOcc pipeline, enabling full-surround semantic scene completion crucial for legged and humanoid robotic navigation.
Bi-Grid Voxelization (BGV) is a 3D representation and feature-lifting technique developed for panoramic semantic occupancy prediction, specifically addressing quantization biases caused by the conflicting demands of dense near-field capture and uniform angular sampling required in 360° robot perception. BGV constructs and processes both a standard Cartesian and a cylindrical-polar voxel grid in parallel at multiple scales, allowing the system to capture high-fidelity spatial structures and uniform scene coverage simultaneously. This approach is integral to the OneOcc system for semantic scene completion with panoramic vision, enabling precise, full-surround environment modeling for legged and humanoid robots.
1. Motivation and Problem Setting
Robots with panoramic cameras—especially legged or humanoid platforms—require occupancy prediction systems capable of robustly modeling 3D free and occupied space for navigation. Equirectangular panoramas present a unique sampling dilemma: Cartesian voxel grids, though optimal for uniform spatial (x, y, z) capture near the robot, misalign with the annular, 360° camera field, leading to non-uniform sampling (over-sampling near field, under-sampling in far field) and artifacts at the azimuthal boundaries. Cylindrical-polar discretizations, defined in (r, φ, z) coordinates, maintain perfect angular continuity and uniformity across the panorama but sacrifice radial (near-field) detail critical for small obstacle detection and precise foot placement.
Relying solely on one grid structure necessitates a trade-off: either accept coarse far-field resolution or inadequate near-field representation, resulting in jagged, aliased free/occupied boundaries and diminished task performance. BGV is designed to mitigate these issues by unifying the strengths of both grid types—near-field precision and global angular uniformity—through dual-voxelization and cross-stream feature fusion.
2. Coordinate Systems and Transformations
BGV operates simultaneously in two coordinate systems:
- Cartesian (x, y, z): Right-handed world frame anchored to the robot. Suitable for uniform volume sampling and precise local geometry representation.
- Cylindrical-polar (r, φ, z): Where r = √(x² + y²), φ = atan2(y, x), and z = z. This system aligns with the panoramic image's inherent azimuthal (φ) structure and models radial distance explicitly.
Coordinate transformations are defined as (x, y, z) ↦ (√(x² + y²), atan2(y, x), z), with inverse (r, φ, z) ↦ (r cos φ, r sin φ, z). This dual system allows both local and angular correspondence between 3D world space and panoramic 2D feature representations.
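The forward and inverse transforms above can be sketched directly in NumPy (a minimal illustration of the coordinate conventions, not code from OneOcc):

```python
import numpy as np

def cart_to_cyl(xyz):
    """Map (x, y, z) world points to (r, phi, z) cylindrical-polar coordinates."""
    x, y, z = xyz[..., 0], xyz[..., 1], xyz[..., 2]
    r = np.hypot(x, y)          # r = sqrt(x^2 + y^2)
    phi = np.arctan2(y, x)      # phi in (-pi, pi]
    return np.stack([r, phi, z], axis=-1)

def cyl_to_cart(rpz):
    """Inverse map: (r, phi, z) back to (x, y, z)."""
    r, phi, z = rpz[..., 0], rpz[..., 1], rpz[..., 2]
    return np.stack([r * np.cos(phi), r * np.sin(phi), z], axis=-1)
```

Both functions operate on arrays of shape (..., 3), so whole voxel grids can be converted in one vectorized call.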
3. Dual Voxel Grid Construction
Each input is lifted into two parallel 3D voxel grids per feature scale:
Cartesian voxel grid
- Bounds: x ∈ [x_min, x_max], y ∈ [y_min, y_max], z ∈ [z_min, z_max]
- Indices: i = ⌊(x − x_min)/Δx⌋, j = ⌊(y − y_min)/Δy⌋, k = ⌊(z − z_min)/Δz⌋
- Centroids: x_i = x_min + (i + ½)Δx, and similarly for y_j, z_k
Cylindrical-polar voxel grid
- Bounds: r ∈ [r_min, r_max], φ spanning the full [0, 2π), identical z-bounds
- Indices: j = ⌊(r − r_min)/Δr⌋, k = ⌊φ/Δφ⌋, with the z index as above
- Centroids mapped to world frame: x = r_j cos φ_k, y = r_j sin φ_k, with r_j = r_min + (j + ½)Δr, φ_k = (k + ½)Δφ
Both grid types are constructed at multiple scales (e.g., downsampled by {1, 4, 8, 16} relative to input resolution).
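The dual-grid construction can be sketched as follows; the bounds and resolutions here are hypothetical placeholders, since the paper's actual grid dimensions are not given in this section:

```python
import numpy as np

# Hypothetical bounds/resolutions for illustration only.
X_MIN, X_MAX, N_X = -10.0, 10.0, 128
Z_MIN, Z_MAX, N_Z = -2.0, 4.0, 32
R_MAX, N_R, N_PHI = 10.0, 64, 256

def cartesian_centroids(scale=1):
    """Cell-center coordinates of the Cartesian grid at a given downsample factor."""
    nx, nz = N_X // scale, N_Z // scale
    xs = np.linspace(X_MIN, X_MAX, nx, endpoint=False) + (X_MAX - X_MIN) / (2 * nx)
    zs = np.linspace(Z_MIN, Z_MAX, nz, endpoint=False) + (Z_MAX - Z_MIN) / (2 * nz)
    X, Y, Z = np.meshgrid(xs, xs, zs, indexing="ij")  # square x/y extent assumed
    return np.stack([X, Y, Z], axis=-1)               # (nx, nx, nz, 3)

def polar_centroids(scale=1):
    """Cell centers of the cylindrical grid, mapped into the world frame."""
    nr, nphi, nz = N_R // scale, N_PHI // scale, N_Z // scale
    rs = (np.arange(nr) + 0.5) * R_MAX / nr           # r_j = (j + 1/2) * dr
    phis = (np.arange(nphi) + 0.5) * 2 * np.pi / nphi # phi_k = (k + 1/2) * dphi
    zs = np.linspace(Z_MIN, Z_MAX, nz, endpoint=False) + (Z_MAX - Z_MIN) / (2 * nz)
    R, PHI, Z = np.meshgrid(rs, phis, zs, indexing="ij")
    return np.stack([R * np.cos(PHI), R * np.sin(PHI), Z], axis=-1)  # (nr, nphi, nz, 3)
```

Note that the polar centroids are already expressed in the world frame, which is what makes the cross-grid nearest-neighbor lookup of Section 4 a plain Euclidean query.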
4. Feature Lifting, View2View Sampling, and Fusion
At each scale ℓ ∈ {1, 4, 8, 16}, BGV operates as follows:
- Input acquisition: Two panoramic feature maps from the Dual-Projection fusion encoder, one from the equirectangular branch and one from the complementary projection branch. These may be spatially warped by learned Gait Displacement Compensation offsets Δ_ℓ.
- Projection and sampling: For each voxel centroid c (in both grids), project to each view and sample by 2D bilinear interpolation at p = π(c) + Δ_ℓ: V_ℓ^Ca[c] = BilinearSample(F_{1/ℓ}, p), and analogously V_ℓ^Po[c] for the polar grid.
- Multi-scale fusion: Per-voxel, fuse across scales with convex weights (summing to 1 over ℓ): V^Ca[c] = Σ_ℓ α_ℓ^Ca(c) · V_ℓ^Ca[c], and likewise for the polar stream.
- Context injection: For each Cartesian voxel at level ℓ, its nearest neighbors in the polar grid are precomputed as index set J_ℓ. Polar features for these neighbors are aligned with a convolution and concatenated channel-wise with Cartesian features: Ca_fused_ℓ = Concat(Align(V_ℓ^Po[J_ℓ]), V_ℓ^Ca).
Pseudocode:

```
for scale ℓ in {1, 4, 8, 16}:
    for each voxel c in Ca_grid_ℓ:
        p = project(c) + Δ_ℓ
        Vca_ℓ[c] = bilinear_sample(F_{1/ℓ}, p)
    for each voxel c in Po_grid_ℓ:
        p = project(c) + Δ_ℓ
        Vpo_ℓ[c] = bilinear_sample(F_{1/ℓ}, p)
Vca[c] = Σ_ℓ α^{Ca_ℓ}(c) · Vca_ℓ[c]
Vpo[c] = Σ_ℓ α^{Po_ℓ}(c) · Vpo_ℓ[c]
Ca_fused_ℓ = Concat( Align( Vpo_ℓ[J_ℓ] ), Vca_ℓ )
```
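A runnable NumPy sketch of the projection, bilinear sampling, and convex multi-scale fusion steps. The camera-centered equirectangular projection model (azimuth → u, elevation → v) and all shapes here are assumptions for illustration; the paper's exact projection, warp handling, and seam padding may differ:

```python
import numpy as np

def project_equirect(pts, H, W):
    """Project camera-centered 3D points to equirectangular pixel coordinates.
    u spans azimuth across [0, W); v spans elevation across [0, H)."""
    x, y, z = pts[..., 0], pts[..., 1], pts[..., 2]
    phi = np.arctan2(y, x)                        # azimuth in (-pi, pi]
    theta = np.arctan2(z, np.hypot(x, y))         # elevation in (-pi/2, pi/2)
    u = (phi / (2 * np.pi) + 0.5) * W
    v = (0.5 - theta / np.pi) * H
    return np.stack([u, v], axis=-1)

def bilinear_sample(feat, uv):
    """Bilinearly sample an (H, W, C) feature map at continuous (u, v) locations."""
    H, W, _ = feat.shape
    u = np.clip(uv[..., 0], 0, W - 1.001)         # clamp so u0+1, v0+1 stay in range
    v = np.clip(uv[..., 1], 0, H - 1.001)
    u0, v0 = np.floor(u).astype(int), np.floor(v).astype(int)
    du, dv = (u - u0)[..., None], (v - v0)[..., None]
    return ((1 - du) * (1 - dv) * feat[v0, u0]
            + du * (1 - dv) * feat[v0, u0 + 1]
            + (1 - du) * dv * feat[v0 + 1, u0]
            + du * dv * feat[v0 + 1, u0 + 1])

def lift(centroids, feats_per_scale, weights):
    """Lift voxel centroids into feature space with convex multi-scale fusion.
    `weights` must sum to 1 across scales, mirroring the α_ℓ of the text."""
    out = 0.0
    for feat, w in zip(feats_per_scale, weights):
        H, W, _ = feat.shape
        out = out + w * bilinear_sample(feat, project_equirect(centroids, H, W))
    return out
```

A real implementation would additionally wrap the azimuthal seam (u near 0/W) during sampling and use learned, per-voxel α weights rather than the fixed scalars shown here.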
5. Boundary Sharpening and Bias Mitigation
BGV achieves sharper free/occupied boundaries and reduces aliasing through complementary sampling densities:
- Near-field: Cartesian grid supplies fine, uniform spatial resolution, supporting accurate detection of sharp transitions such as contact edges or obstacles immediately adjacent to the robot.
- Far-field and angular continuity: Cylindrical grid maintains uniform sampling in azimuth (φ) and models radial distance directly, capturing elongated or circumferential structures without under-sampling edge regions.
Dual-injection—concatenation of aligned polar features into the Cartesian processing stream at every decoder stage—allows the model to specialize, dynamically leveraging the strengths of each grid depending on scene context. Importantly, BGV does not require additional boundary-aware losses; the sharpening is an emergent property of complementary grid samplings and convex multi-scale fusion. Boundary-aware mechanisms external to BGV (such as the AMoE-3D decoder’s gradient-energy gate) are handled at later stages, with BGV itself purely geometry- and sampling-driven.
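The complementary sampling densities can be made concrete with a quick back-of-the-envelope computation (the cell sizes below are illustrative assumptions, not the paper's values):

```python
import numpy as np

# Illustrative resolutions: a 20 m x 20 m Cartesian grid with 128 cells per axis,
# versus a cylindrical grid with 256 azimuth bins.
cell = 20.0 / 128            # Cartesian cell edge ~0.156 m, constant everywhere
dphi = 2 * np.pi / 256       # polar azimuthal bin width, constant in angle

for r in (1.0, 5.0, 10.0):
    cart_angle = cell / r    # angle (rad) subtended by one Cartesian cell at range r
    polar_arc = dphi * r     # metres covered by one polar bin at range r
    print(f"r={r:4.1f} m  Cartesian cell ≈ {np.degrees(cart_angle):5.2f}°, "
          f"polar bin arc ≈ {polar_arc:.3f} m")
```

Near the robot a Cartesian cell subtends a wide angle (fine metric detail dominates), while at range the polar grid keeps a constant angular step so azimuthal coverage never degrades, which is exactly the complementarity the dual-injection exploits.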
6. Integration Within OneOcc Semantic Occupancy Pipeline
BGV is interposed in the OneOcc network as follows:
- Inputs: Multi-scale 2D features from dual-projection encoders, plus learned 2D motion warps Δ_ℓ from Gait Displacement Compensation.
- Outputs: Fused 3D Cartesian volumes with injected polar context per decoder level, and optionally standalone polar 3D volumes for downstream use.
- Flow of information: Each volume feeds into the corresponding Hierarchical AMoE-3D decoder stage, where channel/spatial attention and MoE fusion integrate multi-scale context. Final semantic occupancy prediction is performed by a 1×1×1 classification head on the fully fused volume.
The end-to-end sequence is summarized as: Panorama → unwrapping → dual 2D encoders → GDC warps → BGV (dual 3D grid lifting + cross-injection) → AMoE-3D fusion → segmentation head → discretized semantic occupancy.
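The sequence above amounts to a linear composition of stages; a minimal sketch of that structure (stage names are placeholders mirroring the flow, each would wrap the corresponding network module in a real implementation):

```python
from typing import Callable, Sequence

def compose(stages: Sequence[Callable]) -> Callable:
    """Chain pipeline stages left to right: compose([f, g])(x) == g(f(x))."""
    def run(x):
        for stage in stages:
            x = stage(x)
        return x
    return run

# Hypothetical stand-ins for the OneOcc stages summarized in the text.
pipeline = compose([
    lambda pano: ("unwrapped", pano),   # panorama unwrapping
    lambda t: ("2d_feats", t),          # dual 2D encoders + GDC warps
    lambda t: ("fused_volumes", t),     # BGV dual-grid lifting + cross-injection
    lambda t: ("occupancy", t),         # AMoE-3D fusion + segmentation head
])
```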
This procedure allows BGV to systematically exploit both local and panoramic scene geometry, yielding aliasing-robust, high-fidelity 3D occupancy outputs without the need for specialized edge-awareness outside standard volumetric segmentation procedures.
7. Significance and Context
By jointly modeling Cartesian and cylindrical-polar discretizations, BGV enables legged or humanoid robots with panoramic monocular vision to obtain reliable 3D semantic occupancy in omnidirectional, jitter-prone environments, where single-grid voxelization frameworks would suffer from discretization bias and boundary artifacts. The BGV approach is distinct in managing spatial-angular trade-offs by explicit grid-wise injection and scale-wise convex fusion, which together account for both near-robot accuracy and holistic 360° scene continuity. This suggests broader relevance for robotics researchers aiming to unify multi-view geometry, spatial perception, and efficient volumetric prediction in full-surround contexts.