
OccAny: Unified 3D Occupancy Prediction

Updated 2 July 2026
  • OccAny is a 3D occupancy framework that generalizes across unconstrained urban, robotics, and panoramic scenarios using transformer-based vision backbones.
  • It integrates segmentation-driven regularization with test-time view synthesis to reduce geometric noise and sharpen object boundaries.
  • Its unified, calibration-free design achieves state-of-the-art IoU performance on diverse benchmarks by efficiently fusing multi-view tokens.

OccAny is a family of 3D occupancy prediction frameworks generalizing across unconstrained and out-of-domain urban, robotics, and panoramic scenarios. OccAny models dense 3D occupancy grids from monocular, sequential, surround-view, or panoramic images—without requiring in-domain pose calibration or sensor priors—by integrating transformer-based vision backbones, segmentation-driven regularization, and test-time view synthesis or geometry fusion. Its variants achieve state-of-the-art zero-shot urban scene occupancy and full-surround semantic occupancy on both vehicle-centric and legged robot benchmarks (Cao et al., 24 Mar 2026, Shi et al., 5 Nov 2025).

1. Unified Transformer-Based Architecture and 3D Representation

OccAny predicts a dense 3D occupancy grid $O \in [0,1]^{X \times Y \times Z}$, giving the per-voxel probability of occupancy in metric space. This grid is constructed by voxelizing and integrating per-view “pointmaps” (per-pixel 3D predictions) composed from image sequences (monocular or multi-camera) via trilinear interpolation. The core framework can be interpreted as learning an implicit field $f: \mathbb{R}^3 \rightarrow [0,1]$ but relies, in practice, on explicit grid accumulation.
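
The grid accumulation described above can be illustrated with a small trilinear splatting routine. This is a minimal sketch and not the authors' implementation: the function name `splat_points_trilinear`, the grid origin convention (voxel centers at integer coordinates), and the final clipping to $[0,1]$ are all assumptions made for illustration.

```python
import numpy as np

def splat_points_trilinear(points, grid_shape, voxel_size, origin):
    """Accumulate 3D points into a voxel grid using trilinear weights.

    Hypothetical helper: each point spreads a total weight of 1 over the
    8 voxels surrounding it; the accumulated grid is clipped to [0, 1].
    """
    grid = np.zeros(grid_shape, dtype=np.float64)
    # Continuous voxel coordinates (voxel centers sit at integer coordinates).
    coords = (np.asarray(points, dtype=np.float64) - np.asarray(origin)) / voxel_size
    base = np.floor(coords).astype(int)
    frac = coords - base
    shape = np.array(grid_shape)
    for dx in (0, 1):
        for dy in (0, 1):
            for dz in (0, 1):
                idx = base + np.array([dx, dy, dz])
                # Weight is the product of per-axis linear interpolation factors.
                wx = frac[:, 0] if dx else 1.0 - frac[:, 0]
                wy = frac[:, 1] if dy else 1.0 - frac[:, 1]
                wz = frac[:, 2] if dz else 1.0 - frac[:, 2]
                w = wx * wy * wz
                # Drop contributions that fall outside the grid.
                inside = np.all((idx >= 0) & (idx < shape), axis=1)
                np.add.at(grid, tuple(idx[inside].T), w[inside])
    return np.clip(grid, 0.0, 1.0)

# Two points: one halfway between two voxels, one exactly on a voxel center.
demo_grid = splat_points_trilinear(
    [[1.5, 0.0, 0.0], [2.0, 1.0, 0.0]], (4, 4, 4), 1.0, (0.0, 0.0, 0.0)
)
```

A point lying between voxel centers contributes fractional mass to each neighbor, which is one simple way to turn per-pixel 3D predictions into soft occupancy values.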

Modalities and Backbone

  • Input Modalities:
    • Sequential monocular: $N_{rec} = 6$–$10$ frames from a single camera.
    • Single-frame monocular: $N_{rec} = 1$.
    • Surround view: $M \approx 6$ synchronized cameras at a single timestep.
  • Shared Backbone: A 24-layer transformer encoder ($E$) and decoder ($D$) process all image input types in the urban variant; in the legged-robot variant, dual-projection encoders unroll panoramic images into annular/equirectangular views.
  • Scene Memory: Cross-attention over a “scene memory” $M$ aggregates multi-view tokens to enable spatial and temporal fusion.
  • Task Tokens and Output Heads: Geometry (tgt_g) and segmentation task tokens guide decoding into per-view outputs, including global pointmaps, image-local pointmaps, confidence maps, and segmentation features.
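
To make the scene-memory idea concrete, the following sketch shows single-head scaled dot-product cross-attention in which view tokens query a bank of memory tokens. It is only an illustration of the generic mechanism: the learned query/key/value projections, multi-head structure, and token dimensions used by OccAny are not described here, so the function `cross_attention` and its shapes are assumptions.

```python
import numpy as np

def cross_attention(queries, memory):
    """Single-head cross-attention without learned projections (illustrative).

    queries: (Nq, d) tokens from the current view.
    memory:  (Nm, d) tokens stored in the scene memory.
    Returns the attended tokens (Nq, d) and the attention weights (Nq, Nm).
    """
    d = queries.shape[-1]
    scores = queries @ memory.T / np.sqrt(d)
    # Subtract the row maximum for a numerically stable softmax.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ memory, weights

# Toy example: 2 query tokens attend over 3 identical memory tokens.
demo_queries = np.ones((2, 4))
demo_memory = np.tile(np.arange(4.0), (3, 1))
demo_out, demo_weights = cross_attention(demo_queries, demo_memory)
```

Because every output token is a convex combination of memory tokens, information from any stored view can flow into the token being decoded.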

2. Segmentation-Driven Geometry Completion and Forcing

Urban scene supervision (LiDAR) is sparse, especially in cluttered zones. OccAny regularizes occupancy predictions by aligning its segmentation features with those of a high-quality segmentation foundation model (SAM2 or similar).

  • Segmentation Forcing Loss: For each frame, the model penalizes the per-pixel L2 distance between its predicted segmentation features and the frozen SAM2 feature map, weighted by the learned geometric confidence.
  • Effects: Enforcing this constraint reduces geometric noise in poorly observed regions, sharpens object boundaries, and enables semantic mask promptability. Segmentation forcing injects gradients that benefit the shared backbone, visible as improved edge sharpness and occupancy structure in ablation studies (Cao et al., 24 Mar 2026).
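
A confidence-weighted feature distance of the kind described above can be sketched as follows. This is an assumption-laden illustration: the name `segmentation_forcing_loss`, the use of an unsquared L2 norm, and the mean reduction over pixels are choices made here, and any normalization or confidence regularizer used in the actual method is not covered.

```python
import numpy as np

def segmentation_forcing_loss(pred_feat, target_feat, confidence):
    """Mean over pixels of confidence * L2 distance between feature vectors.

    pred_feat, target_feat: (H, W, C) feature maps (target treated as frozen).
    confidence: (H, W) non-negative per-pixel weights.
    """
    dist = np.linalg.norm(pred_feat - target_feat, axis=-1)  # (H, W)
    return float(np.mean(confidence * dist))

# Toy example: features differ by 1 in each of 3 channels, confidence 0.5.
demo_pred = np.zeros((2, 2, 3))
demo_target = np.ones((2, 2, 3))
demo_conf = np.full((2, 2), 0.5)
demo_loss = segmentation_forcing_loss(demo_pred, demo_target, demo_conf)
```

Weighting by confidence means that pixels where the geometry is trusted dominate the alignment signal, while uncertain pixels contribute less.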

3. Novel View Rendering and Test-Time Augmentation

Geometry completion in OccAny is reinforced via test-time view augmentation, designed as a rendering pipeline that synthesizes novel viewpoints along predicted egomotion trajectories.

  • Pose Sampling: New camera centers are sampled at regular metric intervals along the trajectory, each with a set of lateral offsets and yaw angles.
  • Rendering Pipeline:
    • Merge reconstructed global pointmaps to form a single scene point cloud.
    • Project this point cloud to each novel camera frame and extract features to construct “novel-view tokens.”
    • Encode these via a lightweight transformer (6-layer encoder) supervised to distill the original encoder's representations.
    • The decoder cross-attends to the frozen scene memory $M$ and yields refined occupancy and segmentation outputs from synthetic views.
  • Consistency Objectives: Rendering-stage training comprises pointmap losses, segmentation forcing on rendered features, and encoder distillation.
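
The pose-sampling step can be sketched in 2D as below. The spacing, lateral offsets, and yaw angles passed in are placeholders rather than the values used by OccAny, the function name `sample_novel_poses` is hypothetical, and the trajectory is assumed to be a planar polyline with no zero-length segments.

```python
import numpy as np

def sample_novel_poses(trajectory_xy, step_m, lateral_offsets_m, yaw_offsets_rad):
    """Sample (x, y, yaw) camera poses along a 2D polyline trajectory.

    A center is placed every `step_m` meters of arc length; each center is
    then shifted sideways and rotated by every offset/yaw combination.
    """
    traj = np.asarray(trajectory_xy, dtype=np.float64)
    seg = np.diff(traj, axis=0)
    seg_len = np.linalg.norm(seg, axis=1)
    arc = np.concatenate([[0.0], np.cumsum(seg_len)])  # arc length at each vertex
    poses = []
    for s in np.arange(0.0, arc[-1] + 1e-9, step_m):
        # Index of the segment containing arc length s (clamped to the last one).
        i = min(int(np.searchsorted(arc, s, side="right")) - 1, len(seg) - 1)
        t = (s - arc[i]) / seg_len[i]
        center = traj[i] + t * seg[i]
        heading = np.arctan2(seg[i, 1], seg[i, 0])
        left = np.array([-np.sin(heading), np.cos(heading)])  # unit left normal
        for off in lateral_offsets_m:
            for dyaw in yaw_offsets_rad:
                x, y = center + off * left
                poses.append((float(x), float(y), float(heading + dyaw)))
    return poses

# Straight 10 m trajectory, a pose every 5 m, three lateral offsets, one yaw.
demo_poses = sample_novel_poses([(0, 0), (10, 0)], 5.0, [-1.0, 0.0, 1.0], [0.0])
```

Each sampled pose would then serve as a virtual camera into which the merged point cloud is projected.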

4. Multi-Scale Semantics and Robotics Adaptations

In robotics (OneOcc), OccAny is extended for panoramic 360° and gait-resilient occupancy estimation:

  • Dual-Projection Fusion (DP-ER): The annular panoramic image is unwrapped (via polynomial mapping) to an equirectangular view, and both representations are processed in parallel to preserve 360° continuity and mitigate polar/annular distortions.
  • Bi-Grid Voxelization (BGV): Features are lifted to both Cartesian and cylindrical-polar grids, fusing at each scale with per-voxel convex weights to balance near-field geometrical fidelity and panoramic global context.
  • Hierarchical AMoE-3D Decoder: A lightweight, depthwise-separable 3D UNet utilizes channel and spatial gating, mixture-of-experts, and gradient-energy gating to inject semantic gradients and preserve sharpness at different spatial resolutions.
  • Gait Displacement Compensation (GDC): Learns subpixel alignment corrections to rectify footfall-induced jitter without requiring inertial or proprioceptive measurements. This direct mapping of 2D features to 3D voxels is essential for agile robot operation (Shi et al., 5 Nov 2025).
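
The two ingredients of Bi-Grid Voxelization, a cylindrical coordinate view of the scene and a per-voxel convex fusion, can be sketched as follows. This is not the OneOcc implementation: the function names, the sigmoid gate used to produce convex weights, and the assumption that both feature volumes have already been resampled onto a common grid are all illustrative.

```python
import numpy as np

def cartesian_to_cylindrical(xyz):
    """Convert (..., 3) Cartesian points to (radius, azimuth, height)."""
    x, y, z = xyz[..., 0], xyz[..., 1], xyz[..., 2]
    return np.stack([np.hypot(x, y), np.arctan2(y, x), z], axis=-1)

def fuse_bigrid(cart_feat, cyl_feat, gate_logits):
    """Per-voxel convex combination of two (X, Y, Z, C) feature volumes.

    gate_logits: (X, Y, Z) unconstrained scores; a sigmoid maps them to
    weights in (0, 1), so the result is w * cart + (1 - w) * cyl.
    """
    w = 1.0 / (1.0 + np.exp(-gate_logits))
    w = w[..., None]  # broadcast the weight over the channel axis
    return w * cart_feat + (1.0 - w) * cyl_feat

demo_cyl = cartesian_to_cylindrical(np.array([[1.0, 0.0, 2.0], [0.0, 2.0, 0.0]]))
demo_fused = fuse_bigrid(
    np.ones((2, 2, 2, 3)), np.zeros((2, 2, 2, 3)), np.zeros((2, 2, 2))
)
```

With zero gate logits the two grids are averaged; larger logits shift a voxel toward the Cartesian features, smaller ones toward the cylindrical features.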

5. Training Objectives and Loss Structure

OccAny jointly minimizes occupancy (geometry) and semantic (segmentation) losses, with additional objectives for rendered (synthetic) views:

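  • Pointmap Loss: Applied on both global and local pointmaps.
  • Segmentation Forcing: As above, applied with its own loss weight.
  • Encoder Distillation: Used only in the rendering stage, with its own loss weight.
  • Full Loss: The weighted combination of the pointmap, segmentation forcing, and (in rendering) encoder distillation terms.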
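
Read together, the terms listed in this section suggest an objective of the following general shape. This is a sketch under stated assumptions: the symbols $\mathcal{L}_{\text{pts}}$, $\mathcal{L}_{\text{seg}}$, $\mathcal{L}_{\text{distill}}$ and the weights $\lambda_{\text{seg}}$, $\lambda_{\text{distill}}$ are names introduced here for illustration, and their values are not given in this summary.

```latex
\mathcal{L}
  = \mathcal{L}_{\text{pts}}^{\text{global}}
  + \mathcal{L}_{\text{pts}}^{\text{local}}
  + \lambda_{\text{seg}} \, \mathcal{L}_{\text{seg}}
  + \lambda_{\text{distill}} \, \mathcal{L}_{\text{distill}},
\qquad
\lambda_{\text{distill}} = 0 \ \text{outside the rendering stage.}
```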

6. Experimental Results and Generalization

Urban 3D Occupancy

  • Datasets: Mixed training on Waymo, DDAD, PandaSet, VKITTI2, ONCE; evaluation on SemanticKITTI (monocular/sequential) and Occ3D-NuScenes (surround view).
  • IoU (Sequence, SemanticKITTI): OccAny 25.91%, outperforming CUT3R*+TTVA (15.93%), VGGT+TTVA (15.20%), DA3 (15.76%) (Cao et al., 24 Mar 2026).
  • IoU (Monocular, SemanticKITTI): OccAny 24.03% vs. CUT3R*+TTVA 13.03%; OccAny surpasses OccNeRF (self-supervised, in-domain) at 22.81%.
  • IoU (Surround, Occ3D-NuScenes): OccAny 34.15%, besting baselines by over 10 points.
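
For reference, the geometric IoU reported above compares the set of voxels predicted as occupied with the set of ground-truth occupied voxels. A minimal sketch follows; the 0.5 threshold and the convention of returning 1.0 when both grids are empty are assumptions made here, not details taken from the benchmarks.

```python
import numpy as np

def occupancy_iou(pred_prob, gt_occ, threshold=0.5):
    """Intersection over union between thresholded predictions and ground truth."""
    pred = np.asarray(pred_prob) >= threshold
    gt = np.asarray(gt_occ).astype(bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 1.0  # convention chosen here for two empty grids
    return float(np.logical_and(pred, gt).sum() / union)

# Four voxels: one true positive, one false positive, one false negative.
demo_iou = occupancy_iou([0.9, 0.8, 0.1, 0.2], [1, 0, 1, 0])
```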

Semantic Occupancy

  • SemanticKITTI (Grounded-SAM2, mask mIoU): OccAny achieves mIoU 7.28% (mIoU_sc 13.53%), outpacing competitors (≤4.92% mIoU, ≤9.56% mIoU_sc).

Robotics: Panoramic, Gait-Resilient 3D Completion

  • QuadOcc (Real Quadruped, Robot): OneOcc achieves mIoU 20.56 on 64×64×8 voxel grids, outperforming LMSCNet (18.44) and MonoScene (19.19).
  • Human360Occ (Simulation): Within-city, mIoU 37.29 vs. MonoScene 33.46 (+3.83); cross-city, OneOcc achieves mIoU 32.23 vs. 24.15 (+8.08, ≈33.5% relative gain) (Shi et al., 5 Nov 2025).
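
The semantic scores above are class-averaged IoUs, and the quoted relative gain follows from simple arithmetic on the two cross-city numbers. The sketch below illustrates both; the function names and the `ignore_label=255` convention are assumptions for illustration, and classes absent from both prediction and ground truth are simply skipped.

```python
import numpy as np

def mean_iou(pred_labels, gt_labels, num_classes, ignore_label=255):
    """Average per-class IoU over classes that appear in prediction or ground truth."""
    pred = np.asarray(pred_labels).ravel()
    gt = np.asarray(gt_labels).ravel()
    valid = gt != ignore_label  # voxels without a label are excluded
    ious = []
    for c in range(num_classes):
        p = (pred == c) & valid
        g = (gt == c) & valid
        union = (p | g).sum()
        if union > 0:
            ious.append((p & g).sum() / union)
    return float(np.mean(ious)) if ious else float("nan")

def relative_gain(new_score, old_score):
    """Relative improvement of new_score over old_score."""
    return (new_score - old_score) / old_score

demo_miou = mean_iou([0, 0, 1, 1], [0, 1, 1, 1], num_classes=2)
demo_gain = relative_gain(32.23, 24.15)  # cross-city mIoU figures quoted above
```

Here `demo_gain` evaluates to roughly 0.335, matching the ≈33.5% relative gain stated for the cross-city setting.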

Ablations and Generalization

Ablation studies confirm significant IoU drops when any individual component is removed (test-time view augmentation, encoder distillation, segmentation forcing, task tokens). OccAny’s zero-shot protocol outperforms domain-adaptive supervised state-of-the-art methods (e.g., CVT-Occ in cross-dataset settings) by large margins.

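| Setting | Model | Metric | Score | Best baseline |
|---|---|---|---|---|
| SemanticKITTI (Sequence) | OccAny | IoU (%) | 25.91 | 15.93 |
| SemanticKITTI (Monocular) | OccAny | IoU (%) | 24.03 | 13.03 |
| Occ3D-NuScenes (Surround) | OccAny | IoU (%) | 34.15 | 20.42 |
| QuadOcc (Robot) | OneOcc | mIoU | 20.56 | 19.19 |
| Human360Occ (Cross-City) | OneOcc | mIoU | 32.23 | 24.15 |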

7. Design Considerations, Deployability, and Applicability

OccAny does not require camera intrinsics, extrinsics, or sensor rig priors. The architecture accommodates arbitrary camera configurations and grid sizes/resolutions, making it suitable for out-of-domain scenes, robotics, and urban environments without further calibration. Lightweight variants (101.8M parameters, ≈1.86GB at FP32) achieve 14–19 FPS (FP32/FP16) on consumer GPUs. Embedded (Jetson class) deployment yields >10 FPS with kernel sparsity, supporting onboard real-time occupancy estimation for autonomous navigation in legged robots or urban perception stacks (Shi et al., 5 Nov 2025).

OccAny’s design advances 3D scene completion by leveraging a unified transformer backbone, segmentation-augmented geometry completion, and test-time novel view synthesis, yielding robust performance, generalization, and versatility across standard urban and robotics 3D occupancy tasks (Cao et al., 24 Mar 2026, Shi et al., 5 Nov 2025).
