FB-OCC: Fast BEV Occupancy Model
- FB-OCC is a deep neural architecture that employs a dual-stream forward-backward view transformation to generate detailed 3D voxel and BEV features for occupancy prediction.
- It integrates multi-scale voxel feature fusion with a Partial FPN to achieve efficient spatial aggregation while reducing computational cost.
- Joint depth–semantic pre-training and auxiliary losses enhance geometric and semantic alignment, leading to state-of-the-art mIoU performance on challenging benchmarks.
The FB-OCC model (Forward-Backward Occupancy or Fast BEV-Occupancy Model, depending on context) refers to a family of deep neural architectures for 3D occupancy prediction from multi-view images, targeting the semantic scene completion and occupancy segmentation tasks crucial for autonomous driving. FB-OCC methods build upon the principle of mapping multi-view camera input into fine-grained 3D voxel representations, leveraging view transformation strategies, efficient feature fusion, and tailored training procedures to maximize accuracy and computational viability.
1. Architectural Foundations and Forward–Backward View Transformation
FB-OCC architectures are grounded in a dual-stream view transformation paradigm, motivated by the need for rich 3D context and efficient utilization of multi-view imagery. The forward-backward framework first applies a forward-projection module (F-VTM), wherein per-pixel depth distributions are predicted and used to unproject each pixel of every camera into a discretized 3D voxel tensor via a lift-splat-style formulation, $\mathbf{V}(u,v,d) = \alpha(u,v,d)\,\mathbf{F}(u,v)$, where $\mathbf{F}(u,v)$ is the image feature at pixel $(u,v)$ and $\alpha(u,v,d)$ is the predicted probability of depth bin $d$.
Depth uncertainty is preserved through the learned distributions, yielding smooth feature accumulation in 3D space. The backward-projection module (B-VTM) aggregates along the height (Z) axis to produce BEV features, then refines these via deformable attention that leverages depth-aware sampling—a design shared with models such as BEVFormer. Fused volumetric features are obtained by concatenating the voxel and BEV representations along the appropriate axes and further processed with 3D convolutions prior to the occupancy prediction head (Li et al., 2023).
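The forward projection described above can be condensed into a few operations: each pixel's feature is weighted by its predicted depth distribution and then "splatted" into the voxel grid. The following is a minimal NumPy sketch, not the actual implementation; the single-camera setup, the precomputed `voxel_idx` lookup, and all shapes are simplifying assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def forward_project(feat, depth_logits, voxel_idx, grid_shape):
    """Sketch of an LSS-style forward projection (F-VTM).

    feat:         (H, W, C)   image features
    depth_logits: (H, W, D)   per-pixel depth distribution logits
    voxel_idx:    (H, W, D, 3) precomputed voxel coords per (pixel, depth bin)
    grid_shape:   (X, Y, Z)
    """
    H, W, C = feat.shape
    alpha = softmax(depth_logits)                     # depth uncertainty kept
    frustum = alpha[..., None] * feat[:, :, None, :]  # (H, W, D, C) outer product
    vol = np.zeros((*grid_shape, C))
    flat_idx = voxel_idx.reshape(-1, 3)
    flat_feat = frustum.reshape(-1, C)
    # "splat": accumulate each frustum feature into its voxel cell
    np.add.at(vol, (flat_idx[:, 0], flat_idx[:, 1], flat_idx[:, 2]), flat_feat)
    return vol
```

Because each pixel's depth distribution sums to one, the total feature mass is preserved by the unprojection, which is one way the smooth accumulation mentioned above manifests.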
In a related variant focused on computational efficiency (Lu et al., 2024), explicit 3D query operations are replaced by a "lifting" process: 2D BEV feature maps are lifted into 3D tensors using a deformable Conv2D operation. Here, the deformable kernel offsets adaptively sample the 2D BEV in a way that increases expressivity with negligible added cost compared to dense 3D convolutions.
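The channel-to-height lifting can be illustrated with a plain 1×1 convolution standing in for the deformable Conv2D (the learned sampling offsets are omitted for brevity); the function name and shapes below are illustrative, not from the paper.

```python
import numpy as np

def lift_bev_to_3d(bev, weight, z_bins):
    """Lift a 2D BEV feature map into a 3D tensor by emitting z_bins * C
    channels and reshaping channels into the height axis.

    bev:    (X, Y, Cin)        BEV features
    weight: (Cin, z_bins * C)  1x1-conv weight (a matmul over channels)
    returns (X, Y, z_bins, C)
    """
    X, Y, Cin = bev.shape
    C = weight.shape[1] // z_bins
    out = bev.reshape(-1, Cin) @ weight   # 1x1 conv expressed as a matmul
    return out.reshape(X, Y, z_bins, C)
```

In the actual variant, a deformable kernel would adaptively sample the BEV plane before this reshape, adding expressivity at little extra cost.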
2. Multi-Scale Voxel Feature Fusion and Partial FPN
FB-OCC variants employ a "Partial Voxel Feature Pyramid Network" (Partial FPN) that fuses voxel features at multiple spatial scales, but crucially restricts multi-scale processing to the ground-plane (xy) axes. For each height slice, half the layers proceed unmodified while the other half undergo progressive downsampling via strided 2D convolutions, recursively aggregating coarse-to-fine spatial features. The coarse 3D context is integrated at the lowest-resolution level using a lightweight 3D convolution, then upsampled and fused with the original independent per-slice features via convolutions and concatenation. Formally, at each level $l$, the downsampled features are $F_{l+1} = \mathrm{Conv}_{2\mathrm{D}}^{s=2}(F_l)$ and the fused output is $\hat{F}_l = \mathrm{Conv}\big([F_l;\ \mathrm{Up}(\hat{F}_{l+1})]\big)$, where $[\cdot\,;\cdot]$ denotes concatenation and $\mathrm{Up}$ denotes upsampling.
This design reduces FPN inference cost by more than half compared to a full 3D-FPN while incurring negligible accuracy loss (Lu et al., 2024).
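The Partial FPN idea can be sketched per height slice as follows. This is a toy NumPy illustration under stated simplifications: a channel split stands in for the "half the layers" split, average pooling and nearest-neighbor upsampling stand in for the strided and fusion convolutions, and only one pyramid level is shown.

```python
import numpy as np

def avg_pool2x(x):
    """2x downsampling in the xy plane; x: (X, Y, C) with even X, Y."""
    X, Y, C = x.shape
    return x.reshape(X // 2, 2, Y // 2, 2, C).mean(axis=(1, 3))

def upsample2x(x):
    """Nearest-neighbor 2x upsampling in the xy plane."""
    return x.repeat(2, axis=0).repeat(2, axis=1)

def partial_fpn_slice(slice_feat):
    """One height slice: half the channels pass through unmodified, the
    other half receive coarse xy context; slice_feat: (X, Y, C)."""
    C = slice_feat.shape[-1]
    keep, coarse = slice_feat[..., : C // 2], slice_feat[..., C // 2 :]
    coarse = upsample2x(avg_pool2x(coarse))   # coarse context, scale restored
    return np.concatenate([keep, coarse], axis=-1)

def partial_fpn(vol):
    """Apply the per-slice fusion independently along Z; vol: (X, Y, Z, C)."""
    return np.stack(
        [partial_fpn_slice(vol[:, :, z]) for z in range(vol.shape[2])], axis=2
    )
```

The key cost property is visible even in this toy: all multi-scale work happens in 2D per slice, and no dense 3D convolution touches the full pyramid.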
3. Joint Depth–Semantic Pre-Training and Auxiliary Losses
Leading FB-OCC instantiations leverage multi-task pre-training to enhance geometric and semantic representation. A typical strategy involves cross-entropy depth supervision (over discretized bins), semantic segmentation supervision, or both, using LiDAR-projected depth as ground truth and vision foundation model-derived masks (e.g., from SAM) for semantic targets. The joint loss formulation is $\mathcal{L} = \mathcal{L}_{\mathrm{depth}} + \lambda\,\mathcal{L}_{\mathrm{sem}}$, where $\mathcal{L}_{\mathrm{depth}}$ and $\mathcal{L}_{\mathrm{sem}}$ are the supervised losses for depth and pixel-wise semantics, respectively, and $\lambda$ is a weighting factor. This pre-training produces a better initialization for subsequent voxel-wise occupancy prediction (Li et al., 2023).
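A minimal sketch of the joint objective, with cross-entropy over discretized depth bins and pixel-wise semantic classes; function names and the mean reduction are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def cross_entropy(logits, targets):
    """Mean cross-entropy; logits: (N, K); targets: (N,) integer labels."""
    z = logits - logits.max(axis=1, keepdims=True)        # numerical stability
    log_p = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -log_p[np.arange(len(targets)), targets].mean()

def joint_pretrain_loss(depth_logits, depth_bins, sem_logits, sem_labels, lam=0.5):
    """L = L_depth + lambda * L_sem, as described above."""
    return cross_entropy(depth_logits, depth_bins) + lam * cross_entropy(sem_logits, sem_labels)
```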
Additional auxiliary branches, e.g., cost-free 2D segmentation in the perspective view, target visible pixels post-FPN and provide focal-loss supervision only during training, yielding a measurable mIoU gain without impacting inference efficiency (Lu et al., 2024). Training losses are also masked to exclude voxels or pixels never visible to the sensors, further improving convergence and robustness.
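The visibility masking can be sketched as below. In practice the `visible` mask would come from sensor ray-casting; here it is just a boolean array, and the cross-entropy form is an illustrative assumption.

```python
import numpy as np

def masked_cross_entropy(logits, labels, visible):
    """Cross-entropy restricted to sensor-visible entries.

    logits:  (N, K) class scores per voxel/pixel
    labels:  (N,)   integer targets
    visible: (N,)   bool mask; entries never observed are excluded from the loss
    """
    logits, labels = logits[visible], labels[visible]
    z = logits - logits.max(axis=1, keepdims=True)
    log_p = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -log_p[np.arange(len(labels)), labels].mean()
```

Excluding never-visible entries keeps the gradient from being dominated by regions the model cannot possibly infer, which matches the convergence benefit described above.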
4. Model Scaling and Computational Efficiency
FB-OCC demonstrates favorable scaling properties as model capacity and resolution increase. Systematic scaling along three axes (backbone size from ResNet50 to InternImage-H, input image resolution, and voxel grid granularity) consistently yields 2–3 point increases in mIoU per doubling, ultimately reaching 54.19 mIoU on the nuScenes test split with ensembled 1.2B-parameter models (Li et al., 2023).
To address the inference bottleneck, the fast FB-OCC variant (Lu et al., 2024) replaces 3D deformable attention with 2D deformable convolution-based lifting, and the full 3D-FPN (26% of total runtime) with the Partial FPN (9.6%). This results in up to 2.7× lower inference latency than the baseline OccNet and superior throughput on hardware such as MI100/MI250.
| Method | mIoU (Res50) | Relative latency (MI100) | Speedup vs. OccNet |
|---|---|---|---|
| OccNet | 19.48 | 3.30× | 1.0× |
| FB-OCC | 21.12 | 1.22× | 2.7× |
| BEVNet base | 17.37 | 1.00× | 3.3× |
5. Post-Processing and Model Ensembling Strategies
FB-OCC deployments employ advanced post-processing for maximum accuracy. Test-time augmentation (TTA) includes horizontal flips in image and world space, as well as temporal aggregation across frames, leading to final predictions computed as means across variants. Temporal TTA for static voxels further stabilizes predictions for classes such as road and vegetation by substituting noisy frames with the nearest aligned previous prediction.
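The flip-based TTA amounts to averaging predictions over the original and mirrored inputs, un-mirroring before the average. A minimal sketch, assuming a BEV-space horizontal flip stands in for the image/world-space flips and that `predict` is any equivariant model callable:

```python
import numpy as np

def tta_flip_mean(predict, bev_input):
    """Average predictions over the original and y-flipped input.

    predict:   callable mapping a (X, Y, C) input to (X, Y, K) class scores
    bev_input: (X, Y, C)
    """
    normal = predict(bev_input)
    flipped = predict(bev_input[:, ::-1])[:, ::-1]   # flip in, un-flip out
    return 0.5 * (normal + flipped)
```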
Ensemble strategies assign per-model, per-category weights $w_{m,c}$, jointly optimized using model-wide and per-category IoU statistics on a validation split. These post-processing procedures collectively yield robust occupancy maps with minimal far-range noise and strong generalization (Li et al., 2023).
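The weighted ensembling can be sketched as a per-class weighted vote; in this NumPy illustration the weights `w[m, c]` are given directly, whereas in practice they would be tuned on the validation split as described above.

```python
import numpy as np

def ensemble(probs, w):
    """Fuse per-model class probabilities with per-model, per-category weights.

    probs: (M, N, C) probabilities from M models over N voxels, C categories
    w:     (M, C)    per-model, per-category weights w[m, c]
    returns (N,)     fused class predictions
    """
    fused = (probs * w[:, None, :]).sum(axis=0)   # weighted vote per class
    fused /= fused.sum(axis=-1, keepdims=True)    # renormalize to a distribution
    return fused.argmax(axis=-1)
```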
6. Empirical Performance and Benchmarks
FB-OCC has established state-of-the-art results across several benchmarks. Notably, it set the highest reported mIoU (54.19) on the Occ3D-nuScenes challenge, surpassing earlier monocular and BEV-centric approaches by a wide margin:
| Method | mIoU (Occ3D-nuScenes) |
|---|---|
| MonoScene | 6.06 |
| BEVFormer | 26.88 |
| CTF-Occ | 28.53 |
| FB-OCC | 54.19 (test) |
Ablation experiments demonstrated individual module contributions: depth supervision (+4 pp), ignoring invisible voxels (+8 pp), joint pre-training (+0.8 pp), advanced losses (+0.8 pp), spatial TTA (+1.4 pp). On OpenOcc and SemanticKITTI, FB-OCC showed consistent or improved classwise accuracy and inference speed relative to OccNet and TPVFormer (Lu et al., 2024). A plausible implication is that joint view transformation and partial 3D processing offer a favorable trade-off for real-time large-scale 3D scene understanding.
7. Design Principles and Application Scope
The core principles established by FB-OCC are: (1) dual-stream 3D/BEV feature construction via forward and backward view transformation; (2) cost-efficient multi-scale 3D fusion restricted to spatial axes; (3) multi-task pre-training for geometric/semantic alignment; (4) elimination of computational bottlenecks using deformable convolutions and minimal 3D computation; and (5) robust ensemble-based post-processing. These methods have direct application in end-to-end vision-centric autonomous driving stacks, high-precision HD mapping, and robotics environments requiring fine-grained 3D occupancy segmentation from images alone (Li et al., 2023, Lu et al., 2024).
Future work likely includes extension to denser voxel grids, broader sensor fusion (radar/LiDAR), and further reductions in runtime for embedded deployment.