
FB-OCC: Fast BEV Occupancy Model

Updated 16 January 2026
  • FB-OCC is a deep neural architecture that employs a dual-stream forward-backward view transformation to generate detailed 3D voxel and BEV features for occupancy prediction.
  • It integrates multi-scale voxel feature fusion with a Partial FPN to achieve efficient spatial aggregation while reducing computational cost.
  • Joint depth–semantic pre-training and auxiliary losses enhance geometric and semantic alignment, leading to state-of-the-art mIoU performance on challenging benchmarks.

The FB-OCC model (Forward-Backward Occupancy or Fast BEV-Occupancy Model, depending on context) refers to a family of deep neural architectures for 3D occupancy prediction from multi-view images, targeting the semantic scene completion and occupancy segmentation tasks crucial for autonomous driving. FB-OCC methods build upon the principle of mapping multi-view camera input into fine-grained 3D voxel representations, leveraging view transformation strategies, efficient feature fusion, and tailored training procedures to maximize accuracy and computational viability.

1. Architectural Foundations and Forward–Backward View Transformation

FB-OCC architectures are grounded in a dual-stream view transformation paradigm, motivated by the need for rich 3D context and efficient use of multi-view imagery. The forward-backward framework first applies a forward-projection module (F-VTM), in which per-pixel depth distributions $P_i(u, v, d)$ are predicted and used to unproject each pixel $(u, v)$ of every camera $i$ into a discretized 3D voxel tensor $F^{\mathrm{fwd}}(x, y, z)$ via

$$F^{\mathrm{fwd}}(x, y, z) = \sum_{i=1}^{N_{\text{cam}}} \sum_{u, v} P_i(u, v, z) \cdot I_i(u, v) \cdot \delta\!\left(\lfloor X(u, v, z)/\Delta x \rfloor = x\right)\,\delta\!\left(\lfloor Y(u, v, z)/\Delta y \rfloor = y\right).$$

Depth uncertainty is preserved through the learned $P_i$ distributions, yielding smooth feature accumulation in 3D space. The backward-projection module (B-VTM) aggregates $F^{\mathrm{fwd}}$ along the height ($z$) axis to produce BEV features, then refines these via deformable attention that leverages depth-aware sampling, a design shared with models such as BEVFormer. Fused volumetric features are obtained by concatenating the voxel and BEV representations along the appropriate axes and further processed with 3D convolutions prior to the occupancy prediction head (Li et al., 2023).
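The F-VTM accumulation above can be sketched in NumPy. The camera geometry (projection of each pixel/depth-bin into voxel indices) is assumed precomputed and passed in as `vox_xy`; all names here are illustrative, not the reference implementation:

```python
import numpy as np

def forward_project(feats, depth_probs, vox_xy, grid_hw):
    """Scatter per-pixel image features into a ground-plane grid, weighted
    by a per-pixel depth distribution (lift-splat style accumulation).

    feats:       (N_cam, H, W, C)    image features I_i(u, v)
    depth_probs: (N_cam, H, W, D)    depth distributions P_i(u, v, d)
    vox_xy:      (N_cam, H, W, D, 2) precomputed integer (x, y) voxel indices
                 per (camera, pixel, depth bin) -- stands in for the
                 floor(X/dx), floor(Y/dy) terms; geometry is assumed given.
    grid_hw:     (X, Y) size of the output grid.
    Returns a (X, Y, C) accumulated feature grid.
    """
    X, Y = grid_hw
    C = feats.shape[-1]
    out = np.zeros((X, Y, C))
    N_cam, H, W, D = depth_probs.shape
    for i in range(N_cam):
        for u in range(H):
            for v in range(W):
                for d in range(D):
                    x, y = vox_xy[i, u, v, d]
                    if 0 <= x < X and 0 <= y < Y:
                        out[x, y] += depth_probs[i, u, v, d] * feats[i, u, v]
    return out
```

The explicit loops make the delta-function bookkeeping visible; a practical implementation would use a vectorized scatter-add instead.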

In a related variant focused on computational efficiency (Lu et al., 2024), explicit 3D query operations are replaced by a "lifting" process: 2D BEV feature maps $F \in \mathbb{R}^{C \times H \times W}$ are lifted into 3D tensors $V \in \mathbb{R}^{C' \times D \times H \times W}$ using a deformable Conv2D operation. Here, the deformable kernel offsets $\Delta p_k$ adaptively sample the 2D BEV in a way that increases expressivity with negligible added cost compared to dense 3D convolutions.
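A minimal sketch of the lifting step, with the deformable offsets omitted for brevity: a $1 \times 1$ convolution (here a plain matrix multiply over channels) emits $C' \cdot D$ channels that are reshaped into a depth axis:

```python
import numpy as np

def lift_bev_to_3d(bev, weight, c_out, n_depth):
    """Lift a 2D BEV map (C, H, W) into a 3D volume (C', D, H, W).

    This sketch uses a plain 1x1 convolution emitting C'*D channels, then
    reshapes them into a depth axis; the paper's variant additionally learns
    deformable sampling offsets, which are omitted here.

    bev:    (C, H, W) input BEV features
    weight: (C'*D, C) 1x1 conv kernel
    """
    C, H, W = bev.shape
    out = weight @ bev.reshape(C, H * W)      # (C'*D, H*W)
    return out.reshape(c_out, n_depth, H, W)  # (C', D, H, W)
```

The key point is that the depth axis is manufactured from channels, so no 3D convolution or 3D attention is needed.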

2. Multi-Scale Voxel Feature Fusion and Partial FPN

FB-OCC variants employ a "Partial Voxel Feature Pyramid Network" (Partial FPN) that fuses voxel features at multiple spatial scales but restricts the multi-scale processing to the ground-plane ($xy$) axes. Half of the height slices (even $h$) undergo progressive downsampling via strided 2D convolutions, recursively aggregating coarse-to-fine spatial features, while the other half (odd $h$) proceed unmodified. The coarse 3D context is integrated at the lowest-resolution level using a lightweight 3D convolution, then upsampled and fused with the per-slice features via $1 \times 1$ convolutions and concatenation. Formally, at each level $\ell$:

$$V_{\text{down}}^{\ell+1} = \operatorname{Conv2D}_{\text{stride}2}(V^{\ell}_{\text{even}~h}), \qquad V_{\text{agg}}^{\ell} = \operatorname{Conv2D}_{1 \times 1}(V^{\ell}_{\text{odd}~h}) + \operatorname{Upsample}(V_{\text{down}}^{\ell+1}),$$

$$V^{\text{out}} = \operatorname{Concat}_h\!\left(V_{\text{agg}}^{\ell}, V_{\text{down}}^{\ell}, \ldots\right).$$

This design reduces FPN inference cost by more than $4\times$ compared to a full 3D-FPN while incurring negligible accuracy loss (Lu et al., 2024).
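The Partial FPN's slice-splitting structure can be sketched as follows, substituting average pooling and identity maps for the strided and $1 \times 1$ convolutions; this illustrates the data flow only, not the trained operators:

```python
import numpy as np

def partial_fpn(vox):
    """Partial FPN over a voxel tensor (C, Z, H, W): half the height slices
    are kept at full resolution, the other half are downsampled in the
    ground plane, then upsampled and fused back before re-stacking.

    Strided / 1x1 convolutions are replaced by average pooling and
    identity maps; this is a structural sketch, not the trained network.
    """
    C, Z, H, W = vox.shape
    keep, down = vox[:, 0::2], vox[:, 1::2]          # split height slices
    # "strided conv": 2x2 average pool in the (H, W) plane
    coarse = down.reshape(C, Z // 2, H // 2, 2, W // 2, 2).mean(axis=(3, 5))
    # coarse 3D context would be applied here; then nearest-neighbour upsample
    up = coarse.repeat(2, axis=2).repeat(2, axis=3)  # back to (C, Z/2, H, W)
    fused = keep + up                                # "1x1 conv + add" stand-in
    return np.concatenate([fused, down], axis=1)     # re-stack along height
```

Note that all the expensive work happens in 2D per height slice, which is where the >4x cost reduction over a full 3D-FPN comes from.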

3. Joint Depth–Semantic Pre-Training and Auxiliary Losses

Leading FB-OCC instantiations leverage multi-task pre-training to enhance geometric and semantic representation. A typical strategy involves cross-entropy depth supervision (over discretized bins), semantic segmentation supervision, or both, using LiDAR-projected depth as ground truth and vision foundation model-derived masks (e.g., from SAM) for semantic targets. The joint loss formulation is

$$\mathcal{L}_{\text{pre}} = \mathcal{L}_d + \lambda_{s} \mathcal{L}_s,$$

where $\mathcal{L}_d$ and $\mathcal{L}_s$ are the supervised depth and pixel-wise semantic losses, respectively, and $\lambda_s$ is a weighting factor. This pre-training produces a better initialization for subsequent voxel-wise occupancy prediction (Li et al., 2023).
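A small sketch of the joint objective, assuming discretized depth-bin targets and pseudo-semantic labels; the weight `lam_s` is illustrative, not a value reported by the papers:

```python
import numpy as np

def cross_entropy(logits, targets):
    """Mean cross-entropy; logits (N, K), targets (N,) integer class ids."""
    z = logits - logits.max(axis=1, keepdims=True)       # numerical stability
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()

def pretrain_loss(depth_logits, depth_bins, sem_logits, sem_labels, lam_s=0.25):
    """L_pre = L_d + lambda_s * L_s, as in the pre-training objective above.

    depth_bins: LiDAR-projected depths discretized into bin indices.
    sem_labels: pseudo-labels (e.g. derived from SAM masks).
    lam_s:      illustrative weighting factor, not a published value.
    """
    return (cross_entropy(depth_logits, depth_bins)
            + lam_s * cross_entropy(sem_logits, sem_labels))
```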

Additional auxiliary branches, e.g., cost-free 2D segmentation in perspective view, target visible pixels post-FPN and provide focal loss supervision only during training, yielding over $1\%$ mIoU gain without impacting inference efficiency (Lu et al., 2024). Training losses are also masked to exclude voxels or pixels never visible to sensors, further improving convergence and robustness.
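Visibility masking can be expressed as restricting the loss reduction to observed voxels, e.g.:

```python
import numpy as np

def masked_voxel_loss(per_voxel_loss, visible_mask):
    """Average a per-voxel loss only over voxels ever observed by a sensor.

    per_voxel_loss: (X, Y, Z) unreduced loss values
    visible_mask:   (X, Y, Z) bool, True where any camera/LiDAR ray passed
    """
    vis = visible_mask.astype(bool)
    if not vis.any():
        return 0.0
    return float(per_voxel_loss[vis].mean())
```

Excluding never-visible voxels keeps the optimizer from being penalized for regions where no ground truth can exist.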

4. Model Scaling and Computational Efficiency

FB-OCC demonstrates favorable scaling properties as model capacity and resolution are increased. Systematic scaling along three axes (backbone size from ResNet-50 to InternImage-H, input image resolution, and voxel grid granularity) consistently yields $2$–$3$ point mIoU increases per doubling, ultimately reaching up to $54.19\%$ mIoU on the nuScenes test split with ensembled 1.2B-parameter models (Li et al., 2023).

To address the inference bottleneck, the fast FB-OCC variant (Lu et al., 2024) replaces 3D deformable attention (incurring $48\times$ the baseline latency) with 2D deformable-convolution-based lifting ($5.4\times$), and the full 3D-FPN ($26\%$ of total cost) with the Partial FPN ($9.6\%$). This results in $2$–$3\times$ lower inference latency than the baseline OccNet and superior throughput on hardware such as MI100/MI250.

| Method | mIoU (ResNet-50) | Relative latency (MI100) | Speedup vs. OccNet |
| --- | --- | --- | --- |
| OccNet | 19.48 | 3.30× | 1.0× |
| FB-OCC | 21.12 | 1.22× | 2.7× |
| BEVNet base | 17.37 | 1.00× | 3.3× |

5. Post-Processing and Model Ensembling Strategies

FB-OCC deployments employ advanced post-processing for maximum accuracy. Test-time augmentation (TTA) includes horizontal flips in image and world space, as well as temporal aggregation across $T$ frames, with final predictions computed as means over the $8 \times T$ variants. Temporal TTA for static voxels further stabilizes predictions for classes such as road and vegetation by substituting noisy frames with the nearest aligned previous prediction.
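A minimal two-view version of the spatial TTA (flip the input, un-flip the prediction, average), assuming the prediction shares the input's spatial layout; the full scheme in the text averages 8 spatial variants across $T$ frames:

```python
import numpy as np

def tta_flip_mean(predict, image):
    """Average predictions over the identity and a horizontal flip,
    mapping the flipped prediction back to the original frame.

    predict: callable mapping an array to a same-shaped prediction
    image:   (..., W) input with width as the last axis
    """
    p0 = predict(image)
    p1 = predict(image[..., ::-1])[..., ::-1]  # flip input, un-flip output
    return 0.5 * (p0 + p1)
```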

Ensemble strategies assign per-model, per-category weights, $O_{\mathrm{ens}}(c, x, y, z) = \sum_{i=1}^{M} \alpha_{i}(c)\, O_i(c, x, y, z)$, with the weights jointly optimized using model-wide and per-category IoU statistics on a validation split. These post-processing procedures collectively yield robust occupancy maps with minimal far-range noise and strong generalization (Li et al., 2023).
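The per-model, per-category weighting reduces to a single tensor contraction, sketched here with illustrative shapes:

```python
import numpy as np

def ensemble_occupancy(probs, alphas):
    """Per-model, per-category weighted ensemble:
    O_ens(c, x, y, z) = sum_i alpha_i(c) * O_i(c, x, y, z).

    probs:  (M, C, X, Y, Z) per-model class scores
    alphas: (M, C) per-model, per-class weights (e.g. tuned on a val split)
    """
    return np.einsum('mcxyz,mc->cxyz', probs, alphas)
```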

6. Empirical Performance and Benchmarks

FB-OCC has established state-of-the-art results across several benchmarks. Notably, it set the highest reported mIoU ($54.19\%$) on the Occ3D-nuScenes challenge, surpassing earlier monocular and BEV-centric approaches by a wide margin:

| Method | mIoU (Occ3D-nuScenes) |
| --- | --- |
| MonoScene | 6.06 (val) |
| BEVFormer | 26.88 (val) |
| CTF-Occ | 28.53 (val) |
| FB-OCC | 54.19 (test) |

Ablation experiments demonstrated individual module contributions: depth supervision (+4 pp), ignoring invisible voxels (+8 pp), joint pre-training (+0.8 pp), advanced losses (+0.8 pp), spatial TTA (+1.4 pp). On OpenOcc and SemanticKITTI, FB-OCC showed consistent or improved classwise accuracy and inference speed relative to OccNet and TPVFormer (Lu et al., 2024). A plausible implication is that joint view transformation and partial 3D processing offer a favorable trade-off for real-time large-scale 3D scene understanding.

7. Design Principles and Application Scope

The core principles established by FB-OCC are: (1) dual-stream 3D/BEV feature construction via forward and backward view transformation; (2) cost-efficient multi-scale 3D fusion restricted to spatial axes; (3) multi-task pre-training for geometric/semantic alignment; (4) elimination of computational bottlenecks using deformable convolutions and minimal 3D computation; and (5) robust ensemble-based post-processing. These methods have direct application in end-to-end vision-centric autonomous driving stacks, high-precision HD mapping, and robotics environments requiring fine-grained 3D occupancy segmentation from images alone (Li et al., 2023, Lu et al., 2024).

Future work likely includes extension to denser voxel grids, broader sensor fusion (radar/LiDAR), and further reductions in runtime for embedded deployment.
