
ShelfOcc: Occupancy Modeling in Shelved Environments

Updated 26 November 2025
  • ShelfOcc is a computational paradigm for modeling object occupancy in shelf-like domains using probabilistic voxel grids and belief maps.
  • It integrates multi-view and temporal fusion, employing CNN segmentation and Bayesian filters to robustly address occlusion and enhance scene understanding.
  • The approach supports advanced robotic planning and manipulation by leveraging semantic priors and hybrid policies for efficient search, retrieval, and tidying.

ShelfOcc encompasses a family of algorithms, representations, and robotic perception-action pipelines focused on modeling and reasoning about object occupancy within shelf-like environments under occlusion, uncertainty, and constrained viewpoint conditions. The core concept involves constructing spatial occupancy distributions—typically in the form of probabilistic voxel grids, 2D spatial densities, or belief maps—that enable robust segmentation, search, and manipulation in complex cluttered domains, particularly for autonomous robots in warehouses, retail, or autonomous driving scenes. ShelfOcc architectures fuse visual or depth sensor data across multiple views or timesteps, applying learning-based or knowledge-enabled priors to propagate belief across occluded regions and guide downstream action sequences for retrieval, tidying, or 3D scene understanding.

1. Formal Representations of Shelf Occupancy

ShelfOcc methodologies instantiate occupancy as a discretized representation over the workspace of a shelf or shelf bin. Commonly this takes the form of a 3D voxel grid $V = \{v_1, \dots, v_N\}$, where each voxel is associated with a probability $P_i \in [0,1]$ that it is occupied by an object of interest. In "3D Object Segmentation for Shelf Bin Picking by Humanoid with Deep Learning and Occupancy Voxel Grid Map" (Wada et al., 2020), the workspace is bounded and discretized with a uniform grid (e.g., $X \in [-0.25, 0.25]$ m, $Y, Z \in [0, 0.50]$ m, grid size $\Delta = 0.005$ m), yielding $N_x \times N_y \times N_z$ voxels. Each occupancy probability is updated from per-view CNN predictions using noisy-OR fusion:

$$P_i = 1 - \prod_{k=1}^{K} \left(1 - p_i^k\right)$$

where $p_i^k$ is the prediction for voxel $i$ from view $k$.
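The noisy-OR update can be sketched numerically; `noisy_or_fusion` below is an illustrative helper (not code from the cited paper) that fuses per-view voxel probabilities:

```python
import numpy as np

def noisy_or_fusion(view_probs):
    """Fuse per-view occupancy predictions p_i^k into P_i via noisy-OR:
    P_i = 1 - prod_k (1 - p_i^k)."""
    view_probs = np.asarray(view_probs, dtype=float)
    return 1.0 - np.prod(1.0 - view_probs, axis=0)

# Two views over three voxels; a view that never sees a voxel (p = 0)
# leaves the other view's evidence intact.
P = noisy_or_fusion([[0.0, 0.6, 0.9],
                     [0.7, 0.6, 0.9]])
```

Note that noisy-OR is monotone: adding views can only raise $P_i$, which is what makes it suitable as an occlusion-tolerant evidence accumulator.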

Recent extensions in autonomous driving ("ShelfOcc: Native 3D Supervision beyond LiDAR for Vision-Based Occupancy Estimation" (Boeder et al., 19 Nov 2025)) generalize such occupancy maps to semantic voxel grids $V_t(v) \in \{0, \dots, K\}$, encoding multi-class semantic labels with visibility masks in BEV or 3D grids (e.g., $X, Y \in [-40, 40]$ m, $Z \in [-1, 5.4]$ m, $\Delta = 0.4$ m). Each cell is assigned free, occupied, or unobserved status for robust supervision and downstream loss-function weighting.
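The grid dimensions and three-way cell status can be made concrete with a short sketch; the bounds below follow the driving setup quoted above, while the mask logic and label constants are illustrative assumptions:

```python
import numpy as np

# Grid bounds and resolution as quoted for the driving setup.
X_RANGE, Y_RANGE, Z_RANGE, DELTA = (-40.0, 40.0), (-40.0, 40.0), (-1.0, 5.4), 0.4

def grid_shape():
    """Number of cells along each axis at resolution DELTA."""
    return tuple(int(round((hi - lo) / DELTA))
                 for lo, hi in (X_RANGE, Y_RANGE, Z_RANGE))

FREE, OCCUPIED, UNOBSERVED = 0, 1, 2  # illustrative status encoding

def label_cells(occupied_mask, visible_mask):
    """Assign each cell free / occupied / unobserved from a boolean
    occupancy mask and a camera-visibility mask of the same shape."""
    labels = np.full(occupied_mask.shape, UNOBSERVED, dtype=np.int8)
    labels[visible_mask & occupied_mask] = OCCUPIED
    labels[visible_mask & ~occupied_mask] = FREE
    return labels
```

Cells outside every camera frustum stay `UNOBSERVED`, so a downstream loss can zero-weight them rather than treating them as free space.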

2. Multi-View and Temporal Fusion Mechanisms

The challenge of occlusion is tackled via multi-view data accumulation and probabilistic fusion:

  • In shelf bin picking, multi-view RGB-D images are segmented by a fully convolutional CNN (VGG-style encoder, upconv decoder producing per-pixel softmax for object/background) (Wada et al., 2020). Segmentation probabilities are back-projected into the 3D voxel grid using camera intrinsics and extrinsics, aggregating probability estimates per voxel across views with the noisy-OR mechanism. This creates a dense 3D occupancy map robust to occlusion from the shelf geometry.
  • For large-scale scene understanding, ShelfOcc generates metrically consistent 3D pseudo-labels by aggregating static geometry across temporally aligned multi-camera video, filtering out spurious or noisy depths, and reintegrating dynamic objects framewise (Boeder et al., 19 Nov 2025). Semantic masks are obtained via open-vocabulary 2D FMs (GroundedSAM), lifted to 3D via depth unprojection, then accumulated and thresholded in voxel space for direct supervision of occupancy models.
  • Temporal fusion is crucial for robust static-dynamic disambiguation. Confidence filtering and majority voting are leveraged to prune ambiguous voxels, while camera visibility masks distinguish genuine free-space, occupied, and unobserved regions to prevent depth-bleeding and hallucination artifacts common to 2D rendering-based supervision.
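The confidence filtering and majority voting described above can be sketched as follows; this is a simplified stand-in for the papers' pipelines, with the threshold value and `-1` "no confident vote" convention assumed for illustration:

```python
import numpy as np

def temporal_fuse(frame_labels, frame_conf, conf_thresh=0.5):
    """Majority-vote per-voxel semantic labels over T frames, discarding
    votes below a confidence threshold.

    frame_labels: (T, N) integer labels per frame
    frame_conf:   (T, N) per-vote confidence in [0, 1]
    Returns fused labels of shape (N,), -1 where no confident vote survives.
    """
    T, N = frame_labels.shape
    fused = np.full(N, -1, dtype=int)
    for i in range(N):
        votes = frame_labels[frame_conf[:, i] >= conf_thresh, i]
        if votes.size:
            vals, counts = np.unique(votes, return_counts=True)
            fused[i] = vals[np.argmax(counts)]
    return fused
```

Pruning low-confidence votes before the majority step is what removes the ambiguous voxels; voxels with no surviving votes remain unlabeled rather than being hallucinated.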

3. Perception, Segmentation, and Knowledge Integration

ShelfOcc approaches incorporate deep learning, geometric, and symbolic reasoning to build occupancy belief. In knowledge-enabled agents (Winkler et al., 2016), perception pipelines utilize RGB-D segmentation and object hypothesis generation via texture cues (SIFT/SURF + RANSAC) and shape cues (hierarchical part graphs), with detection confidence used to update cell-wise occupancy through a Bayesian filter:

$$b_t(o_i) = \eta \, P(Z_t \mid o_i) \sum_{o_i'} P(o_i \mid o_i')\, b_{t-1}(o_i')$$

with persistence modeling and knowledge-driven semantic priors governing occlusion inference (e.g., stack-height priors, typical shelf arrangements). Ontological relations are exploited to suggest likely missing or occluded items, prioritizing manipulation actions that maximize expected information gain regarding shelf cell beliefs.
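One step of this filter is a standard predict-then-correct update; the sketch below shows it for a single cell with a discrete state space (the two-state example and its numbers are illustrative, not from the cited paper):

```python
import numpy as np

def bayes_update(belief_prev, transition, likelihood):
    """One step of the cell-wise Bayes filter:
    b_t(o) = eta * P(Z_t | o) * sum_{o'} P(o | o') * b_{t-1}(o')

    belief_prev: (S,) prior belief over cell states o'
    transition:  (S, S) persistence model, entry [o, o'] = P(o | o')
    likelihood:  (S,) observation model P(Z_t | o)
    """
    predicted = transition @ belief_prev       # marginalize over o'
    posterior = likelihood * predicted         # weight by observation
    return posterior / posterior.sum()         # eta normalizes

# Two states (empty, occupied), sticky persistence, a detection that
# favors "occupied".
b = bayes_update(np.array([0.5, 0.5]),
                 np.array([[0.9, 0.1], [0.1, 0.9]]),
                 np.array([0.2, 0.8]))
```

The persistence matrix plays the role of the paper's persistence modeling: objects tend to stay where they were between observations, so belief decays slowly in occluded cells instead of resetting.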

4. Planning, Search, and Manipulation Under Occlusion

Robotic search and manipulation tasks on shelves require ShelfOcc-style occupancy belief propagation to guide the selection of actions that efficiently reveal targets or restore shelf order:

  • In cluttered shelf search (Bejjani et al., 2020), the environment is modeled as a POMDP with the robot and placements encoded in the state, actions as planar displacements and gripper toggles, and observations as abstract masked images. Belief over target location is encoded in a neural generative head, trained via PPO and supervised cross-entropy. A hybrid planner interleaves receding-horizon physics simulation and closed-loop policy execution, weighting candidate search actions by expected return and belief heatmap peaks.
  • Mechanical search via lateral access (LAX-RAY, (Huang et al., 2020)) employs a distributional occupancy model over the projected shelf plane. Actions consist of lateral pushes planned to optimally reduce occupancy support area (DAR), entropy (DER-1), or expected entropy after $n$ sequential pushes (DER-MT). Policies are quantitatively benchmarked in simulation (FOSS) and real robot environments, with entropy reduction yielding superior performance in heavily occluded scenarios.
  • Knowledge-enabled shelf tidying agents (Winkler et al., 2016) plan object retrieval by rank-ordering occluder removal actions according to belief, semantic priors, and occlusion scores, then executing pick-place sequences maximizing layout quality and minimizing manipulation cost.
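The greedy entropy-reduction idea behind DER-1-style policies can be sketched in a few lines. Here `reveal_cell` is a deliberately toy forward model (pushing at a cell reveals it empty and renormalizes the target distribution); the real systems roll out a physics simulator instead:

```python
import numpy as np

def entropy(p):
    """Shannon entropy of a discrete distribution (0 log 0 := 0)."""
    p = np.asarray(p, float)
    p = p[p > 0]
    return float(-np.sum(p * np.log(p)))

def best_push(occupancy, actions, simulate):
    """Greedy one-step entropy reduction: pick the action whose simulated
    outcome distribution over the target's location has minimal entropy."""
    return min(actions, key=lambda a: entropy(simulate(occupancy, a)))

# Toy forward model: pushing at cell `a` reveals it empty, so its mass
# is redistributed over the remaining cells.
def reveal_cell(p, a):
    q = np.asarray(p, float).copy()
    q[a] = 0.0
    return q / q.sum()

occ = np.array([0.4, 0.3, 0.2, 0.1])   # belief over target position
a_star = best_push(occ, range(4), reveal_cell)
```

Multi-step variants (DER-MT in the paper's terminology) replace the single simulated step with an expectation over $n$ sequential pushes, at the cost of compounding forward-model error.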

5. Algorithms, Loss Functions, and Training Protocols

ShelfOcc methods are tailored to their target architecture and environment:

| Method | Representation | Training Target | Loss Function(s) |
| --- | --- | --- | --- |
| Bin Picking (Wada et al., 2020) | 3D probabilistic voxel grid | Fused multi-view object occupancy | Per-pixel weighted cross-entropy, Adam optimizer |
| Vision Supervision (ShelfOcc, Driving) (Boeder et al., 19 Nov 2025) | BEV semantic voxel grid | 3D pseudo-label occupancy & semantics | BCE for occupancy, CE for semantics, visibility-mask weighting |
| Shelf Tidy/Search (Winkler et al., 2016) | Cell-wise Bayes filter | Object detections, ontological priors | Entropy gain, cost heuristics, symbolic A* |
| Mechanical Search (Huang et al., 2020) | 2D support distribution | 2D density from depth and segmentation | Area reduction, entropy minimization, simulation-based rollouts |
| Cluttered Retrieval (Bejjani et al., 2020) | Belief map (heatmap) | Reinforcement and supervised signals | PPO surrogate, supervised pixelwise cross-entropy |

Details such as AdamW, linear warmup, backbone selection (ResNet-50 to VoVNet-99), and distributed training (8 GPUs) are employed for deep occupancy models (Boeder et al., 19 Nov 2025). Weighting strategies, confidence filtering, and data augmentations are established through ablation.
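A minimal NumPy sketch of the visibility-weighted occupancy BCE (the framework-agnostic form of the masking strategy above, not the authors' implementation) looks like this:

```python
import numpy as np

def masked_occupancy_bce(logits, targets, visible, eps=1e-9):
    """Binary cross-entropy over voxel occupancy in which only observed
    (free/occupied) cells contribute; unobserved cells get zero weight."""
    p = 1.0 / (1.0 + np.exp(-np.asarray(logits, float)))  # sigmoid
    t = np.asarray(targets, float)
    bce = -(t * np.log(p + eps) + (1.0 - t) * np.log(1.0 - p + eps))
    w = np.asarray(visible, float)
    return float((bce * w).sum() / max(w.sum(), 1.0))     # mean over observed
```

Because unobserved cells carry zero weight, a prediction there (right or wrong) never moves the loss, which is precisely what prevents the model from being penalized for regions no camera could see.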

6. Quantitative Evaluation and Comparative Benchmarks

ShelfOcc and its variants have been extensively validated in both simulation and real environments:

  • Bin Picking: Dense 3D segmentation enables successful grasp planning and target retrieval from narrow shelf bins, with real robot evaluations showing successful extraction even under occlusion (Wada et al., 2020).
  • Semantic Occupancy (Driving): ShelfOcc delivers superior geometric and semantic performance on Occ3D-nuScenes, outperforming prior shelf-supervised methods by up to 34% relative mIoU and 20% IoU, and approaching ~50% of fully LiDAR-supervised performance using only camera supervision (Boeder et al., 19 Nov 2025).
  • Mechanical Search: DER-2 achieves highest overall simulation success (87.3%), real-world trials with Fetch robot exceed 80% success for all policies, and DER variants outperform uniform baselines especially in multi-layer occlusion (Huang et al., 2020).
  • Cluttered Shelf Retrieval: The hybrid occlusion-aware planner reaches 90% success in real trials and near-real-time operation, surpassing model-free RL and hierarchical search (Bejjani et al., 2020).
  • Knowledge-Enabled Tidying: Manipulation sequences informed by occupancy belief and semantic priors yield occupancy accuracy boosts from ~68% to ~96% in simulation and robust correction of occlusion in real shelf arrangements (Winkler et al., 2016).
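The mIoU figures above are computed per semantic class over evaluated voxels; a generic sketch of that metric (the `ignore` convention for unobserved voxels is an assumption here) is:

```python
import numpy as np

def miou(pred, gt, num_classes, ignore=-1):
    """Mean intersection-over-union across semantic classes, skipping
    voxels labeled `ignore` (e.g. unobserved) and absent classes."""
    pred, gt = np.asarray(pred), np.asarray(gt)
    valid = gt != ignore
    ious = []
    for c in range(num_classes):
        p, g = (pred == c) & valid, (gt == c) & valid
        union = (p | g).sum()
        if union:
            ious.append((p & g).sum() / union)
    return float(np.mean(ious)) if ious else 0.0
```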

7. Limitations, Open Issues, and Future Directions

Limitations of ShelfOcc pipelines include noisy pseudo-labels for rare or fine-grained classes, incomplete reconstruction of dynamic objects, and reliance on high-quality vision foundation models, which may degrade in clutter or adverse conditions (Boeder et al., 19 Nov 2025). For mechanical and search-focused methods, simulation gaps (e.g., ignored friction, simplified contacts) can cause divergence from real performance, and compounding prediction errors in multi-step lookahead limit deeper planning (Huang et al., 2020). Knowledge-driven reasoning depends heavily on the accuracy of semantic priors and detection models.

Current research directions target temporal fusion for dynamic objects (scene flow, tracking), end-to-end co-training or distillation to improve semantic and geometric fidelity, adaptive discretization for finer object resolution, and expansion to 4D occupancy forecasting (Boeder et al., 19 Nov 2025). Mechanical search approaches are expanding to incorporate richer uncertainty modeling and to integrate depth along the shelf axis for non-lateral retrieval scenarios (Huang et al., 2020).

ShelfOcc collectively defines a technical paradigm wherein occupancy belief—integrating multi-modal data, probabilistic fusion, semantic priors, and optimized planning—enables robust robotic perception and action amid occlusion, clutter, and limited viewpoint, and undergirds scalable progress in autonomous retail, logistics, and scene understanding.
