Humanoid Occupancy in 3D Robotics

Updated 4 July 2026

Humanoid occupancy is a framework that discretizes 3D space into voxels carrying occupancy and semantic labels for humanoid robot perception.
It integrates multi-modal sensor data—such as cameras and LiDAR—with optimized sensor configurations to enhance scene understanding and navigation.
Recent systems leverage advanced annotation pipelines and fusion architectures, yielding improved accuracy in dynamic, human-rich environments.

Humanoid occupancy denotes a family of occupancy-based representations, datasets, and perception systems tailored to humanoid robots, in which three-dimensional space is discretized or implicitly parameterized so that each spatial element encodes occupancy and, in many formulations, semantic class information. In recent work, the term specifically refers to a generalized multimodal occupancy perception system that integrates hardware and software components, data acquisition devices, sensor layout, and a dedicated annotation pipeline for humanoid robots (Cui et al., 27 Jul 2025). Adjacent literature broadens the concept in two directions: panoramic and stereo surround occupancy for embodied platforms (Shi et al., 5 Nov 2025, Guo et al., 22 Jun 2026), and predictive self-modeling in which a humanoid maps joint configurations to three-dimensional body occupancy (Chen et al., 11 Jun 2026). The resulting field spans scene understanding for navigation and manipulation, human-aware occupancy in pedestrian-rich environments (Kim et al., 21 Nov 2025), and bodily self-representation, all under the shared premise that occupancy is a unifying intermediate for downstream decision-making.

1. Formalization and conceptual scope

In the explicit grid-based formulation used by Humanoid Occupancy, $\mathbb{R}^3$ is discretized into a voxel grid of size $X\times Y\times Z$, with occupancy probability

$o_{ijk}^{t}=P\bigl(\mathrm{O}(x_i,y_j,z_k,t)=1\bigr)\in[0,1]$

and semantic-label distribution

$s_{ijk,c}^{t}=P\bigl(Y(x_i,y_j,z_k,t)=c\bigr),\quad c\in\{1,\dots,C\}.$

A discrete occupancy-semantic tensor is then written as

$\mathbf{G}^{t}=\bigl\{\bigl(o_{ijk}^{t},\arg\max_c s_{ijk,c}^{t}\bigr)\bigr\}_{i,j,k},$

so that occupancy status and semantic identity are represented jointly rather than as separate maps (Cui et al., 27 Jul 2025).
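The joint occupancy-semantic tensor above can be illustrated with a small sketch. It assumes a dense grid stored as nested Python lists and a uniform voxel size; the function names `world_to_voxel`, `occupancy_semantic_cell`, and `build_grid` are hypothetical and are not taken from any of the cited systems.

```python
def world_to_voxel(point, origin, voxel_size, dims):
    """Map a metric point (x, y, z) to integer voxel indices (i, j, k).

    Returns None when the point falls outside the grid.
    """
    idx = tuple(int((p - o) // voxel_size) for p, o in zip(point, origin))
    if all(0 <= a < d for a, d in zip(idx, dims)):
        return idx
    return None


def occupancy_semantic_cell(o, s):
    """Return (o, argmax_c s_c) for one voxel, with classes numbered 1..C."""
    best = max(range(len(s)), key=lambda c: s[c])
    return (o, best + 1)


def build_grid(occ, sem):
    """Build G^t from occ[i][j][k] (probability) and sem[i][j][k] (length-C list)."""
    return [
        [
            [occupancy_semantic_cell(o, s) for o, s in zip(o_row, s_row)]
            for o_row, s_row in zip(o_plane, s_plane)
        ]
        for o_plane, s_plane in zip(occ, sem)
    ]
```

For example, `occupancy_semantic_cell(0.9, [0.1, 0.7, 0.2])` pairs the occupancy probability 0.9 with class 2, so each cell carries occupancy and semantics together rather than in two separate maps.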

This representation has close analogues in related datasets. OneOcc predicts a voxelized semantic occupancy grid $\mathbf{S}\in\{0,\ldots,C\}^{X\times Y\times Z}$ with per-voxel logits $\mathbf{Z}\in\mathbb{R}^{X\times Y\times Z\times C}$ and supports both Cartesian and cylindrical-polar discretizations to match panoramic sensing geometry (Shi et al., 5 Nov 2025). MobileOcc models occupancy as a function $f:\mathbb{R}^3\to[0,1]^C$ with $C=10$ for nine semantic classes plus free space, while also storing pedestrian instance identities within the grid (Kim et al., 21 Nov 2025). Humanoid-OmniOcc uses a robot-centric grid of $[44\times384\times384]$ voxels at $X\times Y\times Z$ 0 and predicts binary occupancy $X\times Y\times Z$ 1, with an extended semantic version containing 15 channels (Guo et al., 22 Jun 2026).

A distinct but closely related line replaces discrete semantic scene voxels with an implicit body-occupancy field. In the self-other distinction framework of Chen et al., the target is a kinematics-free predictive self-model, implemented through an implicit network that maps a query point, a ray direction, and the full robot state to a density and a visibility value (Chen et al., 11 Jun 2026). This suggests that humanoid occupancy is not restricted to exteroceptive scene completion; it also encompasses endogenous body-space modeling when occupancy is defined as “where the robot body is” under a given joint configuration.

Historically, humanoid use of occupancy voxel grids predates these large-scale benchmarks. Wada et al. discretized workspace around shelf bins at resolution $o_{ijk}^{t}=P\bigl(\mathrm{O}(x_i,y_j,z_k,t)=1\bigr)\in[0,1]$ 0, projected per-pixel object probabilities into a 3D voxel grid, fused them with log-odds updates, and extracted connected components for bin-picking; after 5 views, their occupancy-fusion method reported voxel precision $o_{ijk}^{t}=P\bigl(\mathrm{O}(x_i,y_j,z_k,t)=1\bigr)\in[0,1]$ 1, recall $o_{ijk}^{t}=P\bigl(\mathrm{O}(x_i,y_j,z_k,t)=1\bigr)\in[0,1]$ 2, and an 82% picking success rate (Wada et al., 2020). That earlier formulation was manipulation-centric and object-specific, whereas later humanoid occupancy systems generalize to full-scene semantic perception.
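The log-odds fusion step mentioned above can be sketched for a single voxel. This is the textbook binary log-odds update, not the multi-label implementation of Wada et al.; the prior of 0.5 and the clamping bound are assumptions chosen for illustration.

```python
import math


def logit(p):
    """Log-odds of a probability p in (0, 1)."""
    return math.log(p / (1.0 - p))


def fuse_log_odds(observations, prior=0.5, clamp=10.0):
    """Fuse per-view occupancy probabilities for one voxel.

    Each view contributes logit(p) - logit(prior); the running sum is
    clamped so a long run of views cannot saturate the estimate.
    """
    l = logit(prior)
    for p in observations:
        p = min(max(p, 1e-6), 1.0 - 1e-6)  # keep logit finite
        l += logit(p) - logit(prior)
        l = max(-clamp, min(clamp, l))
    return 1.0 / (1.0 + math.exp(-l))
```

Repeated agreeing views push the fused probability toward 0 or 1, while a view at 0.7 followed by one at 0.3 cancels back to the prior.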

2. Sensor configurations and dataset regimes

A central theme in humanoid occupancy research is that sensor layout cannot be inherited directly from autonomous driving. Humanoid Occupancy formulates sensor placement as an optimization

$o_{ijk}^{t}=P\bigl(\mathrm{O}(x_i,y_j,z_k,t)=1\bigr)\in[0,1]$ 3

where $o_{ijk}^{t}=P\bigl(\mathrm{O}(x_i,y_j,z_k,t)=1\bigr)\in[0,1]$ 4 measures coverage and $o_{ijk}^{t}=P\bigl(\mathrm{O}(x_i,y_j,z_k,t)=1\bigr)\in[0,1]$ 5 penalizes overlap and self-occlusion. The reported solution uses six cameras with $o_{ijk}^{t}=P\bigl(\mathrm{O}(x_i,y_j,z_k,t)=1\bigr)\in[0,1]$ 6 plus one $o_{ijk}^{t}=P\bigl(\mathrm{O}(x_i,y_j,z_k,t)=1\bigr)\in[0,1]$ 7 LiDAR on a stabilizing neck mount, yielding $o_{ijk}^{t}=P\bigl(\mathrm{O}(x_i,y_j,z_k,t)=1\bigr)\in[0,1]$ 8 horizontal coverage with minimal self-occlusion (Cui et al., 27 Jul 2025). The associated data collection platform is a wearable head-rig with identical 6 RGB cameras and a 40-line $o_{ijk}^{t}=P\bigl(\mathrm{O}(x_i,y_j,z_k,t)=1\bigr)\in[0,1]$ 9 LiDAR, a collector height of $s_{ijk,c}^{t}=P\bigl(Y(x_i,y_j,z_k,t)=c\bigr),\quad c\in\{1,\dots,C\}.$ 0 cm, and a neck stabilizer to suppress shake.

Other systems adopt different sensing trade-offs while targeting humanoid or humanoid-adjacent embodiments.

Resource	Sensor configuration	Data regime
Humanoid Occupancy (Cui et al., 27 Jul 2025)	6 RGB cameras + 40-line $s_{ijk,c}^{t}=P\bigl(Y(x_i,y_j,z_k,t)=c\bigr),\quad c\in\{1,\dots,C\}.$ 1 LiDAR	Home / Industrial / Outdoor clips
OneOcc (Shi et al., 5 Nov 2025)	Single panoramic camera	QuadOcc and Human360Occ benchmarks
Humanoid-OmniOcc (Guo et al., 22 Jun 2026)	Four stereo rigs at $s_{ijk,c}^{t}=P\bigl(Y(x_i,y_j,z_k,t)=c\bigr),\quad c\in\{1,\dots,C\}.$ 2, $s_{ijk,c}^{t}=P\bigl(Y(x_i,y_j,z_k,t)=c\bigr),\quad c\in\{1,\dots,C\}.$ 3, $s_{ijk,c}^{t}=P\bigl(Y(x_i,y_j,z_k,t)=c\bigr),\quad c\in\{1,\dots,C\}.$ 4, $s_{ijk,c}^{t}=P\bigl(Y(x_i,y_j,z_k,t)=c\bigr),\quad c\in\{1,\dots,C\}.$ 5 yaw	15 simulated indoor scenes + 5 real environments
MobileOcc (Kim et al., 21 Nov 2025)	Front stereo camera + dense Ouster-style LiDAR	Outdoor pedestrian-rich campus trajectories

OneOcc is explicitly designed for legged and humanoid robots with a single panoramic camera, emphasizing gait-introduced body jitter and $s_{ijk,c}^{t}=P\bigl(Y(x_i,y_j,z_k,t)=c\bigr),\quad c\in\{1,\dots,C\}.$ 6 continuity (Shi et al., 5 Nov 2025). Humanoid-OmniOcc instead follows a surround-stereo design derived from the exact sensor specifications of a Unitree G1 humanoid head: four stereo rigs with baseline $s_{ijk,c}^{t}=P\bigl(Y(x_i,y_j,z_k,t)=c\bigr),\quad c\in\{1,\dots,C\}.$ 7 cm, focal length $s_{ijk,c}^{t}=P\bigl(Y(x_i,y_j,z_k,t)=c\bigr),\quad c\in\{1,\dots,C\}.$ 8 px, horizontal FoV $s_{ijk,c}^{t}=P\bigl(Y(x_i,y_j,z_k,t)=c\bigr),\quad c\in\{1,\dots,C\}.$ 9, vertical FoV $\mathbf{G}^{t}=\bigl\{\bigl(o_{ijk}^{t},\arg\max_c s_{ijk,c}^{t}\bigr)\bigr\}_{i,j,k},$ 0, and rectified image size $\mathbf{G}^{t}=\bigl\{\bigl(o_{ijk}^{t},\arg\max_c s_{ijk,c}^{t}\bigr)\bigr\}_{i,j,k},$ 1 with FoV $\mathbf{G}^{t}=\bigl\{\bigl(o_{ijk}^{t},\arg\max_c s_{ijk,c}^{t}\bigr)\bigr\}_{i,j,k},$ 2 after rectification (Guo et al., 22 Jun 2026). Its Real2Sim2Real paradigm is defined by real sensor specifications driving physically accurate simulation, simulation generating annotated training data, and models trained in simulation being directly evaluated on real-world captures.

MobileOcc addresses a different but related deployment regime: mobile robots navigating densely pedestrian-populated, near-field outdoor scenes. It is built on the UT Campus Object Dataset, which provides time-synchronized front stereo RGB and dense LiDAR at 10 Hz; for occupancy annotation the streams are downsampled to 5 Hz, producing 116,511 frames, of which 37,622 contain at least one pedestrian (Kim et al., 21 Nov 2025). Although not a humanoid dataset, its human-aware occupancy formulation is directly relevant to humanoid robots operating in shared pedestrian spaces.

3. Annotation pipelines and ground-truth generation

The defining difficulty in humanoid occupancy is not only voxel prediction but also voxel supervision. Humanoid Occupancy generates scene labels by combining dynamic-object handling with static-scene aggregation. Dynamic objects receive 3D bounding boxes for “ordinary” pedestrians, cyclists, and vehicles, alongside point-wise segmentation inside boxes for “special-pose” pedestrians. Static scenes are built by multi-frame LiDAR stitching,

$\mathbf{G}^{t}=\bigl\{\bigl(o_{ijk}^{t},\arg\max_c s_{ijk,c}^{t}\bigr)\bigr\}_{i,j,k},$ 3

followed by voxelization of the merged cloud and majority-vote semantic assignment per voxel (Cui et al., 27 Jul 2025). This pipeline yields occupancy-semantic supervision over three scene types—Home, Industrial, and Outdoor—with 200 frames $\mathbf{G}^{t}=\bigl\{\bigl(o_{ijk}^{t},\arg\max_c s_{ijk,c}^{t}\bigr)\bigr\}_{i,j,k},$ 4 (180 train + 20 val) clips and 8–13 semantic classes per scene.
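The voxelization and majority-vote step can be sketched as follows. The sketch assumes the multi-frame points have already been merged into one common frame and carry one semantic label each; `voxelize_majority` is a hypothetical helper, and the sparse dictionary layout is a choice made for brevity.

```python
from collections import Counter, defaultdict


def voxelize_majority(points, labels, voxel_size):
    """Assign each occupied voxel the most frequent label among its points.

    points: iterable of (x, y, z) in a common frame
    labels: one semantic label per point
    Returns a sparse dict {(i, j, k): label}; absent keys are unobserved.
    """
    votes = defaultdict(Counter)
    for (x, y, z), label in zip(points, labels):
        key = (int(x // voxel_size), int(y // voxel_size), int(z // voxel_size))
        votes[key][label] += 1
    return {key: counter.most_common(1)[0][0] for key, counter in votes.items()}
```

Ties are broken by insertion order here; a production pipeline would likely need an explicit tie-breaking rule, for instance a class priority list.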

Humanoid-OmniOcc uses a different annotation regime built around simulation-first labeling and LiDAR-based real-world verification. In simulation, each frame’s mesh is voxelized at $\mathbf{G}^{t}=\bigl\{\bigl(o_{ijk}^{t},\arg\max_c s_{ijk,c}^{t}\bigr)\bigr\}_{i,j,k},$ 5 cm and then back-projected into the cameras for depth consistency, assigning label $\mathbf{G}^{t}=\bigl\{\bigl(o_{ijk}^{t},\arg\max_c s_{ijk,c}^{t}\bigr)\bigr\}_{i,j,k},$ 6 for free, $\mathbf{G}^{t}=\bigl\{\bigl(o_{ijk}^{t},\arg\max_c s_{ijk,c}^{t}\bigr)\bigr\}_{i,j,k},$ 7 for occupied, and $\mathbf{G}^{t}=\bigl\{\bigl(o_{ijk}^{t},\arg\max_c s_{ijk,c}^{t}\bigr)\bigr\}_{i,j,k},$ 8 for unknown according to

$\mathbf{G}^{t}=\bigl\{\bigl(o_{ijk}^{t},\arg\max_c s_{ijk,c}^{t}\bigr)\bigr\}_{i,j,k},$ 9

In real environments, LiDAR point clouds are fused, denoised, voxelized into the same grid, and processed with a Bresenham ray-tracer to mark free, occupied, and unknown space within each camera’s FoV, after which semantic labels are manually assigned (Guo et al., 22 Jun 2026).
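The ray-tracing step for real captures can be sketched with an integer 3D Bresenham traversal. The label values `FREE`, `OCCUPIED`, and `UNKNOWN` below are placeholders chosen for this sketch, and the field-of-view restriction described in the text is omitted; sensor and hit positions are assumed to be already expressed as voxel indices.

```python
FREE, OCCUPIED, UNKNOWN = 0, 1, 255  # placeholder label values (assumption)


def bresenham3d(start, end):
    """Integer voxels visited on the line from start to end, inclusive."""
    p = list(start)
    d = [abs(e - s) for s, e in zip(start, end)]
    step = [1 if e > s else -1 for s, e in zip(start, end)]
    a = d.index(max(d))                      # driving axis
    b, c = [i for i in range(3) if i != a]   # the two minor axes
    err_b, err_c = 2 * d[b] - d[a], 2 * d[c] - d[a]
    cells = [tuple(p)]
    while p[a] != end[a]:
        p[a] += step[a]
        if err_b >= 0:
            p[b] += step[b]
            err_b -= 2 * d[a]
        if err_c >= 0:
            p[c] += step[c]
            err_c -= 2 * d[a]
        err_b += 2 * d[b]
        err_c += 2 * d[c]
        cells.append(tuple(p))
    return cells


def carve(sensor, hits):
    """Mark traversed voxels FREE and ray endpoints OCCUPIED (sparse grid)."""
    grid = {}
    for hit in hits:
        for cell in bresenham3d(sensor, hit)[:-1]:
            grid.setdefault(cell, FREE)
    for hit in hits:
        grid[hit] = OCCUPIED  # occupied evidence overrides free space
    return grid
```

Voxels that no ray touches stay absent from the dictionary and can be read back as unknown with `grid.get(cell, UNKNOWN)`.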

MobileOcc’s annotation pipeline is the most elaborate for human occupancy. Static occupancy is derived by removing dynamic pedestrian points from LiDAR returns inside 3D pedestrian detections, fusing the remaining points over time into a global OctoMap, and assigning final static semantic labels through per-class 2D semantic counts projected onto 3D points with a max-vote rule (Kim et al., 21 Nov 2025). Human occupancy is then superimposed through a dedicated mesh optimization framework: an initial CLIFF estimate of SMPL parameters $\mathbf{S}\in\{0,\ldots,C\}^{X\times Y\times Z}$ 0, $\mathbf{S}\in\{0,\ldots,C\}^{X\times Y\times Z}$ 1, and $\mathbf{S}\in\{0,\ldots,C\}^{X\times Y\times Z}$ 2 is refined by visibility filtering, rigid ICP alignment against per-instance LiDAR points, and joint non-rigid optimization of $\mathbf{S}\in\{0,\ldots,C\}^{X\times Y\times Z}$ 3 under a multi-term loss

$\mathbf{S}\in\{0,\ldots,C\}^{X\times Y\times Z}$ 4

The static map and refined SMPL meshes are finally voxelized at $\mathbf{S}\in\{0,\ldots,C\}^{X\times Y\times Z}$ 5 m resolution and transformed into a shared robot-local grid (Kim et al., 21 Nov 2025). This is a human-aware annotation pipeline rather than a purely geometric occupancy pipeline, because it explicitly models deformable pedestrian geometry.

An important methodological implication is that humanoid occupancy labels are increasingly produced by hybrid procedures: geometric carving for free space, semantic voting for static structure, and model-based fitting for humans or articulated bodies. The literature suggests that annotation fidelity is becoming as decisive as network design.

4. Model architectures and fusion strategies

Humanoid occupancy models differ primarily in how they align heterogeneous sensors and lift 2D or point-set evidence into 3D. In Humanoid Occupancy, the camera branch uses a shared ResNet50 + FPN backbone, producing $\mathbf{S}\in\{0,\ldots,C\}^{X\times Y\times Z}$ 6 at strides $\mathbf{S}\in\{0,\ldots,C\}^{X\times Y\times Z}$ 7, while the LiDAR branch uses PointPillars to generate a BEV feature map $\mathbf{S}\in\{0,\ldots,C\}^{X\times Y\times Z}$ 8 (Cui et al., 27 Jul 2025). Fusion is performed by deformable cross-attention, with LiDAR BEV features treated as queries and camera features as keys and values:

$\mathbf{S}\in\{0,\ldots,C\}^{X\times Y\times Z}$ 9

Temporal integration follows a BEVDet4D-style design in which past BEV features are warped into the current frame and concatenated before a BEV encoder refines them. The final head reshapes channel features into height bins and predicts occupancy with a sigmoid branch and semantics with a softmax branch.
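The warp-then-concatenate pattern for temporal integration can be sketched on a 2D BEV grid. This is a simplified nearest-neighbour version under assumed conventions: the robot sits at the grid centre, rows index y, columns index x, and `(yaw, tx, ty)` is the pose of the current frame expressed in the past frame. It is not the BEVDet4D implementation, and the helper names are hypothetical.

```python
import math


def warp_bev(past, yaw, tx, ty, cell=1.0, fill=0.0):
    """Resample a past BEV grid (H x W) into the current robot frame.

    A current-frame point p maps to R(yaw) p + t in the past frame, and the
    past grid is sampled there with nearest-neighbour lookup.
    """
    h, w = len(past), len(past[0])
    cx, cy = (w - 1) / 2.0, (h - 1) / 2.0
    c, s = math.cos(yaw), math.sin(yaw)
    out = [[fill] * w for _ in range(h)]
    for r in range(h):
        for q in range(w):
            x, y = (q - cx) * cell, (r - cy) * cell
            xp, yp = c * x - s * y + tx, s * x + c * y + ty
            qp, rp = round(xp / cell + cx), round(yp / cell + cy)
            if 0 <= rp < h and 0 <= qp < w:
                out[r][q] = past[rp][qp]
    return out


def concat_bev(current, warped):
    """Stack current and warped features per cell, ready for a BEV encoder."""
    return [[(a, b) for a, b in zip(c_row, w_row)]
            for c_row, w_row in zip(current, warped)]
```

If the robot advanced one cell along x between frames, a static feature that was one cell ahead of the robot in the past grid lands on the centre cell after warping, so the two frames agree before concatenation.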

OneOcc addresses a different failure mode: panoramic sensing on legged or humanoid platforms with severe geometric discontinuities and gait jitter. Its architecture combines four modules (Shi et al., 5 Nov 2025). Dual-Projection Equirectangular–Radial fusion maintains parallel encoders for the raw annular image and its equirectangular unfolding, preserving both native panoramic geometry and convolution-friendly continuity. Bi-Grid Voxelization constructs both Cartesian and cylindrical-polar 3D volumes and injects polar context into the Cartesian stream through precomputed index mappings. The Hierarchical AMoE-3D decoder is a depthwise-separable 3D U-Net that uses dual-path volumetric saliency and gradient-energy-driven mixture-of-experts routing. Gait Displacement Compensation predicts a small per-scale 2D warp $\mathbf{Z}\in\mathbb{R}^{X\times Y\times Z\times C}$ 0 from pooled image features and shifts sampling coordinates during voxel lifting, thereby correcting feature-level motion misalignment without extra sensors.
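The idea of a precomputed index mapping between Cartesian and cylindrical-polar grids can be sketched as follows. The bin layout, the additive injection, and the names `cart_to_polar_index` and `inject_polar` are assumptions for illustration; the sketch covers only the horizontal plane and treats height as shared between the two grids.

```python
import math


def cart_to_polar_index(dims, voxel, origin, n_r, n_theta, r_max):
    """Precompute the (radial, angular) bin for each Cartesian (i, j) column."""
    mapping = {}
    for i in range(dims[0]):
        for j in range(dims[1]):
            x = origin[0] + (i + 0.5) * voxel   # voxel centre
            y = origin[1] + (j + 0.5) * voxel
            r = math.hypot(x, y)
            theta = math.atan2(y, x) % (2.0 * math.pi)
            if r < r_max:
                ri = int(r / r_max * n_r)
                ti = int(theta / (2.0 * math.pi) * n_theta) % n_theta
                mapping[(i, j)] = (ri, ti)
    return mapping


def inject_polar(cart_feat, polar_feat, mapping):
    """Add the matching polar feature to each Cartesian cell (a simple gather)."""
    out = [row[:] for row in cart_feat]
    for (i, j), (ri, ti) in mapping.items():
        out[i][j] += polar_feat[ri][ti]
    return out
```

Because the mapping depends only on grid geometry, it can be built once and reused for every frame, which keeps the per-frame cost to a table lookup.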

Humanoid-OmniOcc’s HS model relies on stereo-guided depth priors to improve 2D-to-3D lifting (Guo et al., 22 Jun 2026). A shared 2D backbone extracts left and right image features for each stereo rig, a disparity cost volume is built by feature correlation, and this cost volume is converted into a depth-aligned volume that yields a depth posterior

$\mathbf{Z}\in\mathbb{R}^{X\times Y\times Z\times C}$ 1

Only the left-view features are then lifted into the robot-centric voxel grid through trilinear splatting weighted by the depth posterior. A lightweight 3D decoder maps the lifted tensor to occupancy logits, supervised by binary cross-entropy together with focal, geometric, Lovász, and depth bin-wise terms.
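Posterior-weighted lifting can be sketched for a single pixel. The sketch assumes the depth posterior is a softmax over per-bin matching scores, places the camera at the grid origin, uses a scalar feature, and substitutes nearest-voxel accumulation for the trilinear splatting described above; all names are hypothetical.

```python
import math


def depth_posterior(scores):
    """Softmax over per-depth-bin matching scores (higher means better match)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]  # subtract max for stability
    total = sum(exps)
    return [e / total for e in exps]


def lift_pixel(feature, ray_dir, depths, posterior, voxel, grid):
    """Accumulate a pixel feature into the voxel reached at each depth bin.

    Each depth hypothesis d places the feature at d * ray_dir, weighted by
    the posterior mass of that bin. `grid` is a sparse dict updated in place.
    """
    for d, w in zip(depths, posterior):
        key = tuple(int(math.floor(d * c / voxel)) for c in ray_dir)
        grid[key] = grid.get(key, 0.0) + w * feature
    return grid
```

A sharply peaked posterior concentrates the feature in one voxel, while an ambiguous one spreads it along the ray, which is why the quality of the stereo prior matters for the lifted volume.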

A separate architectural branch appears in self-body occupancy modeling. Chen et al. use a part-aware proprioceptive encoder that partitions the 29 joint angles into torso, left and right arms, and left and right legs, encodes each group by small MLPs, and concatenates the results into a 256-D posture code (Chen et al., 11 Jun 2026). Query points and ray directions receive sinusoidal positional encoding, and an implicit MLP with 6–8 hidden layers and two output heads predicts density and visibility. Training is driven first by proprioceptive–visual correspondence through an InfoNCE objective and then by silhouette reconstruction using volumetric rendering. This is not a scene-semantic architecture, but it is an occupancy architecture in the strict sense.
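The sinusoidal positional encoding applied to query points and ray directions can be sketched as below. The frequency schedule (powers of two, with the raw input kept as the first entries) follows the common NeRF-style convention and is an assumption; the source does not state the number of frequencies or the exact scaling.

```python
import math


def positional_encoding(v, num_freqs):
    """Encode a vector as [v, sin(2^k v), cos(2^k v)] for k = 0..num_freqs-1."""
    out = list(v)
    for k in range(num_freqs):
        freq = 2.0 ** k
        for x in v:
            out.append(math.sin(freq * x))
            out.append(math.cos(freq * x))
    return out
```

A 3D query point with four frequencies yields 3 + 3 * 2 * 4 = 27 values, which would then be concatenated with the posture code before entering the implicit MLP.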

5. Benchmarks, metrics, and quantitative findings

Evaluation protocols in humanoid occupancy are centered on per-voxel IoU and mIoU, but individual systems introduce additional metrics that reflect specific deployment goals. Humanoid Occupancy reports mIoU over all voxels and classes, together with rayIoU computed along sampled LiDAR beams (Cui et al., 27 Jul 2025). On its benchmark range $\mathbf{Z}\in\mathbb{R}^{X\times Y\times Z\times C}$ 2 m and $\mathbf{Z}\in\mathbb{R}^{X\times Y\times Z\times C}$ 3 m with voxel size $\mathbf{Z}\in\mathbb{R}^{X\times Y\times Z\times C}$ 4 m, single-frame camera-plus-LiDAR prediction achieves mIoU $\mathbf{Z}\in\mathbb{R}^{X\times Y\times Z\times C}$ 5 and rayIoU $\mathbf{Z}\in\mathbb{R}^{X\times Y\times Z\times C}$ 6, while the two-frame variant reaches mIoU $\mathbf{Z}\in\mathbb{R}^{X\times Y\times Z\times C}$ 7 and rayIoU $\mathbf{Z}\in\mathbb{R}^{X\times Y\times Z\times C}$ 8. Under identical training settings, BEVDet (camera only) gives mIoU $\mathbf{Z}\in\mathbb{R}^{X\times Y\times Z\times C}$ 9, FB-Occ (camera only) $f:\mathbb{R}^3\to[0,1]^C$ 0, BEVFusion (camera + LiDAR, 1 frame) $f:\mathbb{R}^3\to[0,1]^C$ 1, and HumanoidOcc (camera + LiDAR, 1 frame) $f:\mathbb{R}^3\to[0,1]^C$ 2 with 40.5 M parameters versus 60.6 M. The reported two-frame model is described as best in both accuracy and efficiency.
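Per-class IoU and mIoU over voxels can be computed as in the sketch below. It assumes flattened integer label lists and skips classes that appear in neither prediction nor ground truth; rayIoU is not covered. The `ignore` argument is a hypothetical convenience for unobserved voxels.

```python
def per_class_iou(pred, gt, num_classes, ignore=None):
    """Return {class: IoU} over flattened voxel labels.

    Voxels whose ground-truth label equals `ignore` are skipped, and classes
    with an empty union are left out of the result.
    """
    ious = {}
    for c in range(num_classes):
        inter = union = 0
        for p, g in zip(pred, gt):
            if g == ignore:
                continue
            inter += (p == c and g == c)
            union += (p == c or g == c)
        if union:
            ious[c] = inter / union
    return ious


def mean_iou(ious):
    """Average IoU over the classes that were actually evaluated."""
    return sum(ious.values()) / len(ious)
```

For `pred = [0, 1, 1, 2]` and `gt = [0, 1, 2, 2]`, the per-class IoUs are 1.0, 0.5, and 0.5, giving an mIoU of 2/3.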

Humanoid-OmniOcc evaluates voxel IoU, mean IoU, precision, and recall on both held-out simulation and real-world captures (Guo et al., 22 Jun 2026). On the simulated test set, HS reports IoU $f:\mathbb{R}^3\to[0,1]^C$ 3 and mIoU $f:\mathbb{R}^3\to[0,1]^C$ 4, compared with FB-Occ at $f:\mathbb{R}^3\to[0,1]^C$ 5, FlashOcc at $f:\mathbb{R}^3\to[0,1]^C$ 6, SurroundOcc at $f:\mathbb{R}^3\to[0,1]^C$ 7, and GaussianFormer at $f:\mathbb{R}^3\to[0,1]^C$ 8. In real-world evaluation, the best monocular baseline, SurroundOcc, reports IoU $f:\mathbb{R}^3\to[0,1]^C$ 9 and mIoU $C=10$ 0, whereas HS reports IoU $C=10$ 1 and mIoU $C=10$ 2. The ablation on the stereo backbone shows that replacing the default FoundationStereo with LightStereo-S, COEX, or IGEV reduces real-world mIoU from 19.26 to as low as 6–9.

OneOcc evaluates on QuadOcc and Human360Occ using standard per-class IoU, mIoU, precision, and recall on non-empty voxels (Shi et al., 5 Nov 2025). On QuadOcc, OneOcc achieves mIoU $C=10$ 3, precision $C=10$ 4, and recall $C=10$ 5, compared with LMSCNet at mIoU $C=10$ 6 and MonoScene at $C=10$ 7. On Human360Occ, the within-city split yields mIoU $C=10$ 8 versus MonoScene at $C=10$ 9, while the cross-city split yields $[44\times384\times384]$ 0 versus $[44\times384\times384]$ 1. These gains are attributed in the paper to the combined effect of dual projections, bi-grid reasoning, the AMoE-3D decoder, and gait compensation.

MobileOcc introduces a broader benchmark suite that includes dense semantic occupancy, panoptic occupancy, pedestrian detection, and pedestrian velocity prediction (Kim et al., 21 Nov 2025). At $[44\times384\times384]$ 2 m grid resolution, FlashOcc achieves the highest mIoU at approximately $[44\times384\times384]$ 3, ahead of VoxFormer at $[44\times384\times384]$ 4 and Panoptic-FlashOcc at $[44\times384\times384]$ 5. For pedestrian IoU, FlashOcc reports $[44\times384\times384]$ 6, VoxFormer $[44\times384\times384]$ 7, and Panoptic-FlashOcc $[44\times384\times384]$ 8. In panoptic evaluation, Panoptic-FlashOcc gives $[44\times384\times384]$ 9, $X\times Y\times Z$ 00, and $X\times Y\times Z$ 01, while BEVDet4D detection-only yields $X\times Y\times Z$ 02. For velocity prediction, BEVDet4D reports $X\times Y\times Z$ 03- $X\times Y\times Z$ 04 m/s, and Panoptic-FlashOcc-vel reports $X\times Y\times Z$ 05- $X\times Y\times Z$ 06 m/s, $X\times Y\times Z$ 07- $X\times Y\times Z$ 08 m/s, and $X\times Y\times Z$ 09- $X\times Y\times Z$ 10 m/s with mIoU $X\times Y\times Z$ 11. The human mesh optimization stage is also evaluated independently: on 3DPW with synthetic Ouster LiDAR, “Ours (Ouster-64)” achieves MPJPE $X\times Y\times Z$ 12 mm and PA-MPJPE $X\times Y\times Z$ 13 mm without root alignment, while “Ours (Ouster-128)” reaches MPJPE $X\times Y\times Z$ 14 mm and PA-MPJPE $X\times Y\times Z$ 15 mm.

For self-body occupancy, Chen et al. evaluate IoU, MSE, MAE, and Chamfer Distance between predicted and ground-truth point clouds (Chen et al., 11 Jun 2026). With oracle masks the self-model reaches IoU $X\times Y\times Z$ 16, MSE $X\times Y\times Z$ 17, MAE $X\times Y\times Z$ 18, and CD $X\times Y\times Z$ 19 mm; with their pseudo-ground-truth masks at 99.5% accuracy, performance remains near IoU $X\times Y\times Z$ 20, MSE $X\times Y\times Z$ 21, MAE $X\times Y\times Z$ 22, and CD $X\times Y\times Z$ 23 mm. At 80% mask accuracy, IoU drops to $X\times Y\times Z$ 24 and CD rises to $X\times Y\times Z$ 25 mm, while at 50% accuracy the model degrades substantially.
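Chamfer Distance between two point clouds can be sketched with a brute-force nearest-neighbour search. Definitions vary across papers (squared versus unsquared distances, sum versus mean of the two directions); this sketch uses the sum of the two directional means of unsquared Euclidean distance, which is an assumption and may differ from the convention used by Chen et al.

```python
import math


def chamfer_distance(a, b):
    """Symmetric Chamfer Distance between point lists a and b.

    Each direction averages the distance from every point to its nearest
    neighbour in the other set; the two averages are then summed.
    """
    def one_way(src, dst):
        return sum(min(math.dist(p, q) for q in dst) for p in src) / len(src)

    return one_way(a, b) + one_way(b, a)
```

The result is in the same units as the input coordinates, so clouds given in metres must be scaled before comparing against values reported in millimetres.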

6. Applications, misconceptions, and open problems

The most direct downstream uses of humanoid occupancy are collision-aware locomotion, task-space manipulation, and map-centric navigation. Humanoid Occupancy states that its grid representation supports navigation and path planning with A* and D*, locomotion and footstep planning on uneven terrain, manipulation and grasping with semantic voxels such as “chair,” “table,” and “objects,” and teleoperation or mixed reality through dense occupancy reconstructions (Cui et al., 27 Jul 2025). Humanoid-OmniOcc similarly identifies collision-aware locomotion via DWA and RRT*, arm-reach manipulation planning, and incremental map fusion for long-range navigation, revisiting, and loop-closing as target applications (Guo et al., 22 Jun 2026). MobileOcc frames its contribution explicitly in terms of safer, more human-centered mobile robot navigation in pedestrian-rich near-field environments (Kim et al., 21 Nov 2025).
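As an illustration of how a planner can consume an occupancy grid, the sketch below runs A* on a 2D binary slice. It assumes a 4-connected grid with unit step cost and a Manhattan heuristic; it is a generic textbook planner, not the planning stack of any cited system.

```python
import heapq


def astar(grid, start, goal):
    """Shortest 4-connected path on a binary grid (1 = occupied), or None."""
    def h(p):
        return abs(p[0] - goal[0]) + abs(p[1] - goal[1])

    open_heap = [(h(start), 0, start)]
    came_from = {start: None}
    cost = {start: 0}
    while open_heap:
        _, g, cur = heapq.heappop(open_heap)
        if cur == goal:
            path = []
            while cur is not None:
                path.append(cur)
                cur = came_from[cur]
            return path[::-1]
        if g > cost[cur]:
            continue  # stale heap entry
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nxt = (cur[0] + dr, cur[1] + dc)
            if not (0 <= nxt[0] < len(grid) and 0 <= nxt[1] < len(grid[0])):
                continue
            if grid[nxt[0]][nxt[1]]:
                continue  # occupied cell
            ng = g + 1
            if ng < cost.get(nxt, float("inf")):
                cost[nxt] = ng
                came_from[nxt] = cur
                heapq.heappush(open_heap, (ng + h(nxt), ng, nxt))
    return None
```

In practice the slice would come from thresholding occupancy probabilities at a chosen height band, and semantic labels could adjust traversal costs instead of acting as hard obstacles.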

A recurring misconception is that humanoid occupancy is simply autonomous-driving semantic scene completion with a different robot body. The literature rejects that equivalence in several ways. Humanoid-OmniOcc argues that existing occupancy datasets are predominantly designed for autonomous driving with forward-facing cameras, far-field geometry, and static road priors, which limits applicability to embodied humanoid perception (Guo et al., 22 Jun 2026). OneOcc addresses the specific panoramic continuity and gait-induced jitter of legged and humanoid embodiments (Shi et al., 5 Nov 2025). Humanoid Occupancy emphasizes kinematic interference and self-occlusion in sensor layout (Cui et al., 27 Jul 2025). These are not minor implementation differences; they alter sensing geometry, annotation design, and architectural choices.

Another misconception is that occupancy is only an external world model. Chen et al. show that a humanoid can learn self-other distinction from proprioceptive-visual correspondence and then train a predictive self-model that maps joint configurations to three-dimensional body occupancy, supporting target reaching, collision-aware motion planning, and human-to-robot motion retargeting (Chen et al., 11 Jun 2026). This suggests that humanoid occupancy has bifurcated into two complementary regimes: exteroceptive scene occupancy and proprioceptively conditioned self-occupancy.

The open problems are equally consistent across papers. Humanoid Occupancy reports limited dataset scale and diversity, pose drift in temporal warping beyond two frames, and residual blind spots near hips and arms (Cui et al., 27 Jul 2025). Humanoid-OmniOcc notes that dynamic agents are not annotated and that material reflectance mismatch can degrade depth priors in specular regions (Guo et al., 22 Jun 2026). MobileOcc observes that vision-only near-field pedestrian occupancy remains weak, with pedestrian IoU below 10%, and that fine-grained motion disambiguation such as forward versus backward walking remains difficult (Kim et al., 21 Nov 2025). In the self-modeling setting, performance depends strongly on pseudo-mask quality, and removing the visibility branch reduces IoU by about 5 points (Chen et al., 11 Jun 2026). A plausible implication is that future humanoid occupancy systems will require simultaneous progress in multimodal sensing, temporally stable calibration, human-instance modeling, and scalable annotation.

Across these works, humanoid occupancy emerges as a general-purpose spatial intermediate rather than a single benchmark task. Its core promise lies in converting heterogeneous sensory streams and body-state signals into dense, robot-centric occupancy structure that is usable by planning, control, and interaction modules. The present literature shows that this promise is technically viable, but also that robustness to dynamic humans, embodiment-specific occlusions, and sim-to-real appearance gaps remains an active research frontier.