MobileOcc: 3D Human-Aware Occupancy for Robots

Updated 23 June 2026
  • MobileOcc is a human-aware 3D semantic occupancy dataset that integrates synchronized RGB images and LiDAR data to model dynamic pedestrian environments.
  • It employs a novel human mesh optimization framework that refines full SMPL body meshes through multi-stage alignment and joint optimization using monocular and LiDAR cues.
  • The dataset standardizes benchmarks for semantic occupancy and pedestrian velocity prediction with reproducible metrics, advancing research in dynamic, human-populated robotic navigation.

MobileOcc is a human-aware 3D semantic occupancy dataset specifically designed for mobile robots navigating in pedestrian-rich environments. Unlike prior datasets focused on autonomous driving, MobileOcc addresses the underexplored challenge of dense 3D occupancy perception and human motion understanding in crowded scenes encountered by indoor and campus mobile platforms. Its annotation pipeline incorporates a novel mesh optimization framework for non-rigid human occupancy modeling, combining monocular image cues and LiDAR observations to reconstruct and refine deformable human geometry. MobileOcc standardizes two benchmarks—semantic occupancy and pedestrian velocity prediction—across monocular, stereo, and panoptic architectures with reproducible metrics and baseline implementations (Kim et al., 21 Nov 2025).

1. Data Acquisition and Preprocessing

MobileOcc uses synchronized RGB images and LiDAR sweeps from the UT Campus Object Dataset (CODa). Preprocessing involves:

  • 2D pedestrian tracking: YOLOX with OC-SORT provides per-frame bounding boxes and track IDs for each pedestrian.
  • Keypoint and mask estimation: ViTPose estimates 2D skeletons; Mask2Former supplies dense per-instance masks for visible objects.
  • LiDAR-to-image association: Each LiDAR point is associated with its nearest image-segmentation mask using time-aligned sensor data.
  • Semantic segmentation: A Cityscapes-trained model delivers per-pixel semantic labels for static classes such as road, vegetation, and pole.

All detections and associations are referenced in a robot-local frame, enabling explicit exclusion of dynamic pedestrian points from the static map.
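The LiDAR-to-image association step can be illustrated with a minimal sketch. It assumes a pinhole camera model, points already transformed into the camera frame, and a direct pixel lookup in an instance-ID mask (a simplification of the nearest-mask rule described above). The function names, the intrinsics `fx`, `fy`, `cx`, `cy`, and the convention that ID 0 means background are hypothetical and not taken from the MobileOcc pipeline.

```python
from typing import List, Optional, Sequence, Tuple

Point = Tuple[float, float, float]


def project_point(p_cam: Point, fx: float, fy: float,
                  cx: float, cy: float) -> Optional[Tuple[int, int]]:
    """Project a camera-frame point (z forward) to integer pixel coordinates."""
    x, y, z = p_cam
    if z <= 0.0:
        return None  # point lies behind the camera plane
    u = fx * x / z + cx
    v = fy * y / z + cy
    return int(round(u)), int(round(v))


def associate_points(points_cam: Sequence[Point],
                     instance_mask: Sequence[Sequence[int]],
                     fx: float, fy: float,
                     cx: float, cy: float) -> List[Optional[int]]:
    """Return the instance ID under each projected point.

    None marks points that fall outside the image or behind the camera.
    """
    height = len(instance_mask)
    width = len(instance_mask[0])
    labels: List[Optional[int]] = []
    for p in points_cam:
        uv = project_point(p, fx, fy, cx, cy)
        if uv is None or not (0 <= uv[0] < width and 0 <= uv[1] < height):
            labels.append(None)
        else:
            u, v = uv
            labels.append(instance_mask[v][u])  # row index = v, column = u
    return labels
```

Points labeled with a pedestrian instance ID can then be excluded from static mapping, while the remaining points keep their static semantic labels.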

2. Static Mapping and Octree-Based Voxel Fusion

MobileOcc constructs a high-fidelity static map by:

  • Pedestrian masking: Applying a LiDAR-based 3D pedestrian detector at each sweep to remove points inside detected pedestrian bounding boxes.
  • Octree-based fusion: Aggregating all non-pedestrian LiDAR points over multiple sweeps using OctoMap, preserving per-voxel semantic class vote counts.
  • Free/unknown labeling: Each voxel is labeled as "occupied" (with a Cityscapes semantic label), "free," or "unknown" by querying the occupancy map.
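As a rough illustration of pedestrian masking and per-voxel semantic voting, the sketch below accumulates label votes in a flat dictionary keyed by voxel index. This is a stand-in for the idea only: it does not use OctoMap, performs no ray casting for free space, and assumes axis-aligned pedestrian boxes. All names (`fuse_sweeps`, `inside_any_box`, `voxel_key`) are hypothetical.

```python
import math
from collections import Counter, defaultdict


def voxel_key(point, resolution):
    """Integer voxel index of a 3D point at the given resolution."""
    return tuple(math.floor(c / resolution) for c in point)


def inside_any_box(point, boxes):
    """True if the point lies in any axis-aligned box given as (min_xyz, max_xyz)."""
    return any(
        all(lo[i] <= point[i] <= hi[i] for i in range(3))
        for lo, hi in boxes
    )


def fuse_sweeps(sweeps, resolution):
    """Fuse sweeps into a {voxel_index: majority_label} map.

    Each sweep is a tuple (points, labels, pedestrian_boxes). Points inside
    a pedestrian box are dropped before voting, so only static structure
    contributes to the map.
    """
    votes = defaultdict(Counter)
    for points, labels, boxes in sweeps:
        for p, label in zip(points, labels):
            if inside_any_box(p, boxes):
                continue  # dynamic pedestrian point: keep out of the static map
            votes[voxel_key(p, resolution)][label] += 1
    return {key: counts.most_common(1)[0][0] for key, counts in votes.items()}
```

Keeping the full vote counter per voxel, rather than only the current winner, lets later sweeps overturn an early mislabel.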

The resulting axis-aligned voxel grid has a configurable resolution (the benchmark uses $0.2$ m; resolutions as fine as $0.02$ m are supported) and covers $x\in[0.4,10.0]$ m, $y\in[-4.8,4.8]$ m, and $z\in[-1.0,3.8]$ m.
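To make the grid geometry concrete, the following sketch maps a point in the robot-local frame to a voxel index using the ranges stated above. The function names and the choice to return `None` for out-of-range points are assumptions for illustration.

```python
import math

# Spatial extent of the grid in metres, as stated for the benchmark.
RANGES = ((0.4, 10.0), (-4.8, 4.8), (-1.0, 3.8))


def grid_shape(resolution):
    """Number of voxels along x, y, z for a given resolution in metres."""
    return tuple(int(round((hi - lo) / resolution)) for lo, hi in RANGES)


def point_to_index(point, resolution):
    """Voxel index of a point, or None if it lies outside the grid extent."""
    shape = grid_shape(resolution)
    index = []
    for c, (lo, hi), n in zip(point, RANGES, shape):
        if not (lo <= c < hi):
            return None
        i = int(math.floor((c - lo) / resolution))
        index.append(min(i, n - 1))  # guard against float round-off at the upper edge
    return tuple(index)


print(grid_shape(0.2))  # (48, 48, 24)
```

At the benchmark resolution of $0.2$ m the extent of $9.6\times 9.6\times 4.8$ m therefore corresponds to a $48\times 48\times 24$ grid.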

3. Human Mesh Optimization Framework

Human occupancy is modeled per instance using full SMPL body meshes, optimized by a multistage procedure:

  1. Initial SMPL prediction: CLIFF, a human mesh recovery (HMR) network, regresses SMPL pose ($\theta$), shape ($\beta$), and global translation ($t_{\rm cam}$), yielding an initial mesh $\mathcal M_0$.
  2. Visibility filtering: The visible subset $\mathcal V\subset\mathcal M_0$ is selected by
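    • Backface culling: Retaining only front-facing triangles by thresholding $\mathbf n_f\cdot\mathbf d_f$, the dot product of each triangle normal with its viewing direction.
    • Keypoint occlusion: Discarding body parts with all 2D keypoints below a ViTPose confidence threshold $\tau_{\rm kp}$.
  3. Coarse ICP alignment: Rigid registration aligns filtered SMPL vertices $\{v_i\}\subset\mathcal V$ to LiDAR points $\{p_j\}$ by solving

$$\min_{R\in SO(3),\ t\in\mathbb R^3}\ \sum_i \min_j \lVert R v_i + t - p_j\rVert_2^2$$

adjusting global translation and orientation.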

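  4. Joint optimization: Full SMPL parameters are refined by minimizing a composite objective:

$$\mathcal L = \lambda_{\rm 2D}\mathcal L_{\rm 2D} + \lambda_{\rm cd}\mathcal L_{\rm cd} + \lambda_{\rm gmm}\mathcal L_{\rm gmm} + \lambda_{\rm hyp}\mathcal L_{\rm hyp} + \lambda_{\beta}\mathcal L_{\beta} + \lambda_{\rm occ}\mathcal L_{\rm occ}$$

where each $\lambda$ is a scalar weight and

    • $\mathcal L_{\rm 2D}$: 3D–2D joint reprojection loss
    • $\mathcal L_{\rm cd}$: symmetric Chamfer distance mesh–LiDAR alignment
    • $\mathcal L_{\rm gmm}$: Gaussian-mixture pose prior
    • $\mathcal L_{\rm hyp}$: anti-hyperextension penalty
    • $\mathcal L_{\beta}$: quadratic shape prior
    • $\mathcal L_{\rm occ}$: occlusion-aware pose consistency.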

Optimization is performed in PyTorch via gradient descent.
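The mesh-LiDAR alignment term in the objective is a symmetric Chamfer distance. A minimal brute-force sketch in plain Python is shown below to make the definition concrete; it assumes squared Euclidean distances averaged over each set, which is one common convention and may differ from the exact form used by the authors. A gradient-based version would compute the same quantity on PyTorch tensors.

```python
def _nearest_sq_dist(p, cloud):
    """Squared Euclidean distance from p to its nearest neighbour in cloud."""
    return min(sum((a - b) ** 2 for a, b in zip(p, q)) for q in cloud)


def symmetric_chamfer(mesh_vertices, lidar_points):
    """Symmetric Chamfer distance between two non-empty 3D point sets.

    Sums the mean nearest-neighbour squared distance in both directions,
    so the value is zero only when every point has an exact match.
    """
    mesh_to_lidar = sum(
        _nearest_sq_dist(v, lidar_points) for v in mesh_vertices
    ) / len(mesh_vertices)
    lidar_to_mesh = sum(
        _nearest_sq_dist(p, mesh_vertices) for p in lidar_points
    ) / len(lidar_points)
    return mesh_to_lidar + lidar_to_mesh
```

The brute-force search is quadratic in the number of points, which is acceptable for a per-pedestrian mesh subset but would normally be replaced by a batched tensor operation or a spatial index.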

4. Occupancy Grid Fusion and Labeling

All modalities—static semantic voxels, dynamic human meshes, and free/unknown space predictions—are fused into a common occupancy tensor in the robot’s local frame. Label priority dictates voxel annotation:

  1. Dynamic humans: Per-instance pedestrian ID with associated mesh occupancy.
  2. Static semantics: Cityscapes class assignment for non-pedestrian-occupied voxels.
  3. Free/unknown: Inherited directly from the OctoMap.

This unified occupancy grid supports both per-class semantic and per-instance dynamic interpretation.
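The label priority can be expressed as a small per-voxel rule. The sketch below is illustrative only: the tuple-based return value and the argument names are assumptions, not the dataset's storage format.

```python
def fuse_voxel(human_id, static_label, octomap_state):
    """Resolve one voxel's annotation by priority.

    human_id      : pedestrian instance ID if a human mesh occupies the voxel, else None
    static_label  : static semantic class if the static map marks it occupied, else None
    octomap_state : "free" or "unknown", used only when nothing occupies the voxel
    """
    if human_id is not None:
        return ("human", human_id)        # 1. dynamic humans take precedence
    if static_label is not None:
        return ("static", static_label)   # 2. then static semantics
    return (octomap_state, None)          # 3. otherwise free/unknown


# A voxel covered by both a mesh and a static label is annotated as human.
print(fuse_voxel(12, "vegetation", "free"))  # ('human', 12)
```

Checking `human_id` first guarantees that a pedestrian standing in front of static structure is never overwritten by the background class.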

5. Benchmark Tasks and Metrics

MobileOcc establishes two primary tasks:

Semantic Occupancy Prediction:

For $T$ historical frames, predict a dense tensor $\hat Y\in\{1,\dots,C\}^{X\times Y\times Z}$ of per-voxel class assignments,

$$\hat Y(i,j,k) = \arg\max_{c}\ p_c(i,j,k)$$

where $p_c(i,j,k)$ is the predicted probability for class $c$.
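A minimal sketch of the per-voxel class assignment follows, assuming the network output is available as a mapping from voxel index to a list of class probabilities. The dictionary layout and function names are hypothetical.

```python
def argmax_class(probs):
    """Index of the largest probability (first one wins on ties)."""
    return max(range(len(probs)), key=probs.__getitem__)


def predict_labels(prob_grid):
    """Map {voxel_index: [p_0, ..., p_{C-1}]} to {voxel_index: class_id}."""
    return {voxel: argmax_class(p) for voxel, p in prob_grid.items()}


example = {(0, 0, 0): [0.1, 0.7, 0.2], (1, 0, 0): [0.6, 0.3, 0.1]}
print(predict_labels(example))  # {(0, 0, 0): 1, (1, 0, 0): 0}
```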

Pedestrian Velocity Prediction:

For all human-occupied voxels, regress a 2D velocity vector $v=(v_x, v_y)$. The network outputs $\hat v$, supervised with L1 or L2 loss against ground-truth velocities.
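The velocity supervision can be sketched as an L1 loss restricted to human-occupied voxels. The flat-list layout, the boolean mask, and the choice to average over masked voxels are assumptions made for illustration.

```python
def masked_l1_velocity_loss(pred, target, human_mask):
    """Mean L1 error of 2D velocities over human-occupied voxels only.

    pred, target : sequences of (vx, vy) pairs, one per voxel
    human_mask   : sequence of booleans, True where a human occupies the voxel
    Returns 0.0 when no voxel is human-occupied.
    """
    total = 0.0
    count = 0
    for p, t, is_human in zip(pred, target, human_mask):
        if is_human:
            total += abs(p[0] - t[0]) + abs(p[1] - t[1])
            count += 1
    return total / count if count else 0.0
```

Masking keeps the large number of static and free voxels from dominating the loss, since their velocity is undefined rather than zero.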

Evaluation metrics are computed over 24,208 test frames and include:

| Task | Key metric(s) | Formula(s) |
|---|---|---|
| Occupancy | IoU, accuracy, mIoU | $\mathrm{IoU}=\frac{TP}{TP+FP+FN}$; $\mathrm{mIoU}=\frac{1}{C}\sum_{c}\mathrm{IoU}_c$ |
| Detection | F1, precision, recall, AP | $F_1=\frac{2PR}{P+R}$ |
| Panoptic | PQ, RQ, SQ, AP[…] | |
| Velocity | AVE-T, AVE-D, AVE-O | […] |
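For the occupancy metrics, the sketch below computes per-class IoU and mIoU from flattened prediction and ground-truth label lists, using the standard definition $\mathrm{IoU}=TP/(TP+FP+FN)$. Skipping classes that appear in neither prediction nor ground truth is an assumption; the benchmark's handling of absent classes is not stated here.

```python
def per_class_iou(pred, gt, num_classes):
    """IoU for each class; None for classes absent from both pred and gt."""
    ious = []
    for c in range(num_classes):
        tp = sum(1 for p, g in zip(pred, gt) if p == c and g == c)
        fp = sum(1 for p, g in zip(pred, gt) if p == c and g != c)
        fn = sum(1 for p, g in zip(pred, gt) if p != c and g == c)
        denom = tp + fp + fn
        ious.append(tp / denom if denom else None)
    return ious


def mean_iou(ious):
    """Average IoU over the classes that have a defined value."""
    valid = [v for v in ious if v is not None]
    return sum(valid) / len(valid) if valid else 0.0


pred = [0, 0, 1, 1, 2]
gt = [0, 1, 1, 1, 2]
ious = per_class_iou(pred, gt, num_classes=4)
print(ious[0], ious[2], ious[3])  # 0.5 1.0 None
```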

6. Baseline Methods and Implementation Details

MobileOcc adapts standardized occupancy and velocity frameworks:

  • Monocular:
    • BEVDet4D: Temporal BEV features, L1 velocity regression.
    • FlashOcc: Semantic occupancy via channel-to-height plugin, cross-entropy supervision.
    • Panoptic-FlashOcc: Adds instance-center head, panoptic-PQ loss.
    • Panoptic-FlashOcc-vel: Panoptic occupancy with L1 velocity head.
  • Stereo:
    • VoxFormer-T: Two-stage—stereo depth to coarse voxel, then upsampling—supervised by cross-entropy.

All methods use a $9.6\times 9.6\times 4.8$ m voxel grid volume, standard augmentations, and sequence lengths of 8 frames (monocular) or 4 alternating stereo inputs.

7. Experimental Results and Qualitative Analysis

Occupancy Prediction:

  • FlashOcc achieves the highest mIoU (≈31.8%); Panoptic-FlashOcc slightly trails (≈31.0%, with panoptic consistency); VoxFormer-T remains competitive for “pedestrian” class (≈31.9% IoU).

Per-class IoU (selected):

  • Pedestrian: FlashOcc ~32.5%, VoxFormer-T ~31.9%, Panoptic-FlashOcc ~31.8%.
  • Car: All methods >70% (noting low occurrence outdoors).

Panoptic Occupancy:

  • Panoptic-FlashOcc (8 frames): PQ=19.9%, relaxed PQ=28.1%, RQ=65.8%, SQ=32.6%, PQ[…]=42.5%, AP[…]=45.5%. BEVDet4D provides 2D-detection AP[…]=41.7%.

Pedestrian Velocity:

  • Panoptic-FlashOcc-vel: AVE-T=0.97 m/s, AVE-D=0.39 m/s, AVE-O=0.67 m/s, with mIoU ≈26.0%.
  • BEVDet4D: AVE-T=1.00 m/s, AVE-D=0.36 m/s.

Qualitative observations: Human meshes capture complex, non-rigid pedestrian poses in crowded scenes; occupancy grids at 0.02 m resolution clearly separate humans from background; the results are robust to varied lighting conditions; and velocity maps show plausible motion fields in constrained environments.

8. Significance and Future Directions

MobileOcc introduces a first-of-its-kind, human-aware semantic occupancy dataset tailored to mobile robotics in human-populated settings. Its mesh optimization framework, combining monocular priors and LiDAR through a unified multi-term objective,

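$$\mathcal L = \lambda_{\rm 2D}\mathcal L_{\rm 2D} + \lambda_{\rm cd}\mathcal L_{\rm cd} + \lambda_{\rm gmm}\mathcal L_{\rm gmm} + \lambda_{\rm hyp}\mathcal L_{\rm hyp} + \lambda_{\beta}\mathcal L_{\beta} + \lambda_{\rm occ}\mathcal L_{\rm occ}$$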

enables accurate deformable-body reconstructions. By providing standardized baselines and benchmarks for semantic occupancy and pedestrian velocity, MobileOcc enables progress towards fine-grained, dense human-aware spatial perception and illuminates persistent challenges for mobile robot navigation in dynamic environments (Kim et al., 21 Nov 2025).
