MobileOcc: Semantic 3D Occupancy for Mobile Robots

Updated 23 June 2026

MobileOcc is a semantic 3D occupancy dataset that focuses on human-aware spatial perception in crowded, pedestrian-rich environments using synchronized RGB and LiDAR data.
The dataset employs an end-to-end annotation pipeline that integrates static mapping, human mesh optimization, and multi-modal fusion to produce precise occupancy grids.
It establishes standardized benchmarks for semantic occupancy prediction and pedestrian velocity estimation, advancing research in dynamic, human-centric 3D perception.

MobileOcc is a large-scale semantic 3D occupancy dataset designed specifically for mobile robots operating in crowded, pedestrian-rich real-world environments. Unlike prior datasets focused on autonomous driving, MobileOcc emphasizes dense, human-aware spatial perception for mobile platforms. The dataset is constructed via an end-to-end annotation pipeline that integrates static-object occupancy, deformable human mesh modeling, and multi-modal fusion with synchronized RGB and LiDAR data. MobileOcc establishes standardized benchmarks for both semantic occupancy prediction and pedestrian velocity estimation, supporting reproducible research in dense 3D perception under challenging conditions (Kim et al., 21 Nov 2025).

1. Data Acquisition and Preprocessing Pipeline

MobileOcc is built on CODa (the UT Campus Object Dataset) and uses its synchronized RGB video and LiDAR sweeps. The annotation workflow has four phases: data preprocessing, static map construction, human mesh optimization, and final occupancy grid assembly. Preprocessing comprises the following steps:

2D Pedestrian Tracking: Per-frame bounding boxes with consistent temporal IDs are generated using YOLOX combined with OC-SORT.
Keypoint and Instance Mask Estimation: Human skeletons are predicted using ViTPose, while dense masks are produced by Mask2Former.
LiDAR-to-Image Association: Each LiDAR point is assigned to the nearest mask pixel at the same timestamp, facilitating multi-modal correspondence.
Semantic Segmentation: Static classes (e.g., road, vegetation, pole) are labeled per pixel by a Cityscapes-trained network, which allows dynamic and static elements to be separated.

All modalities are timestamp-aligned in a robot-local frame to support spatiotemporal consistency.
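The LiDAR-to-image association step can be illustrated with a minimal sketch. It assumes an undistorted pinhole camera, points already transformed into the camera frame, and an integer instance mask; the function name `associate_points_to_masks` and the `-1` "no pixel" code are hypothetical choices for illustration, not part of the MobileOcc tooling.

```python
import numpy as np

def associate_points_to_masks(points_cam, K, instance_mask):
    """Give each LiDAR point the instance ID of the pixel it projects to.

    points_cam:    (N, 3) XYZ in the camera frame, Z pointing forward.
    K:             (3, 3) pinhole intrinsics (assumed, no distortion model).
    instance_mask: (H, W) ints, 0 = background, k > 0 = pedestrian ID.
    Returns (N,) ints; -1 marks points behind the camera or off-image.
    """
    H, W = instance_mask.shape
    ids = np.full(len(points_cam), -1, dtype=np.int64)

    # Only points in front of the camera can be projected.
    in_front = points_cam[:, 2] > 1e-6
    uvw = (K @ points_cam[in_front].T).T
    uv = uvw[:, :2] / uvw[:, 2:3]

    # Round to the nearest pixel centre.
    cols = np.rint(uv[:, 0]).astype(np.int64)
    rows = np.rint(uv[:, 1]).astype(np.int64)

    inside = (cols >= 0) & (cols < W) & (rows >= 0) & (rows < H)
    idx = np.flatnonzero(in_front)[inside]
    ids[idx] = instance_mask[rows[inside], cols[inside]]
    return ids
```

Points labeled with a positive ID can then be grouped per pedestrian and passed to the mesh-fitting stage, while points labeled 0 are candidates for the static map.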

2. Static Environment Mapping

The static scene representation is constructed via:

3D Pedestrian Masking: A LiDAR-based 3D detector is applied per sweep. LiDAR points within detected pedestrian bounding boxes are excluded to prevent contamination of the static map by dynamic agents.
Octree-Based Fusion: The remaining static points are fused into an OctoMap using per-voxel semantic class votes, aggregating information over multiple sweeps [Hornung et al., 2013].
Voxel Categorization: Each voxel is labeled as "occupied" (with Cityscapes label), "free," or "unknown" based on OctoMap queries. This stratification underpins downstream occupancy annotation.
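The per-voxel semantic vote can be sketched in a few lines. This is a simplified stand-in that uses a flat voxel hash rather than OctoMap's octree and log-odds updates, and it does not perform the ray casting needed to distinguish "free" from "unknown"; `vote_voxel_labels` and its parameters are hypothetical names.

```python
import numpy as np

def vote_voxel_labels(points, labels, voxel_size, num_classes):
    """Majority-vote one semantic class per occupied voxel.

    points: (N, 3) static LiDAR points accumulated over several sweeps.
    labels: (N,) integer class index of each point, in [0, num_classes).
    Returns (voxel_indices, voxel_labels): (M, 3) integer voxel coordinates
    and the (M,) winning class for each of the M occupied voxels.
    """
    keys = np.floor(points / voxel_size).astype(np.int64)
    uniq, inverse = np.unique(keys, axis=0, return_inverse=True)
    inverse = inverse.reshape(-1)  # shape differs across NumPy versions

    # votes[m, c] counts the points of class c that fell into voxel m.
    votes = np.zeros((len(uniq), num_classes), dtype=np.int64)
    np.add.at(votes, (inverse, labels), 1)
    return uniq, votes.argmax(axis=1)
```

Ties are resolved in favor of the lowest class index, which is a property of `argmax` rather than a documented choice of the dataset.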

3. Human Mesh Optimization and Fusion

MobileOcc employs a novel multi-stage optimization to accurately reconstruct deformable human occupancy from fused monocular and LiDAR cues:

Initial SMPL Parameter Regression: The human mesh recovery (HMR) network CLIFF predicts the SMPL mesh parameters (joint rotations $\theta$, shape $\beta$, and global translation $t_{\rm cam}$), yielding an initial mesh $\mathcal{M}_0$.
Visibility Filtering: Only vertices on faces that are front-facing (determined by mesh normals and the camera origin) and unoccluded (according to ViTPose keypoint confidence) are retained, resulting in a visible set $\mathcal{V} \subset \mathcal{M}_0$.
Rigid ICP Alignment: The filtered mesh is registered to the corresponding LiDAR point cloud $\mathcal{P}$ using a least-squares SE(3) alignment:

$\mathbf{T}^* = \arg\min_{T \in SE(3)} \sum_{\mathbf{v} \in \mathcal{V}} \min_{\mathbf{p} \in \mathcal{P}} \| T\mathbf{v} - \mathbf{p} \|_2^2$

This step adjusts the global mesh pose but freezes the non-rigid components.
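The alignment objective above is the classic point-to-point ICP problem. A minimal sketch follows, assuming brute-force nearest-neighbor search (a KD-tree would normally be used for large clouds) and a closed-form Kabsch solve per iteration; `kabsch` and `rigid_icp` are illustrative names and are not taken from the MobileOcc code.

```python
import numpy as np

def kabsch(src, dst):
    """Least-squares R, t such that R @ src_i + t approximates dst_i."""
    src_c, dst_c = src.mean(axis=0), dst.mean(axis=0)
    H = (src - src_c).T @ (dst - dst_c)
    U, _, Vt = np.linalg.svd(H)
    # Flip the last axis if needed so that R is a proper rotation.
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = dst_c - R @ src_c
    return R, t

def rigid_icp(vertices, points, iters=30):
    """Align visible mesh vertices (V, 3) to LiDAR points (P, 3).

    Returns the accumulated rotation and translation that map the
    original vertices onto the point cloud.
    """
    R_total, t_total = np.eye(3), np.zeros(3)
    moved = vertices.copy()
    for _ in range(iters):
        # Inner min of the objective: closest LiDAR point per vertex.
        d2 = ((moved[:, None, :] - points[None, :, :]) ** 2).sum(-1)
        nearest = points[d2.argmin(axis=1)]
        # Outer argmin: best rigid transform for these correspondences.
        R, t = kabsch(moved, nearest)
        moved = moved @ R.T + t
        R_total, t_total = R @ R_total, R @ t_total + t
    return R_total, t_total
```

Because only a rotation and a translation are estimated, the SMPL pose and shape parameters stay untouched, which matches the "frozen non-rigid components" behavior described for this stage.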

Joint SMPL–LiDAR Optimization: All mesh parameters $(\beta, \theta, t_{\rm cam})$ are refined via gradient descent by minimizing:

$\mathcal{L}_{\rm total} = \mathcal{L}_J + \lambda_{3D}\mathcal{L}_{3D} + \lambda_\theta\mathcal{L}_\theta + \lambda_a\mathcal{L}_a + \lambda_\beta\mathcal{L}_\beta + \lambda_{\rm occ}\mathcal{L}_{\theta,\rm occ}$

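where $\mathcal{L}_J$ is the 2D joint reprojection loss, $\mathcal{L}_{3D}$ the Chamfer-distance mesh–LiDAR alignment term, $\mathcal{L}_\theta$ and $\mathcal{L}_\beta$ the pose and shape priors, $\mathcal{L}_a$ the anti-hyperextension term, and $\mathcal{L}_{\theta,\rm occ}$ the occlusion-consistency term. Each $\lambda$ weights its term relative to $\mathcal{L}_J$.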
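As a rough illustration of the 3D alignment term and of how the weighted sum is assembled, the sketch below computes a symmetric squared Chamfer distance and combines precomputed loss values. Whether the original objective uses a symmetric or one-directional Chamfer term is not stated here, so the symmetric form is an assumption; the dictionary keys and function names are hypothetical.

```python
import numpy as np

def chamfer_distance(a, b):
    """Symmetric squared Chamfer distance between point sets a (N, 3), b (M, 3)."""
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    # Mean closest-point distance in both directions.
    return d2.min(axis=1).mean() + d2.min(axis=0).mean()

def total_loss(terms, weights):
    """Combine loss values as L_J + sum_k lambda_k * L_k.

    terms:   dict of scalar losses; must contain "J" plus every key in weights.
    weights: dict of lambda values, e.g. keys "3D", "theta", "a", "beta", "occ".
    """
    return terms["J"] + sum(weights[k] * terms[k] for k in weights)
```

In a real optimizer these terms would be written in an autodiff framework so that gradients flow back to $(\beta, \theta, t_{\rm cam})$; the NumPy version only shows the arithmetic.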

After human mesh alignment, all spatial elements (the static LiDAR map and the optimized meshes) are fused into a voxel grid in the robot-local $(x, y, z)$ frame at a user-selectable resolution (benchmark: […] m). When several sources claim the same voxel, labels are assigned in priority order: dynamic human (instance ID), static (Cityscapes class), free, then unknown.
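The priority rule can be sketched by writing the sources into the grid from lowest to highest priority, so that later writes overwrite earlier ones. The label codes (`UNKNOWN`, `FREE`) and the `human_offset` used to keep instance IDs apart from class IDs are hypothetical conventions chosen for this example.

```python
import numpy as np

UNKNOWN, FREE = -1, 0  # hypothetical label codes

def fuse_grid(static_labels, free_mask, human_ids, human_offset=1000):
    """Fuse static, free, and human occupancy into one label grid.

    static_labels: (X, Y, Z) ints, > 0 where a static class occupies the voxel.
    free_mask:     (X, Y, Z) bools, True where the voxel was observed free.
    human_ids:     (X, Y, Z) ints, > 0 where an optimized mesh occupies the voxel.
    """
    grid = np.full(static_labels.shape, UNKNOWN, dtype=np.int64)

    # Lowest priority first; each later assignment overrides the previous one.
    grid[free_mask] = FREE
    static = static_labels > 0
    grid[static] = static_labels[static]
    human = human_ids > 0
    grid[human] = human_offset + human_ids[human]
    return grid
```

Voxels that no source touches keep the `UNKNOWN` code, which mirrors the last entry in the priority list.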

4. Benchmark Tasks and Baseline Frameworks

MobileOcc supports two principal tasks, targeting dense spatial reasoning and human-aware robot navigation:

Task	Output Structure	Supervision
Semantic Occupancy Prediction	Semantic class label per voxel	Voxel cross-entropy
Pedestrian Velocity Prediction	2D velocity vector per human-occupied voxel	L1/L2 loss on velocity

Semantic Occupancy Prediction: Given a history of input frames (monocular or stereo), the goal is to infer the discrete semantic label of every voxel in the visible scene. Per-voxel class probabilities are collapsed to a single label via $\arg\max$ over classes.
Pedestrian Velocity Prediction: For each human-occupied voxel, a 2D velocity vector $(v_x, v_y)$ is regressed, supervised by ground-truth flow.
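A minimal sketch of the two output heads at evaluation time follows: class probabilities are reduced to labels with an argmax, and the velocity loss is averaged only over human-occupied voxels. The array layouts (class axis first, velocity components last) and the function names are assumptions made for illustration.

```python
import numpy as np

def collapse_labels(probs):
    """probs: (C, X, Y, Z) class probabilities -> (X, Y, Z) integer labels."""
    return probs.argmax(axis=0)

def masked_l1_velocity_loss(pred_vel, gt_vel, human_mask):
    """Mean absolute error of (v_x, v_y) over human-occupied voxels.

    pred_vel, gt_vel: (X, Y, Z, 2) velocity fields in m/s.
    human_mask:       (X, Y, Z) bools, True where a pedestrian occupies the voxel.
    """
    if not human_mask.any():
        return 0.0
    return float(np.abs(pred_vel[human_mask] - gt_vel[human_mask]).mean())
```

Masking matters because most voxels contain no pedestrian; averaging over the whole grid would let the trivial zero-velocity prediction dominate the loss.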

Three representative baseline frameworks are adapted for these tasks:

Monocular:
- BEVDet4D: Uses temporal BEV features with an L1 velocity head.
- FlashOcc: Implements voxel-level cross-entropy with a channel-to-height plugin.
- Panoptic-FlashOcc: Augments with instance-center head and Panoptic-PQ loss.
- Panoptic-FlashOcc-vel: Further incorporates velocity regression.
Stereo:
- VoxFormer-T: Two-stage voxelization leveraging stereo depth, supervised with cross-entropy.

All models use a voxel grid of […] m resolution, with 8-frame histories (monocular) or 4 alternating frames (stereo).

5. Quantitative Evaluation Metrics

Performance is evaluated on a held-out test split of 24,208 annotated frames using the following metrics:

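Intersection-over-Union (IoU) per semantic class $c$, with $TP_c$, $FP_c$, and $FN_c$ the true-positive, false-positive, and false-negative voxel counts:

$\mathrm{IoU}_c = \frac{TP_c}{TP_c + FP_c + FN_c}$

Mean IoU (mIoU) over $C$ classes: $\mathrm{mIoU} = \frac{1}{C}\sum_{c=1}^{C} \mathrm{IoU}_c$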
Accuracy: Fraction of correctly labeled voxels.
F1 Score per class, with precision/recall definitions as in the source dataset.
Panoptic Quality (PQ): Stuff + thing + center-head metrics for Panoptic-FlashOcc.
Average Precision (AP) for pedestrian centers at distances […] m.
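Absolute Velocity Error (AVE), averaged over an evaluated voxel set $\mathcal{S}$, with $\hat{\mathbf{v}}_i$ the predicted and $\mathbf{v}_i$ the ground-truth velocity of voxel $i$:

$\mathrm{AVE} = \frac{1}{|\mathcal{S}|}\sum_{i \in \mathcal{S}} \lVert \hat{\mathbf{v}}_i - \mathbf{v}_i \rVert_2$

Reported as AVE-T (all GT voxels), AVE-D (true positive, […] m), and AVE-O (correctly classified voxels).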
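The voxel-level metrics can be computed with a short sketch like the one below. It assumes integer label grids for IoU and a Euclidean norm for the velocity error; the choice of norm and the function names are assumptions, and the three AVE variants differ only in which boolean mask is passed in.

```python
import numpy as np

def per_class_iou(pred, gt, num_classes):
    """IoU_c = TP / (TP + FP + FN) for each class; NaN if the class never occurs."""
    ious = np.full(num_classes, np.nan)
    for c in range(num_classes):
        tp = np.sum((pred == c) & (gt == c))
        fp = np.sum((pred == c) & (gt != c))
        fn = np.sum((pred != c) & (gt == c))
        denom = tp + fp + fn
        if denom > 0:
            ious[c] = tp / denom
    return ious

def mean_iou(ious):
    """Average IoU over the classes that are present (NaN entries are skipped)."""
    return float(np.nanmean(ious))

def average_velocity_error(pred_vel, gt_vel, mask):
    """Mean L2 velocity error over the voxels selected by mask.

    pred_vel, gt_vel: (..., 2) velocity fields; mask: matching boolean array.
    """
    err = np.linalg.norm(pred_vel[mask] - gt_vel[mask], axis=-1)
    return float(err.mean()) if err.size else float("nan")
```

For example, passing a mask of all ground-truth pedestrian voxels corresponds to the AVE-T setting, while restricting the mask to correctly classified voxels corresponds to AVE-O.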

6. Empirical Results and Analysis

6.1 Semantic Occupancy

On the standard semantic grid ([…] m, […] m), baseline performance is as follows:

FlashOcc (8 frames): mIoU ≈ 31.8% (highest overall).
Panoptic-FlashOcc: mIoU ≈ 31.0% (offers panoptic consistency with minimal decrease).
VoxFormer-T: Competitive pedestrian IoU, benefits from stereo depth cues.

Per-class IoU (pedestrian):

Method	Pedestrian IoU
FlashOcc	32.5%
VoxFormer-T	31.9%
Panoptic-FlashOcc	31.8%

For the "car" class, all methods exceed 70% IoU.

6.2 Panoptic Occupancy

Panoptic-FlashOcc (8-frame):

PQ = 19.9%
PQ† (relaxed) = 28.1%
RQ = 65.8%
SQ = 32.6%
Pedestrian-only: PQ^Ped = 42.5%, AP^Ped = 45.5%

BEVDet4D achieves 2D-detection AP^Ped = 41.7% (no panoptic prediction).

6.3 Pedestrian Velocity Prediction

Panoptic-FlashOcc-vel: AVE-T = 0.97 m/s, AVE-D = 0.39 m/s, AVE-O = 0.67 m/s; mIoU ≈ 26.0%
BEVDet4D: AVE-T = 1.00 m/s, AVE-D = 0.36 m/s

6.4 Qualitative Observations

High-resolution SMPL meshes capture non-rigid pedestrian poses in complex, occluded scenes.
Fine-grained occupancy grids (up to 0.02 m) sharply demarcate dynamic from static elements.
Occupancy prediction is robust across diverse lighting conditions (sunny, cloudy, night).
Velocity fields are spatially coherent, with plausible vector flows in tight, cluttered spaces.

7. Significance and Implications

MobileOcc provides a human-centric 3D occupancy dataset for robotic perception. Its pipeline, which fuses monocular SMPL priors with LiDAR in a joint optimization objective, yields physically grounded reconstructions of deformable human actors. The benchmarks show that although monocular methods (FlashOcc) attain the highest mIoU, stereo-based approaches retain advantages for certain classes, especially under occlusion.

The benchmarks highlight both the potential and the limitations of current 3D perception algorithms for accurate, instance-specific, temporally aware human modeling in dense crowds. MobileOcc's mesh optimization and multi-modal fusion pipeline provides a basis for further research on navigation and scene understanding for mobile robots in dynamic, human-rich environments (Kim et al., 21 Nov 2025).

References (1)

MobileOcc: A Human-Aware Semantic Occupancy Dataset for Mobile Robots (2025)
