- The paper presents a real-time mapping system that integrates geometric SLAM with YOLO-based object detection for quadruped robots.
- It employs a three-stage approach of geometric mapping, object projection, and semantic association to maintain a compact set of persistent object instances.
- Experimental results on a Spot robot in indoor environments demonstrate robust performance and efficient resource usage despite sensor alignment challenges.
Online Object-Level Semantic Mapping for Quadrupeds in Real-World Environments
Introduction and Motivation
This paper presents a real-time semantic mapping system for quadruped robots, specifically targeting object-level mapping in indoor environments. The motivation stems from the limitations of geometric-only maps in object-centric navigation tasks, where planners require semantic information to execute goals such as "find the door" or "go to the toolbox." Quadrupeds such as the Boston Dynamics Spot offer superior mobility over complex terrain compared to wheeled or bipedal platforms, making them suitable candidates for deploying advanced mapping systems in real-world scenarios.
Figure 1: Spot platform with sensor payload, including RGB-D and LiDAR for mapping and detection.
System Architecture and Methodology
The proposed system integrates multiple sensing modalities and lightweight association logic to maintain persistent object instances in a global map. The architecture consists of three main components:
- Geometric Mapping: A 2D occupancy grid is constructed online using SLAM Toolbox, fusing 2D LiDAR scans with visual-inertial odometry from the Intel RealSense T265. Static extrinsic calibration ensures consistent transformation between sensor frames and the robot body.
- Object Detection and Projection: RGB-D data from the Intel RealSense D435 is processed using YOLOv11 for object detection. Detected bounding boxes are filtered by per-class confidence thresholds and projected into the map using depth information and calibrated transforms. Only the yaw component is retained for 2D mapping, with roll and pitch set to zero (see the projection sketch after this list).
- Semantic Layer (Association and Memory): The semantic layer merges near-duplicate detections within a frame and associates repeated detections across frames using a nearest-neighbor search in the map. A short-term buffer and a long-term memory list are maintained: objects are promoted to the long-term list after repeated, confident detections at consistent positions. This avoids per-frame duplication and keeps object instances stable even when they leave the field of view (a sketch of this association logic also follows below).
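To make the projection step concrete, the following is a minimal sketch, not the authors' implementation: the intrinsic matrix `K`, the `T_map_cam` transform (obtained in practice from the calibrated TF tree), and the function signature are assumptions introduced here for illustration.

```python
import numpy as np

def project_detection(bbox, depth_m, K, T_map_cam):
    """Back-project a bounding-box center into the map frame (illustrative).

    bbox      : (u_min, v_min, u_max, v_max) in pixels
    depth_m   : depth sampled at the box center, in metres
    K         : 3x3 camera intrinsic matrix
    T_map_cam : 4x4 homogeneous transform from camera frame to map frame
    """
    u = 0.5 * (bbox[0] + bbox[2])
    v = 0.5 * (bbox[1] + bbox[3])
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
    # Pinhole back-projection: pixel coordinates plus depth give a 3D point.
    p_cam = np.array([(u - cx) * depth_m / fx,
                      (v - cy) * depth_m / fy,
                      depth_m,
                      1.0])
    p_map = T_map_cam @ p_cam
    # Retain only yaw for the 2D map; roll and pitch are discarded.
    yaw = np.arctan2(T_map_cam[1, 0], T_map_cam[0, 0])
    return p_map[0], p_map[1], yaw
```

Anchoring the pose to yaw alone keeps object instances consistent with the planar occupancy grid produced by the SLAM layer.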
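The association and memory logic can be sketched in the same spirit. The merge radius, promotion threshold, and running-average update are illustrative assumptions; the paper does not report the exact values or update rule.

```python
import math

MERGE_RADIUS = 0.5   # metres; assumed value for nearest-neighbor gating
PROMOTE_HITS = 5     # assumed number of sightings before long-term promotion

class SemanticLayer:
    def __init__(self):
        self.short_term = []  # tentative instances: {"x", "y", "cls", "hits"}
        self.long_term = []   # confirmed, persistent instances

    def _nearest(self, objects, x, y, cls):
        # Nearest same-class instance within the merge radius, if any.
        best, best_d = None, MERGE_RADIUS
        for obj in objects:
            if obj["cls"] != cls:
                continue
            d = math.hypot(obj["x"] - x, obj["y"] - y)
            if d < best_d:
                best, best_d = obj, d
        return best

    def update(self, x, y, cls):
        """Associate one projected detection with existing instances."""
        match = (self._nearest(self.long_term, x, y, cls)
                 or self._nearest(self.short_term, x, y, cls))
        if match is None:
            self.short_term.append({"x": x, "y": y, "cls": cls, "hits": 1})
            return
        # Running average keeps the stored position stable across frames.
        n = match["hits"]
        match["x"] = (match["x"] * n + x) / (n + 1)
        match["y"] = (match["y"] * n + y) / (n + 1)
        match["hits"] = n + 1
        # Promote after repeated, confident detections at a consistent position.
        if match in self.short_term and match["hits"] >= PROMOTE_HITS:
            self.short_term.remove(match)
            self.long_term.append(match)
```

Because confirmed instances live in `long_term`, they persist even when the object leaves the camera's field of view, matching the behavior described above.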
Figure 2: System overview showing sensor fusion, object detection, and semantic layer integration.
Figure 3: Association pipeline for confirmed objects, detailing short-term and long-term memory logic.
Experimental Results
The system was deployed on a Spot robot equipped with the described sensor payload. In a laboratory environment containing two people and two chairs, the robot was teleoperated to traverse the space. The SLAM system generated a 2D occupancy map, while the semantic layer projected and tracked object detections in real time. Confirmed objects were visualized in RViz as labeled cubes, with hit counts indicating repeated sightings.
The semantic layer demonstrated robust association, maintaining persistent object instances across viewpoint changes and asynchronous sensor streams. Near-duplicate detections were effectively suppressed, resulting in a compact and uncluttered object map. The system tracked only the "person" and "chair" classes, consistent with the experimental setup.

Figure 4: RViz output and lab scene, showing tracked objects and experimental environment.
Performance metrics indicate that the semantic layer operates efficiently within the computational constraints of the onboard NUC, with stable frame rates across detection, mapping, and visualization streams. The YOLO stream publishes only when detections are present, further optimizing resource usage.
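The detection-gated publishing amounts to a simple guard in the detection callback. A minimal sketch, with the publisher interface and the `to_msg` converter as assumed placeholders:

```python
def maybe_publish(publisher, detections, to_msg):
    """Publish only when the current frame actually contains detections.

    publisher  : any object exposing .publish(msg), e.g. a ROS publisher
    detections : list of per-frame detections from the YOLO wrapper
    to_msg     : callable converting the list into the outgoing message type
    """
    if not detections:
        return  # empty frames generate no messages or downstream callbacks
    publisher.publish(to_msg(detections))
```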
Figure 5: Frame rates and message rates for camera, detection, and semantic mapping streams.
Limitations and Implementation Considerations
Several limitations were identified:
- Depth Misalignment: The D435 depth measurements are not perfectly aligned with the LiDAR scan plane, introducing pose bias in object placement.
- Geometric-Only Association: The system relies solely on geometric proximity for association, without leveraging appearance cues or feature descriptors.
- Detector Center Reliance: Object positions are anchored at the centers of detected bounding boxes, which may not correspond to true object centroids.
- Indoor-Only Evaluation: Experiments were limited to indoor environments; generalization to outdoor or more cluttered settings remains untested.
From an implementation perspective, the system is designed for real-time operation on resource-constrained hardware. The avoidance of heavy 3D fusion and vision-language scoring loops ensures low latency and power consumption, making it suitable for deployment on mobile robots with limited compute.
Implications and Future Directions
The presented methodology enables planners to query object-level semantic information in real time, facilitating object-centric navigation and task execution. The lightweight association logic and memory management are well-suited for onboard deployment in industrial, search-and-rescue, and laboratory scenarios.
Future work should address depth alignment issues, incorporate appearance-based association for improved robustness, and extend evaluation to more diverse environments. A logical next step is to integrate the semantic layer with an autonomous navigation stack for goal-oriented planning. Additionally, expanding the system to support open-vocabulary detection and dynamic object classes would enhance flexibility and applicability.
Conclusion
This paper introduces a practical, online semantic mapping system for quadruped robots, combining geometric SLAM with object-level detection and association. The approach maintains persistent, queryable object instances in a global map, supporting real-time operation under computational constraints. While limitations exist in depth alignment and association logic, the system provides a solid foundation for object-centric navigation and task planning in real-world environments.