MaskFusion: Real-Time Recognition, Tracking and Reconstruction of Multiple Moving Objects (1804.09194v2)

Published 24 Apr 2018 in cs.CV and cs.RO

Abstract: We present MaskFusion, a real-time, object-aware, semantic and dynamic RGB-D SLAM system that goes beyond traditional systems which output a purely geometric map of a static scene. MaskFusion recognizes, segments and assigns semantic class labels to different objects in the scene, while tracking and reconstructing them even when they move independently from the camera. As an RGB-D camera scans a cluttered scene, image-based instance-level semantic segmentation creates semantic object masks that enable real-time object recognition and the creation of an object-level representation for the world map. Unlike previous recognition-based SLAM systems, MaskFusion does not require known models of the objects it can recognize, and can deal with multiple independent motions. MaskFusion takes full advantage of using instance-level semantic segmentation to enable semantic labels to be fused into an object-aware map, unlike recent semantics enabled SLAM systems that perform voxel-level semantic segmentation. We show augmented-reality applications that demonstrate the unique features of the map output by MaskFusion: instance-aware, semantic and dynamic.

Citations (329)

Summary

  • The paper introduces a real-time RGB-D SLAM system that integrates Mask-RCNN-based instance segmentation to track and reconstruct multiple moving objects.
  • The methodology synchronizes SLAM and semantic segmentation to maintain dynamic, object-aware maps with reduced trajectory errors and enhanced 3D reconstruction accuracy.
  • The results demonstrate improved performance in dynamic scenes, offering practical benefits for mobile robotics and augmented reality through precise object tracking.

MaskFusion: A Novel Approach to Real-Time Dynamic SLAM

The paper "MaskFusion: Real-Time Recognition, Tracking and Reconstruction of Multiple Moving Objects" addresses two limitations of traditional visual Simultaneous Localization and Mapping (SLAM) systems: poor handling of dynamic environments and a lack of semantic awareness. It introduces MaskFusion, a real-time RGB-D SLAM system that goes beyond static scene mapping by recognizing, segmenting, and tracking multiple independently moving objects without predefined object models.

The primary contribution of MaskFusion is its ability to recognize and track independently moving objects within a scene while reconstructing their geometry and assigning them semantic labels. Unlike prior systems limited to voxel-level semantic segmentation, MaskFusion uses instance-level semantic segmentation from Mask-RCNN and fuses the resulting labels into an object-aware map. This allows the system to distinguish individual object instances across 80 object classes.

The system's architecture addresses two key deficits of traditional SLAM systems: the assumption of a static environment and the absence of semantic context in the map. MaskFusion's ability to handle multiple independent motions and dynamically update object-aware maps represents significant progress in SLAM research, and it makes the system practical for robotic navigation and augmented reality applications.

Real-time performance is achieved by running the SLAM and masking components concurrently and synchronizing their results. The semantic segmentation operates asynchronously on dedicated hardware, so the SLAM component can maintain dynamic maps at high frame rates, which is essential for applications involving fast motion or interaction with multiple objects. Segmentation precision is improved by combining the semantic output with geometric cues, which mitigates imperfections such as boundary leaks in the instance masks produced by Mask-RCNN.
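To make the idea of combining semantic masks with geometric cues concrete, the following is a minimal sketch, not the authors' implementation. It assumes the depth frame has already been over-segmented into geometric components (for example, along depth discontinuities), and it assigns each whole component to the instance that covers most of it. The function name `refine_masks`, the 2D-list data layout, the `BACKGROUND` label, and the `min_overlap` value of 0.5 are all illustrative assumptions.

```python
from collections import Counter, defaultdict

BACKGROUND = 0  # assumed label for "no object" in the instance mask


def refine_masks(components, instance_mask, min_overlap=0.5):
    """Snap instance labels to geometric component boundaries.

    components:    2D list of geometric component ids, one per pixel.
    instance_mask: 2D list of instance ids from the segmentation network,
                   with BACKGROUND meaning no detection.
    min_overlap:   hypothetical threshold; a component is given to an
                   instance only if that instance covers at least this
                   fraction of the component's pixels.
    Returns a 2D list of refined instance ids.
    """
    # Count, for every geometric component, how many of its pixels
    # carry each instance label.
    votes = defaultdict(Counter)
    for comp_row, inst_row in zip(components, instance_mask):
        for comp, inst in zip(comp_row, inst_row):
            votes[comp][inst] += 1

    # Pick the dominant non-background instance per component.
    assignment = {}
    for comp, counter in votes.items():
        total = sum(counter.values())
        best, best_count = BACKGROUND, 0
        for inst, count in counter.items():
            if inst != BACKGROUND and count > best_count:
                best, best_count = inst, count
        covered = best_count / total
        assignment[comp] = best if covered >= min_overlap else BACKGROUND

    # Relabel every pixel with the label of its component.
    return [[assignment[comp] for comp in row] for row in components]
```

With `components = [[1, 1, 2, 2], [1, 1, 2, 2]]` and `instance_mask = [[7, 7, 7, 0], [7, 7, 0, 0]]`, the single pixel of instance 7 that leaks into component 2 is dropped, because instance 7 covers only a quarter of that component. Working at the component level is what lets a geometric edge override a slightly misplaced mask boundary.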

Experimental evaluations on established benchmarks with dynamic scenes show that MaskFusion is robust in comparison with existing SLAM systems, particularly in highly dynamic environments. It achieves lower trajectory errors and more accurate tracking in the presence of moving objects. Quantitative assessments also show competitive 3D reconstruction accuracy against ground truth models, indicating that MaskFusion maintains precise and consistent tracks of recognized objects.
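Trajectory error in SLAM evaluations is commonly summarized as the root-mean-square of the translational difference between estimated and ground truth camera positions. The sketch below shows that calculation under stated assumptions: the two trajectories are already time-associated and expressed in the same coordinate frame (the rigid alignment step that benchmarks usually perform first is omitted), and the function name `ate_rmse` is illustrative. The summary does not state which exact error metric the paper reports.

```python
import math


def ate_rmse(estimated, ground_truth):
    """RMSE of translational error between two camera trajectories.

    Both arguments are equal-length lists of (x, y, z) positions.
    Assumes the poses are already matched by timestamp and aligned
    to a common frame; no alignment is performed here.
    """
    if not estimated or len(estimated) != len(ground_truth):
        raise ValueError("trajectories must be non-empty and equally long")

    # Squared Euclidean distance between each pair of positions.
    squared_errors = [
        sum((e - g) ** 2 for e, g in zip(est_pos, gt_pos))
        for est_pos, gt_pos in zip(estimated, ground_truth)
    ]
    return math.sqrt(sum(squared_errors) / len(squared_errors))
```

Because the errors are squared before averaging, a few frames with large drift (for example, when the tracker locks onto a moving object instead of the background) dominate the result, which is why this kind of metric is sensitive to dynamic scenes.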

The paper also evaluates the quality of MaskFusion's object segmentation using Intersection over Union (IoU) against ground truth annotations. Combining Mask-RCNN with geometric segmentation yields improved segmentation results across the tested sequences.
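For reference, IoU for a pair of binary masks is the number of pixels set in both masks divided by the number set in either. The following is a minimal sketch with masks represented as equal-sized 2D lists; the function name `mask_iou` and the convention of returning 1.0 for two empty masks are assumptions made for illustration.

```python
def mask_iou(pred, truth):
    """Intersection over Union of two binary masks (equal-sized 2D lists)."""
    if len(pred) != len(truth) or any(
        len(p_row) != len(t_row) for p_row, t_row in zip(pred, truth)
    ):
        raise ValueError("masks must have the same shape")

    intersection = 0
    union = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            p, t = bool(p), bool(t)
            intersection += p and t  # pixel set in both masks
            union += p or t          # pixel set in at least one mask

    # Assumed convention: two empty masks agree perfectly.
    return intersection / union if union else 1.0
```

For example, `mask_iou([[1, 1, 0], [0, 1, 0]], [[1, 0, 0], [0, 1, 1]])` has 2 pixels in the intersection and 4 in the union, giving 0.5.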

On the theoretical side, this research demonstrates the feasibility and advantages of integrating semantic understanding with geometric reconstruction in a real-time dynamic SLAM framework. On the practical side, it applies to domains that require real-time object tracking and scene mapping, including mobile robotics and context-aware augmented reality, where adapting to dynamic changes is crucial.

Future work could extend MaskFusion's framework by addressing its dependence on the fixed set of object classes defined by the MS-COCO dataset and by improving the handling of non-rigid objects. Further advances could include online learning mechanisms that recognize novel objects or classes encountered in a new environment.

In conclusion, MaskFusion offers an innovative approach to dynamic SLAM. By integrating real-time semantic segmentation with robust reconstruction and tracking algorithms, it sets a precedent for future research in adaptable, semantic SLAM systems.