
BoxFusion: Reconstruction-Free Open-Vocabulary 3D Object Detection via Real-Time Multi-View Box Fusion (2506.15610v1)

Published 18 Jun 2025 in cs.CV

Abstract: Open-vocabulary 3D object detection has gained significant interest due to its critical applications in autonomous driving and embodied AI. Existing detection methods, whether offline or online, typically rely on dense point cloud reconstruction, which imposes substantial computational overhead and memory constraints, hindering real-time deployment in downstream tasks. To address this, we propose a novel reconstruction-free online framework tailored for memory-efficient and real-time 3D detection. Specifically, given streaming posed RGB-D video input, we leverage Cubify Anything as a pre-trained visual foundation model (VFM) for single-view 3D object detection by bounding boxes, coupled with CLIP to capture open-vocabulary semantics of detected objects. To fuse all detected bounding boxes across different views into a unified one, we employ an association module for correspondences of multi-views and an optimization module to fuse the 3D bounding boxes of the same instance predicted in multi-views. The association module utilizes 3D Non-Maximum Suppression (NMS) and a box correspondence matching module, while the optimization module uses an IoU-guided efficient random optimization technique based on particle filtering to enforce multi-view consistency of the 3D bounding boxes while minimizing computational complexity. Extensive experiments on ScanNetV2 and CA-1M datasets demonstrate that our method achieves state-of-the-art performance among online methods. Benefiting from this novel reconstruction-free paradigm for 3D object detection, our method exhibits great generalization abilities in various scenarios, enabling real-time perception even in environments exceeding 1000 square meters.

Summary

  • The paper introduces BoxFusion, a novel reconstruction-free framework for real-time open-vocabulary 3D object detection that fuses multi-view bounding boxes using visual foundation models.
  • BoxFusion achieves state-of-the-art performance among online methods, detecting objects at over 20 FPS with only 7GB GPU memory, demonstrating significant efficiency improvements over reconstruction-based approaches.
  • This approach enables efficient 3D object detection in memory-constrained environments, enhancing real-world applications like autonomous driving and embodied AI where real-time performance is crucial.

An Expert Analysis of "BoxFusion: Reconstruction-Free Open-Vocabulary 3D Object Detection via Real-Time Multi-View Box Fusion"

The paper "BoxFusion: Reconstruction-Free Open-Vocabulary 3D Object Detection via Real-Time Multi-View Box Fusion" addresses a critical challenge in 3D object detection—specifically, the computational and memory burdens associated with dense point cloud reconstruction. Through the introduction of a novel reconstruction-free framework, the authors aim to facilitate real-time 3D object detection in memory-constrained environments. This approach has significant implications for applications in autonomous driving and embodied AI, where efficiency and adaptability to varied object types are essential.

Methodological Overview

The paper's primary contribution lies in proposing a novel framework for detecting 3D objects without reconstructing dense point clouds. The framework leverages a visual foundation model (VFM), Cubify Anything, combined with CLIP for capturing open-vocabulary semantics. Key components of this method include:

  1. Single-view 3D Object Detection: Utilizing VFMs for the initial bounding box proposal in single views, the system efficiently handles streaming RGB-D video inputs without requiring extensive reconstruction of scenes.
  2. Multi-View Box Fusion: A distinct feature of this approach is the fusion of detected bounding boxes from different views into a unified representation. This is achieved through an association module that involves 3D Non-Maximum Suppression (NMS) and a box correspondence matching module. The process ensures that the object's real-world scale, position, and semantics are consistently captured.
  3. Optimization Module: An Intersection-over-Union (IoU)-guided random optimization based on particle filtering refines the fused 3D bounding boxes to enforce consistency across views, keeping both memory use and computational cost low.
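To make the association and refinement steps concrete, the sketch below implements a simplified version of the pipeline described above: axis-aligned 3D IoU, greedy 3D NMS for associating overlapping detections, and a random-perturbation refinement that scores candidates by their mean IoU against the per-view observations. This is an illustrative reconstruction, not the authors' implementation — the paper's boxes are oriented rather than axis-aligned, and its particle-filter details (proposal distribution, resampling schedule) are not reproduced here; function names and hyperparameters are this sketch's own assumptions.

```python
import numpy as np

def iou_3d(a, b):
    """Axis-aligned 3D IoU between boxes given as (cx, cy, cz, dx, dy, dz)."""
    a_min, a_max = a[:3] - a[3:] / 2, a[:3] + a[3:] / 2
    b_min, b_max = b[:3] - b[3:] / 2, b[:3] + b[3:] / 2
    inter = np.clip(np.minimum(a_max, b_max) - np.maximum(a_min, b_min), 0, None)
    inter_vol = inter.prod()
    union = a[3:].prod() + b[3:].prod() - inter_vol
    return inter_vol / union if union > 0 else 0.0

def nms_3d(boxes, scores, iou_thresh=0.25):
    """Greedy 3D NMS: keep the highest-scoring box, suppress boxes that
    overlap it above iou_thresh, repeat on the remainder."""
    order = np.argsort(scores)[::-1]
    keep = []
    while order.size:
        i = order[0]
        keep.append(int(i))
        rest = order[1:]
        ious = np.array([iou_3d(boxes[i], boxes[j]) for j in rest])
        order = rest[ious < iou_thresh]
    return keep

def refine_box(observations, n_particles=256, n_iters=10, sigma=0.05, seed=0):
    """IoU-guided random search (a crude stand-in for particle filtering):
    perturb the current estimate and keep the candidate that maximizes the
    mean IoU against all per-view observations of the same instance."""
    rng = np.random.default_rng(seed)
    best = observations.mean(axis=0)
    best_score = np.mean([iou_3d(best, o) for o in observations])
    for _ in range(n_iters):
        cands = best + rng.normal(0.0, sigma, size=(n_particles, 6))
        cands[:, 3:] = np.abs(cands[:, 3:])  # keep box sizes positive
        scores = np.array(
            [np.mean([iou_3d(c, o) for o in observations]) for c in cands]
        )
        if scores.max() > best_score:
            best, best_score = cands[scores.argmax()], scores.max()
    return best, best_score
```

In this toy setting, two nearby detections of the same object collapse to one kept box under NMS, and the refinement step nudges the fused box toward the configuration most consistent with both observations.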

Results and Evaluation

The proposed method, BoxFusion, outperforms contemporary online methods as demonstrated in extensive experiments conducted on the ScanNetV2 and CA-1M datasets. Particular strengths are noted in runtime efficiency and adaptability to a broad range of object types, validating the paper's emphasis on eliminating reconstruction for real-time performance. The reported system achieves detection at over 20 frames per second while maintaining significant memory efficiency, using only 7GB of GPU memory even in complex environments.

Implications and Future Work

From a theoretical perspective, this research advances the understanding of 3D object detection in constrained environments and provides a scalable architecture that may influence future developments in the field. Practically, the reduced computational demands and real-time capabilities enhance the applicability of this method in real-world scenarios, particularly in autonomous navigation systems where rapid object detection is crucial.

Questions remain concerning the scalability of the box fusion technique and its performance in more complex environments with moving objects or dynamic scenes. Further research could explore the integration of this technique with other sensory data, such as LiDAR, to enhance robustness and accuracy.

Conclusion

The "BoxFusion" framework represents a significant step towards efficient and adaptable 3D object detection, bypassing the limitations of dense point clouds. By marrying state-of-the-art foundation models with innovative multi-view strategies, the paper underscores the potential for advanced detection capabilities in real-time applications, setting the stage for future developments in AI-assisted perception systems. As this field progresses, methodologies like BoxFusion could become integral to smart city surveillance systems, interactive robotics, and beyond, where the demand for real-time, open-vocabulary object recognition continues to grow.
