3D-GRES: Generalized 3D Referring Expression Segmentation (2407.20664v2)
Abstract: 3D Referring Expression Segmentation (3D-RES) is dedicated to segmenting a specific instance within a 3D space based on a natural language description. However, current approaches are limited to segmenting a single target, restricting the versatility of the task. To overcome this limitation, we introduce Generalized 3D Referring Expression Segmentation (3D-GRES), which extends the capability to segment any number of instances based on natural language instructions. In addressing this broader task, we propose the Multi-Query Decoupled Interaction Network (MDIN), designed to break down multi-object segmentation tasks into simpler, individual segmentations. MDIN comprises two fundamental components: Text-driven Sparse Queries (TSQ) and Multi-object Decoupling Optimization (MDO). TSQ generates sparse point cloud features distributed over key targets as the initialization for queries. Meanwhile, MDO is tasked with assigning each target in multi-object scenarios to different queries while maintaining their semantic consistency. To adapt to this new task, we build a new dataset, namely Multi3DRes. Our comprehensive evaluations on this dataset demonstrate substantial enhancements over existing models, thus charting a new path for intricate multi-object 3D scene comprehension. The benchmark and code are available at https://github.com/sosppxo/MDIN.
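As a rough illustration of the decoupling idea described in the abstract, the sketch below treats each query as responsible for at most one target and forms the final multi-object mask as the union of the per-query masks that are kept. This is a minimal sketch under stated assumptions, not the MDIN implementation: the function name `merge_query_masks`, the per-query confidence scores, and the `keep_threshold` parameter are all hypothetical and do not come from the paper.

```python
from typing import List


def merge_query_masks(
    masks: List[List[bool]],
    scores: List[float],
    keep_threshold: float = 0.5,  # hypothetical cut-off, not from the paper
) -> List[bool]:
    """Union the per-query point masks whose score reaches the threshold.

    masks[q][i] is True if query q claims point i; scores[q] is an assumed
    confidence that query q corresponds to a referred target.
    """
    if not masks:
        return []
    num_points = len(masks[0])
    merged = [False] * num_points
    for mask, score in zip(masks, scores):
        if score < keep_threshold:
            continue  # this query is treated as not matching any target
        for i, inside in enumerate(mask):
            merged[i] = merged[i] or inside
    return merged


# Toy scene with 6 points and 3 queries; the third query is low-confidence.
masks = [
    [True, True, False, False, False, False],
    [False, False, False, True, True, False],
    [False, False, True, False, False, False],
]
scores = [0.9, 0.8, 0.1]
final_mask = merge_query_masks(masks, scores)
```

In this toy setup, two queries are kept and their masks are merged into one multi-object prediction. If every query falls below the threshold, the result is an empty mask, which is one simple way to represent an expression that refers to no object in the scene.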
Authors: Changli Wu, Yihang Liu, Jiayi Ji, Yiwei Ma, Haowei Wang, Gen Luo, Henghui Ding, Xiaoshuai Sun, Rongrong Ji