
3D-GRES: Generalized 3D Referring Expression Segmentation (2407.20664v2)

Published 30 Jul 2024 in cs.CV

Abstract: 3D Referring Expression Segmentation (3D-RES) is dedicated to segmenting a specific instance within a 3D space based on a natural language description. However, current approaches are limited to segmenting a single target, restricting the versatility of the task. To overcome this limitation, we introduce Generalized 3D Referring Expression Segmentation (3D-GRES), which extends the capability to segment any number of instances based on natural language instructions. In addressing this broader task, we propose the Multi-Query Decoupled Interaction Network (MDIN), designed to break down multi-object segmentation tasks into simpler, individual segmentations. MDIN comprises two fundamental components: Text-driven Sparse Queries (TSQ) and Multi-object Decoupling Optimization (MDO). TSQ generates sparse point cloud features distributed over key targets as the initialization for queries. Meanwhile, MDO is tasked with assigning each target in multi-object scenarios to different queries while maintaining their semantic consistency. To adapt to this new task, we build a new dataset, namely Multi3DRes. Our comprehensive evaluations on this dataset demonstrate substantial enhancements over existing models, thus charting a new path for intricate multi-object 3D scene comprehension. The benchmark and code are available at https://github.com/sosppxo/MDIN.
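
The abstract says MDO assigns each target in a multi-object scene to a different query. The snippet below is only a minimal sketch of that one-to-one idea, not the authors' implementation: it assumes each query and each target can be summarized by a 3D position, and it substitutes a simple greedy nearest-pair rule for whatever assignment MDIN actually uses. The function name `assign_targets_to_queries` and the sample coordinates are hypothetical.

```python
from math import dist


def assign_targets_to_queries(query_positions, target_centers):
    """Greedy stand-in for a one-to-one target-to-query assignment.

    Assumption (not from the paper): queries and targets are compared by
    Euclidean distance between 3D positions, closest pairs matched first.
    Returns a dict mapping target index -> query index.
    """
    # Enumerate every (distance, target, query) pair, closest first.
    pairs = sorted(
        (dist(q, t), ti, qi)
        for qi, q in enumerate(query_positions)
        for ti, t in enumerate(target_centers)
    )
    assignment = {}
    used_queries = set()
    for _, ti, qi in pairs:
        # Skip pairs whose target or query has already been matched,
        # so no two targets ever share a query.
        if ti in assignment or qi in used_queries:
            continue
        assignment[ti] = qi
        used_queries.add(qi)
    return assignment


# Hypothetical example: three queries, two referred targets.
queries = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0), (5.0, 5.0, 0.0)]
targets = [(1.9, 0.1, 0.0), (0.2, 0.0, 0.0)]
result = assign_targets_to_queries(queries, targets)
print(result)
```

Each target ends up with its own query, and the third query stays unassigned, which is the decoupling behavior the abstract describes at a high level. A greedy rule is not guaranteed to minimize total distance; an optimal bipartite matching would be the usual alternative.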

Authors (9)
  1. Changli Wu
  2. Yihang Liu
  3. Jiayi Ji
  4. Yiwei Ma
  5. Haowei Wang
  6. Gen Luo
  7. Henghui Ding
  8. Xiaoshuai Sun
  9. Rongrong Ji