Mono3DVG: 3D Visual Grounding in Monocular Images (2312.08022v1)

Published 13 Dec 2023 in cs.CV

Abstract: We introduce a novel task of 3D visual grounding in monocular RGB images using language descriptions with both appearance and geometry information. Specifically, we build a large-scale dataset, Mono3DRefer, which contains 3D object targets with their corresponding geometric text descriptions, generated by ChatGPT and refined manually. To foster this task, we propose Mono3DVG-TR, an end-to-end transformer-based network that takes advantage of both the appearance and geometry information in text embeddings for multi-modal learning and 3D object localization. A depth predictor is designed to explicitly learn geometry features, and a dual text-guided adapter is proposed to refine the multiscale visual and geometry features of the referred object. Based on depth-text-visual stacking attention, the decoder fuses object-level geometric cues and visual appearance into a learnable query. Comprehensive benchmarks and insightful analyses are provided for Mono3DVG. Extensive comparisons and ablation studies show that our method significantly outperforms all baselines. The dataset and code will be publicly available at: https://github.com/ZhanYang-nwpu/Mono3DVG.
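
The decoder's depth-text-visual stacking attention is the fusion step worth pinning down: a learnable object query cross-attends to depth, text, and visual features in turn, each stage refining the query before the next. Below is a minimal PyTorch sketch of one such decoder layer; the class name, feature dimensions, attention ordering, and residual layout are illustrative assumptions based on the abstract, not the authors' released implementation.

```python
import torch
import torch.nn as nn

class StackingAttentionLayer(nn.Module):
    """Hypothetical sketch of one depth-text-visual stacking attention
    decoder layer: the object query attends to depth, then text, then
    visual features, each stage with a residual connection and layer norm."""

    def __init__(self, d_model: int = 256, n_heads: int = 8):
        super().__init__()
        self.depth_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.text_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.visual_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norms = nn.ModuleList(nn.LayerNorm(d_model) for _ in range(4))
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.ReLU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, query, depth_feats, text_feats, visual_feats):
        # Stage 1: gather object-level geometric cues from depth features.
        q = self.norms[0](query + self.depth_attn(query, depth_feats, depth_feats)[0])
        # Stage 2: condition the query on the language description.
        q = self.norms[1](q + self.text_attn(q, text_feats, text_feats)[0])
        # Stage 3: pull in the visual appearance of the referred object.
        q = self.norms[2](q + self.visual_attn(q, visual_feats, visual_feats)[0])
        # Feed-forward refinement of the fused query.
        return self.norms[3](q + self.ffn(q))

# Example: fuse a single learnable query with toy feature sequences.
layer = StackingAttentionLayer()
query = nn.Parameter(torch.zeros(1, 1, 256))   # one object query
depth = torch.randn(1, 1400, 256)              # flattened depth feature tokens
text = torch.randn(1, 24, 256)                 # text token embeddings
visual = torch.randn(1, 1400, 256)             # flattened visual feature tokens
fused = layer(query, depth, text, visual)      # shape: (1, 1, 256)
```

From the fused query, small MLP heads would then regress the 3D box parameters (projected center, depth, dimensions, orientation); that head design is likewise an assumption here, patterned after common monocular 3D detectors rather than taken from the paper.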
