
WaterVG: Waterway Visual Grounding based on Text-Guided Vision and mmWave Radar (2403.12686v3)

Published 19 Mar 2024 in cs.CV, cs.MM, and cs.RO

Abstract: Perception of waterways guided by human intent is important for the autonomous navigation and operation of Unmanned Surface Vehicles (USVs) in water environments. Inspired by visual grounding, we introduce WaterVG, the first visual grounding dataset designed for USV-based waterway perception driven by human prompts. WaterVG encompasses prompts describing multiple targets, with instance-level annotations including bounding boxes and masks. Notably, WaterVG contains 11,568 samples with 34,987 referred targets, whose prompts integrate both visual and radar characteristics; this text-guided use of the two sensors gives text prompts finer-grained access to the visual and radar features of referred targets. Moreover, we propose a low-power visual grounding model, Potamoi, a multi-task model built around a carefully designed Phased Heterogeneous Modality Fusion (PHMF) scheme comprising Adaptive Radar Weighting (ARW) and Multi-Head Slim Cross Attention (MHSCA). Specifically, ARW extracts the radar features required for fusion with vision to align with the prompt. MHSCA is an efficient fusion module with a remarkably small parameter count and low FLOPs that fuses the scenario context captured by the two sensors with linguistic features, performing strongly on visual grounding tasks. Comprehensive experiments and evaluations on WaterVG show that Potamoi achieves state-of-the-art performance compared with its counterparts.
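The fusion pattern the abstract describes — aligning linguistic features with vision and radar context via cross attention — can be sketched in a minimal single-head form. This is an illustrative assumption, not the paper's MHSCA module (which is multi-head and parameter-efficient); all function names, weight matrices, and shapes below are hypothetical:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention_fusion(text, vision, radar, Wq, Wk, Wv):
    """Single-head cross attention: text tokens act as queries over
    the concatenated vision and radar tokens (all shapes (N, d))."""
    ctx = np.concatenate([vision, radar], axis=0)   # (Nv + Nr, d)
    Q = text @ Wq                                   # (Nt, d)
    K = ctx @ Wk                                    # (Nv + Nr, d)
    V = ctx @ Wv                                    # (Nv + Nr, d)
    scores = Q @ K.T / np.sqrt(Q.shape[-1])         # scaled dot product
    return softmax(scores, axis=-1) @ V             # (Nt, d) fused features
```

Each text token ends up as a mixture of sensor tokens weighted by prompt relevance, which is the basic mechanism behind text-guided two-sensor fusion; the paper's contribution lies in making this step cheap in parameters and FLOPs.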
