Bridging Modality Gap for Visual Grounding with Effective Cross-modal Distillation (2312.17648v2)

Published 29 Dec 2023 in cs.CV and cs.AI

Abstract: Visual grounding aims to align visual information from specific image regions with corresponding natural language expressions. Current visual grounding methods leverage independently pre-trained visual and language backbones to extract visual and linguistic features. Although these two types of features are then fused through elaborately designed networks, their heterogeneity makes them ill-suited for multi-modal reasoning. The problem stems from the domain gap between the single-modal pre-trained backbones used in current visual grounding methods, a gap that conventional end-to-end training can hardly bridge. To alleviate this, our work proposes the Empowering Pre-trained Model for Visual Grounding (EpmVG) framework, which distills a multimodal pre-trained model to guide the visual grounding task. EpmVG relies on a novel cross-modal distillation mechanism that effectively transfers the image-text consistency information captured by the pre-trained model, reducing the domain gap between the backbone networks and thereby improving performance on the visual grounding task. Extensive experiments on five widely used datasets demonstrate that our method outperforms state-of-the-art methods.
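
The abstract describes the distillation mechanism only at a high level. For intuition, below is a minimal, hypothetical PyTorch sketch of a cross-modal distillation loss of this general kind: the grounding model's visual and textual features are projected into the embedding space of a frozen multimodal teacher (e.g., a CLIP-style model) and matched to it, together with an image-text consistency term. Every name, dimension, and loss term here is an illustrative assumption, not the paper's actual EpmVG implementation.

```python
# Illustrative sketch only -- NOT the paper's actual implementation.
# Assumes a CLIP-like teacher whose image/text embeddings share one space.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossModalDistillLoss(nn.Module):
    """Pull the student's visual/text features toward a frozen multimodal
    teacher's aligned embedding space (a hypothetical form of the
    cross-modal distillation the abstract describes)."""

    def __init__(self, student_dim: int, teacher_dim: int):
        super().__init__()
        # Linear projections so student features are comparable to the teacher's.
        self.proj_v = nn.Linear(student_dim, teacher_dim)
        self.proj_t = nn.Linear(student_dim, teacher_dim)

    def forward(self, sv, st, tv, tt):
        # sv, st: student visual / text features        (B, student_dim)
        # tv, tt: frozen teacher visual / text features (B, teacher_dim)
        sv = F.normalize(self.proj_v(sv), dim=-1)
        st = F.normalize(self.proj_t(st), dim=-1)
        tv = F.normalize(tv, dim=-1)
        tt = F.normalize(tt, dim=-1)
        # Feature-matching terms: align each student modality with the teacher
        # via cosine distance (1 - cosine similarity).
        l_feat = (1 - (sv * tv).sum(-1)).mean() + (1 - (st * tt).sum(-1)).mean()
        # Consistency term: the student's image-text similarity structure
        # should mimic the teacher's, transferring cross-modal alignment.
        l_consist = F.mse_loss(sv @ st.t(), tv @ tt.t())
        return l_feat + l_consist
```

In a setup like this, the distillation term would simply be added to the standard grounding (box-regression) loss during training, nudging both backbones toward the teacher's shared image-text space while the model still learns the grounding objective.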

Authors (6)
  1. Jiaxi Wang (9 papers)
  2. Wenhui Hu (4 papers)
  3. Xueyang Liu (3 papers)
  4. Beihu Wu (1 paper)
  5. Yuting Qiu (1 paper)
  6. YingYing Cai (3 papers)