How to Understand "Support"? An Implicit-enhanced Causal Inference Approach for Weakly-supervised Phrase Grounding (2402.19116v2)
Abstract: Weakly-supervised Phrase Grounding (WPG) is an emerging task that infers fine-grained phrase-region matching while using only coarse-grained sentence-image pairs for training. However, existing studies on WPG largely ignore implicit phrase-region matching relations, which are crucial for evaluating how well models understand deep multimodal semantics. To this end, this paper proposes an Implicit-Enhanced Causal Inference (IECI) approach that addresses two challenges: modeling the implicit relations, and highlighting them beyond the explicit ones. Specifically, the approach uses intervention to tackle the first challenge and counterfactual techniques to tackle the second. Furthermore, a high-quality implicit-enhanced dataset is annotated to evaluate IECI, and detailed evaluations show substantial advantages of IECI over state-of-the-art baselines. In particular, IECI outperforms advanced multimodal LLMs by a large margin on this implicit-enhanced dataset, which may encourage further research on evaluating multimodal LLMs in this direction.