Supplementing Missing Visions via Dialog for Scene Graph Generations (2204.11143v2)
Abstract: Most current AI systems rely on the premise that the input visual data are sufficient to achieve competitive performance on various computer vision tasks. However, the classic task setup rarely considers the challenging yet common practical situations where the complete visual data may be inaccessible for various reasons (e.g., restricted view range and occlusions). To this end, we investigate a computer vision task setting with incomplete visual input data. Specifically, we study the Scene Graph Generation (SGG) task with various levels of missing visual data as input. While insufficient visual input intuitively leads to a performance drop, we propose to supplement the missing visions via natural language dialog interactions to better accomplish the task objective. We design a model-agnostic Supplementary Interactive Dialog (SI-Dial) framework that can be jointly learned with most existing models, endowing current AI systems with the ability to conduct question-answer interactions in natural language. Through extensive experiments and analyses, we demonstrate the feasibility of such a task setting with missing visual input and the effectiveness of our proposed dialog module as a supplementary information source, achieving promising performance improvements over multiple baselines.
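The abstract describes SI-Dial only at a high level, but the core mechanism, supplementing masked visual features with embeddings of natural-language question-answer rounds before a downstream SGG model consumes them, can be sketched concretely. Below is a minimal, illustrative PyTorch sketch: the class name `SIDialSketch`, the feature dimensions, the number of dialog rounds, and the attention-based fusion are all assumptions made for exposition, not the paper's actual architecture.

```python
# Illustrative sketch (assumptions throughout): visual region features with
# simulated "missingness" are supplemented by embeddings of QA dialog rounds
# before being passed to any downstream scene graph generation model.
import torch
import torch.nn as nn


class SIDialSketch(nn.Module):
    """Model-agnostic wrapper: masks unavailable regions, then fuses
    dialog-derived embeddings to compensate for the missing visual input."""

    def __init__(self, feat_dim=512, qa_dim=384, num_rounds=3):
        super().__init__()
        self.num_rounds = num_rounds
        # Project each QA-pair embedding (e.g., from a sentence encoder)
        # into the visual feature space so the two modalities can be fused.
        self.qa_proj = nn.Linear(qa_dim, feat_dim)
        # Attention over dialog rounds, queried by the masked region features.
        self.attn = nn.MultiheadAttention(feat_dim, num_heads=4, batch_first=True)
        self.fuse = nn.Linear(2 * feat_dim, feat_dim)

    def forward(self, vis_feats, missing_mask, qa_embeds):
        # vis_feats:    (B, N, feat_dim)  region features from a detector
        # missing_mask: (B, N)            1 = region is unavailable/occluded
        # qa_embeds:    (B, R, qa_dim)    one embedding per dialog round
        masked = vis_feats * (1 - missing_mask).unsqueeze(-1)  # zero out missing regions
        dialog = self.qa_proj(qa_embeds)                       # (B, R, feat_dim)
        # Each (possibly masked) region attends to the dialog history.
        supp, _ = self.attn(masked, dialog, dialog)            # (B, N, feat_dim)
        return self.fuse(torch.cat([masked, supp], dim=-1))    # supplemented features


if __name__ == "__main__":
    B, N, R = 2, 10, 3
    model = SIDialSketch()
    vis = torch.randn(B, N, 512)
    mask = (torch.rand(B, N) < 0.3).float()   # ~30% of regions "missing"
    qa = torch.randn(B, R, 384)               # stand-in for QA-pair embeddings
    out = model(vis, mask, qa)
    print(out.shape)  # torch.Size([2, 10, 512]); feed into any SGG backbone
```

Since the output keeps the same shape as the detector features, the wrapper can sit in front of an arbitrary SGG backbone, which is one plausible reading of the "model-agnostic" claim; attention over dialog rounds lets each occluded region select whichever QA exchanges are most informative about it.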
Authors: Zhenghao Zhao, Ye Zhu, Xiaoguang Zhu, Yuzhang Shang, Yan Yan