Towards Unseen Triples: Effective Text-Image-joint Learning for Scene Graph Generation (2306.13420v1)
Abstract: Scene Graph Generation (SGG) aims to represent objects and their relationships in images in a structured and comprehensive way, which can significantly benefit scene understanding and related downstream tasks. Existing SGG models often struggle to overcome the long-tailed problem caused by biased datasets. However, even when these models fit a specific dataset well, they may still fail on unseen triples that are not included in the training set. Most methods feed a whole triple into the model and learn its overall features through statistical machine learning; such models have difficulty predicting unseen triples because the objects and predicates from the training set are combined into novel triples in the test set. In this work, we propose a Text-Image-joint Scene Graph Generation (TISGG) model to resolve unseen triples and improve the generalization capability of SGG models. We propose a Joint Feature Learning (JFL) module and a Factual Knowledge based Refinement (FKR) module that learn object and predicate categories separately at the feature level and align them with the corresponding visual features, so that the model is no longer limited to triple matching. In addition, since we observe that the long-tailed problem also affects generalization, we design a novel balanced learning strategy, consisting of a Character Guided Sampling (CGS) module and an Informative Re-weighting (IR) module, to provide a tailor-made learning method for each predicate according to its characteristics. Extensive experiments show that our model achieves state-of-the-art performance. In particular, TISGG improves zero-shot recall (zR@20) by 11.7% on the PredCls sub-task of the Visual Genome dataset.
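To make the core idea of the JFL module concrete, the sketch below illustrates one plausible way to align category-level text features with visual features so that objects and predicates are scored from their components rather than from whole-triple matching. This is not the authors' implementation: the module name, dimensions, the randomly initialized embedding table (a pre-trained table such as GloVe could be substituted), and the temperature-scaled cross-entropy alignment loss are all assumptions made for illustration.

```python
# Minimal sketch of text-visual feature alignment for category prediction.
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointFeatureAlignment(nn.Module):
    """Projects category text embeddings and visual features into a shared space."""

    def __init__(self, num_classes: int, text_dim: int = 300,
                 visual_dim: int = 2048, joint_dim: int = 512):
        super().__init__()
        # Placeholder text embeddings for category labels; pre-trained word
        # vectors could be loaded here instead of random initialization.
        self.text_embed = nn.Embedding(num_classes, text_dim)
        self.text_proj = nn.Linear(text_dim, joint_dim)
        self.visual_proj = nn.Linear(visual_dim, joint_dim)

    def forward(self, visual_feats: torch.Tensor) -> torch.Tensor:
        """Return similarity logits between visual features and every category."""
        v = F.normalize(self.visual_proj(visual_feats), dim=-1)          # (B, D)
        t = F.normalize(self.text_proj(self.text_embed.weight), dim=-1)  # (C, D)
        return v @ t.t()                                                  # (B, C)

# Usage: align object (or, analogously, predicate) visual features with their
# category labels via cross-entropy over the similarity logits.
model = JointFeatureAlignment(num_classes=150)
visual_feats = torch.randn(8, 2048)       # e.g. RoI features from a detector
labels = torch.randint(0, 150, (8,))      # ground-truth object categories
logits = model(visual_feats)
loss = F.cross_entropy(logits / 0.07, labels)  # temperature is an assumed hyper-parameter
loss.backward()
```

Because objects and predicates are each aligned to their own category features rather than memorized as whole triples, an unseen combination at test time can, in principle, still be scored from its individual components, which is the generalization behaviour the abstract attributes to TISGG.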