SPAN: Learning Similarity between Scene Graphs and Images with Transformers (2304.00590v2)

Published 2 Apr 2023 in cs.CV

Abstract: Learning similarity between scene graphs and images aims to estimate a similarity score given a scene graph and an image. No existing research is dedicated to this task, although it is critical for scene graph generation and its downstream applications. Scene graph generation is conventionally evaluated by Recall@K and mean Recall@K, which measure the fraction of human-labeled ground-truth triplets recovered among the top-K predicted triplets. However, such triplet-oriented metrics fail to capture the overall semantic difference between a scene graph and an image and are sensitive to annotation bias and noise, which limits the use of generated scene graphs in downstream applications. To address this issue, we propose, for the first time, a Scene graPh-imAge coNtrastive learning framework, SPAN, that measures the similarity between scene graphs and images. Our framework pairs a graph Transformer with an image Transformer to align scene graphs and their corresponding images in a shared latent space, and we introduce a novel graph serialization technique that transforms a scene graph into a sequence with structural encodings. Building on this framework, we propose R-Precision, which measures image retrieval accuracy, as a new evaluation metric for scene graph generation, and we establish new benchmarks on the Visual Genome and Open Images datasets. Extensive experiments verify the effectiveness of SPAN, which shows great potential as a scene graph encoder.
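Since the abstract only sketches the architecture, the snippet below is a minimal, illustrative PyTorch sketch of the kind of dual-encoder contrastive alignment it describes: a Transformer over a serialized scene graph, a symmetric InfoNCE-style loss pairing graph and image embeddings, and an R-Precision helper. Every module name, dimension, and the structural-encoding scheme here is an assumption made for illustration, not the authors' implementation.

```python
# Illustrative sketch only: module names, dimensions, and the structural
# encoding below are assumptions mirroring the abstract's description,
# not the SPAN authors' released code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GraphEncoder(nn.Module):
    """Encodes a serialized scene graph (a token sequence with structural
    encodings) into a single normalized embedding via a Transformer."""
    def __init__(self, vocab_size, d_model=256, n_layers=4, n_heads=8):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        # Hypothetical structural encoding: one embedding per structural
        # role, e.g. subject / predicate / object position in a triplet.
        self.struct_emb = nn.Embedding(8, d_model)
        layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.proj = nn.Linear(d_model, d_model)

    def forward(self, token_ids, struct_ids):
        # token_ids, struct_ids: (batch, seq_len) long tensors
        x = self.tok_emb(token_ids) + self.struct_emb(struct_ids)
        h = self.encoder(x)
        # Mean-pool the sequence into one graph embedding, then normalize.
        return F.normalize(self.proj(h.mean(dim=1)), dim=-1)

def contrastive_loss(graph_emb, image_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired graph/image
    embeddings, in the style of CLIP-like contrastive pre-training."""
    logits = graph_emb @ image_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

def r_precision(sim_matrix, r=1):
    """Fraction of queries whose paired image (assumed to sit on the
    diagonal of the graph-to-image similarity matrix) is retrieved
    within the top-r candidates."""
    top_r = sim_matrix.topk(r, dim=1).indices
    gt = torch.arange(sim_matrix.size(0), device=sim_matrix.device)
    return (top_r == gt.unsqueeze(1)).any(dim=1).float().mean().item()
```

In this sketch, `image_emb` would come from an analogous image Transformer (e.g., a ViT-style encoder projected into the same latent space), and `r_precision` would be evaluated over a held-out candidate pool in which each serialized scene graph queries for its paired image.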
