
TextPSG: Panoptic Scene Graph Generation from Textual Descriptions (2310.07056v1)

Published 10 Oct 2023 in cs.CV

Abstract: The Panoptic Scene Graph task has recently been proposed for comprehensive scene understanding. However, previous works adopt a fully supervised learning paradigm, requiring large amounts of pixel-wise, densely annotated data, which is tedious and expensive to obtain. To address this limitation, we study a new problem of Panoptic Scene Graph Generation from Purely Textual Descriptions (Caption-to-PSG). The key idea is to leverage the large collection of free image-caption data on the Web alone to generate panoptic scene graphs. The problem is very challenging due to three constraints: 1) no location priors; 2) no explicit links between visual regions and textual entities; and 3) no pre-defined concept sets. To tackle this problem, we propose a new framework, TextPSG, consisting of four modules, i.e., a region grouper, an entity grounder, a segment merger, and a label generator, with several novel techniques. The region grouper first groups image pixels into different segments, and the entity grounder then aligns visual segments with language entities based on the textual description of the segment being referred to. The grounding results can thus serve as pseudo labels, enabling the segment merger to learn segment similarity and guiding the label generator to learn object semantics and relation predicates, resulting in fine-grained, structured scene understanding. Our framework is effective, significantly outperforming the baselines and achieving strong out-of-distribution robustness. We perform comprehensive ablation studies to corroborate the effectiveness of our design choices and provide an in-depth analysis to highlight future directions. Our code, data, and results are available on our project page: https://vis-www.cs.umass.edu/TextPSG.
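The abstract's four-module pipeline can be pictured with a short sketch. The Python code below only illustrates the data flow (pixels to segments, segments to grounded entities, grounded entities to merged segments and relation triplets); the module names follow the paper, but every function body (the half-image grouping, the order-based grounding, and the toy caption parse) is a placeholder assumption, not the authors' actual method.

```python
# Minimal sketch of the Caption-to-PSG data flow described in the abstract.
# Module names mirror the paper; all internals are illustrative placeholders.

from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class Segment:
    pixels: List[Tuple[int, int]]   # pixel coordinates belonging to this segment
    entity: Optional[str] = None    # language entity grounded to this segment

def region_grouper(image):
    """Group pixels into coarse segments (placeholder: split the image into halves)."""
    h, w = len(image), len(image[0])
    left = [(r, c) for r in range(h) for c in range(w // 2)]
    right = [(r, c) for r in range(h) for c in range(w // 2, w)]
    return [Segment(left), Segment(right)]

def entity_grounder(segments, entities):
    """Align segments with caption entities (placeholder: naive order-based matching)."""
    for seg, ent in zip(segments, entities):
        seg.entity = ent
    return segments

def segment_merger(segments):
    """Merge segments grounded to the same entity, mimicking the pseudo-label signal."""
    merged = {}
    for seg in segments:
        if seg.entity in merged:
            merged[seg.entity].pixels.extend(seg.pixels)
        else:
            merged[seg.entity] = Segment(list(seg.pixels), seg.entity)
    return list(merged.values())

def label_generator(caption, segments):
    """Turn the caption into (subject, predicate, object) triplets (toy parse)."""
    words = caption.lower().rstrip(".").split()
    # Assumes a "<subject> <predicate...> <object>" caption, for illustration only.
    if len(words) >= 3:
        return [(words[0], " ".join(words[1:-1]), words[-1])]
    return []

if __name__ == "__main__":
    image = [[0] * 8 for _ in range(8)]   # dummy 8x8 "image"
    caption = "person riding horse"
    entities = ["person", "horse"]        # entities assumed to be parsed from the caption

    segments = segment_merger(entity_grounder(region_grouper(image), entities))
    print([s.entity for s in segments])          # -> ['person', 'horse']
    print(label_generator(caption, segments))    # -> [('person', 'riding', 'horse')]
```

In the actual framework, the grounding scores act as pseudo labels that supervise the segment merger and the label generator; in this sketch both are purely rule-based stand-ins for the learned components.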

Authors (5)
  1. Chengyang Zhao (7 papers)
  2. Yikang Shen (62 papers)
  3. Zhenfang Chen (36 papers)
  4. Mingyu Ding (82 papers)
  5. Chuang Gan (195 papers)
Citations (10)
