Unbiased Scene Graph Generation in Videos (2304.00733v3)

Published 3 Apr 2023 in cs.CV

Abstract: Dynamic scene graph generation (SGG) from videos is challenging: on top of the difficulties already present in image-based SGG, it must cope with the inherent dynamics of a scene, temporal fluctuation of model predictions, and the long-tailed distribution of visual relationships. Existing methods for dynamic SGG have primarily focused on capturing spatio-temporal context using complex architectures without addressing these challenges, especially the long-tailed distribution of relationships, which often leads to biased scene graphs. To address these challenges, we introduce a new framework called TEMPURA: TEmporal consistency and Memory Prototype guided UnceRtainty Attenuation for unbiased dynamic SGG. TEMPURA employs object-level temporal consistencies via transformer-based sequence modeling, learns to synthesize unbiased relationship representations using memory-guided training, and attenuates the predictive uncertainty of visual relations using a Gaussian Mixture Model (GMM). Extensive experiments demonstrate that our method achieves significant performance gains (up to 10% in some cases) over existing methods, highlighting its superiority in generating more unbiased scene graphs.
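
To make the GMM-based uncertainty idea in the abstract more concrete, the following is a minimal sketch of how predictive uncertainty can be read off a one-dimensional Gaussian mixture using the law of total variance. It is an illustration under stated assumptions, not the TEMPURA implementation: the function name `mixture_uncertainty` and the split into "aleatoric" and "epistemic" terms are chosen here for illustration, and the abstract does not specify how the paper computes or uses these quantities.

```python
from typing import Sequence, Tuple


def mixture_uncertainty(
    weights: Sequence[float],
    means: Sequence[float],
    variances: Sequence[float],
) -> Tuple[float, float, float]:
    """Decompose the variance of a 1-D Gaussian mixture (hypothetical helper).

    Returns (mixture_mean, aleatoric, epistemic), where
      aleatoric = sum_k pi_k * sigma_k^2          (average within-component variance)
      epistemic = sum_k pi_k * (mu_k - mean)^2    (spread of the component means)
    and aleatoric + epistemic is the total variance of the mixture.
    """
    if not (len(weights) == len(means) == len(variances)):
        raise ValueError("weights, means and variances must have equal length")
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("mixture weights must sum to a positive value")

    # Normalise the weights so they form a valid probability distribution.
    pi = [w / total_weight for w in weights]

    mixture_mean = sum(p * m for p, m in zip(pi, means))
    aleatoric = sum(p * v for p, v in zip(pi, variances))
    epistemic = sum(p * (m - mixture_mean) ** 2 for p, m in zip(pi, means))
    return mixture_mean, aleatoric, epistemic


if __name__ == "__main__":
    mean, aleatoric, epistemic = mixture_uncertainty(
        weights=[0.5, 0.5], means=[0.0, 2.0], variances=[1.0, 3.0]
    )
    print(mean, aleatoric, epistemic)
```

In a setup like this, a loss could down-weight predictions whose estimated uncertainty is high; whether and how TEMPURA does so is described in the paper body, not in the abstract.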
