FloCoDe: Unbiased Dynamic Scene Graph Generation with Temporal Consistency and Correlation Debiasing (2310.16073v3)
Abstract: Dynamic scene graph generation (SGG) from videos requires not only a comprehensive understanding of objects across scenes but also a method to capture the temporal motions and interactions with different objects. Moreover, the long-tailed distribution of visual relationships is a crucial bottleneck for most dynamic SGG methods. This is because many of them focus on capturing spatio-temporal context using complex architectures, leading to the generation of biased scene graphs. To address these challenges, we propose FloCoDe: Flow-aware Temporal Consistency and Correlation Debiasing with uncertainty attenuation for unbiased dynamic scene graphs. FloCoDe employs feature warping using flow to detect temporally consistent objects across frames. To address the long-tail issue of visual relationships, we propose correlation debiasing and a label correlation-based loss to learn unbiased relation representations for long-tailed classes. Specifically, we propose to incorporate label correlations using contrastive loss to capture commonly co-occurring relations, which aids in learning robust representations for long-tailed classes. Further, we adopt the uncertainty attenuation-based classifier framework to handle noisy annotations in the SGG data. Extensive experimental evaluation shows a performance gain as high as 4.1%, demonstrating the superiority of generating more unbiased scene graphs.
- Vqa: Visual question answering. In Proceedings of the IEEE international conference on computer vision, pages 2425–2433, 2015.
- Self-supervised augmentation consistency for adapting semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15384–15394, 2021.
- Vivit: A video vision transformer. In Proceedings of the IEEE/CVF international conference on computer vision, pages 6836–6846, 2021.
- End-to-end object detection with transformers. In European conference on computer vision, pages 213–229. Springer, 2020.
- Knowledge-embedded routing network for scene graph generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6163–6171, 2019.
- Uncertainty estimation in deep neural networks for dermoscopic image classification. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition workshops, pages 744–745, 2020.
- Spatial-temporal transformer for dynamic scene graph generation. In Proceedings of the IEEE/CVF international conference on computer vision, pages 16372–16382, 2021.
- Learning of visual relations: The devil is in the tails. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 15404–15413, 2021.
- Dual encoding for video retrieval by text. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(8):4065–4080, 2021.
- Flownet: Learning optical flow with convolutional networks. In Proceedings of the IEEE international conference on computer vision, pages 2758–2766, 2015.
- Semi-supervised semantic segmentation needs strong, varied perturbations. arXiv preprint arXiv:1906.01916, 2019.
- A simple baseline for weakly-supervised human-centric relation detection. 2021.
- A survey on vision transformer. IEEE transactions on pattern analysis and machine intelligence, 45(1):87–110, 2022.
- Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
- Mask r-cnn. In Proceedings of the IEEE international conference on computer vision, pages 2961–2969, 2017.
- Three ways to improve semantic segmentation with self-supervised depth estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11130–11140, 2021.
- Action genome: Actions as compositions of spatio-temporal scene graphs. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10236–10247, 2020.
- Detecting human-object relationships in videos. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 8106–8116, 2021.
- Prompting visual-language models for efficient video understanding. In European Conference on Computer Vision, pages 105–124. Springer, 2022.
- Anant Khandelwal. Segda: Maximum separable segment mask with pseudo labels for domain adaptive semantic segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 2158–2168, 2023.
- Iterative scene graph generation. Advances in Neural Information Processing Systems, 35:24295–24308, 2022.
- Visual genome: Connecting language and vision using crowdsourced dense image annotations. International journal of computer vision, 123:32–73, 2017.
- Sgtr: End-to-end scene graph generation with transformer. In proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 19486–19496, 2022a.
- Scene graph generation from objects, phrases and region captions. In Proceedings of the IEEE international conference on computer vision, pages 1261–1270, 2017.
- Interventional video relation detection. In Proceedings of the 29th ACM International Conference on Multimedia, pages 4091–4099, 2021.
- Dynamic scene graph generation via anticipatory pre-training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13874–13883, 2022b.
- Gps-net: Graph property sensing network for scene graph generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3746–3753, 2020.
- Beyond short-term snippet: Video relation detection with spatio-temporal global context. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10840–10849, 2020.
- Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.
- Visual relationship detection with language priors. In Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11–14, 2016, Proceedings, Part I 14, pages 852–869. Springer, 2016.
- Unbiased scene graph generation in videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22803–22813, 2023.
- Activity graph transformer for temporal action localization. arXiv preprint arXiv:2101.08540, 2021.
- Prevalence of neural collapse during the terminal phase of deep learning training. Proceedings of the National Academy of Sciences, 117(40):24652–24663, 2020.
- Spatial-temporal knowledge-embedded transformer for video scene graph generation. arXiv preprint arXiv:2309.13237, 2023.
- Video relation detection with spatio-temporal graph. In Proceedings of the 27th ACM international conference on multimedia, pages 84–93, 2019.
- Faster r-cnn: Towards real-time object detection with region proposal networks. Advances in neural information processing systems, 28, 2015.
- Hollywood in homes: Crowdsourcing data collection for activity understanding. In Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11–14, 2016, Proceedings, Part I 14, pages 510–526. Springer, 2016.
- Neural bridge sampling for evaluating safety-critical autonomous systems. Advances in Neural Information Processing Systems, 33:6402–6416, 2020.
- Concept-based video retrieval. Foundations and Trends® in Information Retrieval, 2(4):215–322, 2009.
- Videobert: A joint model for video and language representation learning. In Proceedings of the IEEE/CVF international conference on computer vision, pages 7464–7473, 2019.
- Learning to compose dynamic tree structures for visual contexts. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6619–6628, 2019.
- Unbiased scene graph generation from biased training. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 3716–3725, 2020.
- Movieqa: Understanding stories in movies through question-answering. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4631–4640, 2016.
- Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results. Advances in neural information processing systems, 30, 2017.
- Target adaptive context aggregation for video scene graph generation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 13688–13697, 2021.
- Dacs: Domain adaptation via cross-domain mixed sampling. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 1379–1389, 2021.
- Video relationship reasoning using gated spatio-temporal energy graph. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10424–10433, 2019.
- Attention is all you need. Advances in neural information processing systems, 30, 2017.
- Dynamic scene graph generation via temporal prior inference. In Proceedings of the 30th ACM International Conference on Multimedia, pages 5793–5801, 2022.
- Neural multimodal cooperative learning toward micro-video understanding. IEEE Transactions on Image Processing, 29:1–14, 2019.
- Unbiased scene graph generation via rich and fair semantic extraction. arXiv preprint arXiv:2002.00176, 2020.
- Next-qa: Next phase of question-answering to explaining temporal actions. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9777–9786, 2021.
- Show, attend and tell: Neural image caption generation with visual attention. In International conference on machine learning, pages 2048–2057. PMLR, 2015.
- Meta spatio-temporal debiasing for video scene graph generation. In European Conference on Computer Vision, pages 374–390. Springer, 2022.
- Neural collapse inspired feature-classifier alignment for few-shot class incremental learning. arXiv preprint arXiv:2302.03004, 2023.
- Neural motifs: Scene graph parsing with global context. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5831–5840, 2018.
- Graphical contrastive losses for scene graph parsing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11535–11543, 2019.
- Vrdformer: End-to-end video visual relation detection with transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18836–18846, 2022.
- Flow-guided feature aggregation for video object detection. In Proceedings of the IEEE international conference on computer vision, pages 408–417, 2017a.
- Deep feature flow for video recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2349–2358, 2017b.
- Anant Khandelwal (13 papers)