NICEST: Noisy Label Correction and Training for Robust Scene Graph Generation (2207.13316v2)

Published 27 Jul 2022 in cs.CV

Abstract: Nearly all existing scene graph generation (SGG) models have overlooked the ground-truth annotation quality of mainstream SGG datasets, i.e., they assume: 1) all manually annotated positive samples are equally correct; and 2) all un-annotated negative samples are absolutely background. In this paper, we argue that neither assumption holds for SGG: there are numerous noisy ground-truth predicate labels that break these two assumptions and harm the training of unbiased SGG models. To this end, we propose a novel NoIsy label CorrEction and Sample Training strategy for SGG: NICEST. Specifically, it consists of two parts, NICE and NIST, which address these noisy label issues by generating high-quality samples and by providing an effective training strategy, respectively. NICE first detects noisy samples and then reassigns them higher-quality soft predicate labels. NIST is a multi-teacher knowledge-distillation-based training strategy that enables the model to learn unbiased fused knowledge, and a dynamic trade-off weighting strategy in NIST is designed to penalize the bias of different teachers. Due to the model-agnostic nature of both NICE and NIST, our NICEST can be seamlessly incorporated into any SGG architecture to boost its performance on different predicate categories. In addition, to better evaluate the generalization of SGG models, we further propose a new benchmark, VG-OOD, which re-organizes the prevalent VG dataset and deliberately makes the predicate distributions of the training and test sets as different as possible for each subject-object category pair. This new benchmark helps disentangle the influence of subject-object category-based frequency biases. Extensive ablations and results on different backbones and tasks attest to the effectiveness and generalization ability of each component of NICEST.
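To make the NIST idea more concrete, below is a minimal, illustrative sketch of multi-teacher knowledge distillation with dynamic per-teacher weights, written in PyTorch. It is not the paper's implementation: the weighting rule (down-weighting teachers whose predictions diverge more from the student), the temperature, and all names such as `multi_teacher_kd_loss` are assumptions made purely for illustration.

```python
# Minimal sketch of multi-teacher knowledge distillation with dynamic
# per-teacher weights (illustrative only; not the authors' exact NIST loss).
import torch
import torch.nn.functional as F


def multi_teacher_kd_loss(student_logits, teacher_logits_list, temperature=2.0):
    """Distill a student from several teachers.

    student_logits:      (batch, num_predicates) raw scores from the student.
    teacher_logits_list: list of (batch, num_predicates) tensors, one per teacher.

    The per-teacher weights used here are an assumption: teachers whose
    predictions diverge more from the student are down-weighted, loosely
    mimicking the "penalize biased teachers" idea from the abstract.
    """
    log_p_student = F.log_softmax(student_logits / temperature, dim=-1)

    # KL divergence to each teacher, averaged over the batch.
    kls = []
    for t_logits in teacher_logits_list:
        p_teacher = F.softmax(t_logits / temperature, dim=-1)
        kls.append(F.kl_div(log_p_student, p_teacher, reduction="batchmean"))
    kls = torch.stack(kls)                     # (num_teachers,)

    # Dynamic trade-off weights: softmax over negative divergence, so
    # higher-divergence teachers contribute less. Detached so the weights
    # themselves carry no gradient.
    weights = F.softmax(-kls.detach(), dim=0)  # (num_teachers,)

    return (weights * kls).sum() * temperature ** 2


# Toy usage with random logits standing in for real SGG predictions.
if __name__ == "__main__":
    student = torch.randn(8, 51, requires_grad=True)   # e.g. 50 predicates + background
    teachers = [torch.randn(8, 51) for _ in range(3)]
    loss = multi_teacher_kd_loss(student, teachers)
    loss.backward()
    print(f"distillation loss: {loss.item():.4f}")
```

In NICEST's setting, the teachers would presumably be SGG models with complementary biases, and a term of this kind would be combined with the usual predicate classification loss; both details are assumptions here rather than statements about the paper's exact design.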
