Answering from Sure to Uncertain: Uncertainty-Aware Curriculum Learning for Video Question Answering

Published 3 Jan 2024 in cs.CV | (2401.01510v1)

Abstract: While significant advancements have been made in video question answering (VideoQA), the potential benefits of enhancing model generalization through tailored difficulty scheduling have been largely overlooked in existing research. This paper seeks to bridge that gap by incorporating VideoQA into a curriculum learning (CL) framework that progressively trains models from simpler to more complex data. Recognizing that conventional self-paced CL methods rely on training loss for difficulty measurement, which might not accurately reflect the intricacies of video-question pairs, we introduce the concept of uncertainty-aware CL. Here, uncertainty serves as the guiding principle for dynamically adjusting the difficulty. Furthermore, we address the challenge posed by uncertainty by presenting a probabilistic modeling approach for VideoQA. Specifically, we conceptualize VideoQA as a stochastic computation graph, where the hidden representations are treated as stochastic variables. This yields two distinct types of uncertainty: one related to the inherent uncertainty in the data and another pertaining to the model's confidence. In practice, we seamlessly integrate the VideoQA model into our framework and conduct comprehensive experiments. The findings affirm that our approach not only achieves enhanced performance but also effectively quantifies uncertainty in the context of VideoQA.
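The abstract names two ingredients: an uncertainty score split into a data-related part and a model-related part, and a curriculum that admits samples from "sure" to "uncertain". The sketch below illustrates that idea in plain Python and is not the paper's implementation. It assumes the common entropy-based decomposition over several stochastic forward passes, plus a hard self-paced threshold. The function names (`entropy`, `decompose_uncertainty`, `self_paced_weights`) and the threshold values are hypothetical.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a discrete answer distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def decompose_uncertainty(mc_probs):
    """Split predictive uncertainty using Monte Carlo answer distributions.

    mc_probs holds T probability vectors, one per stochastic forward pass
    (for example, one per sample of a stochastic hidden representation).
    Assumption: the entropy-based split, not necessarily the paper's own.
    Returns (data_uncertainty, model_uncertainty).
    """
    t = len(mc_probs)
    n_answers = len(mc_probs[0])
    mean_probs = [sum(p[a] for p in mc_probs) / t for a in range(n_answers)]
    total = entropy(mean_probs)                      # entropy of the averaged prediction
    data = sum(entropy(p) for p in mc_probs) / t     # average per-pass entropy
    model = max(total - data, 0.0)                   # disagreement between passes
    return data, model


def self_paced_weights(difficulties, threshold):
    """Hard self-paced weights: 1.0 trains on a sample now, 0.0 defers it."""
    return [1.0 if d <= threshold else 0.0 for d in difficulties]


if __name__ == "__main__":
    samples = {
        "confident": [[0.9, 0.1], [0.9, 0.1]],
        "ambiguous": [[0.5, 0.5], [0.5, 0.5]],
        "disagreeing": [[0.9, 0.1], [0.1, 0.9]],
    }
    scores = [sum(decompose_uncertainty(p)) for p in samples.values()]
    # Raising the threshold over training admits more uncertain samples.
    for threshold in (0.5, 0.7):
        print(threshold, self_paced_weights(scores, threshold))
```

In this toy setup the "ambiguous" sample has only data uncertainty (every pass is equally unsure), while the "disagreeing" sample has mostly model uncertainty (each pass is confident but the passes contradict one another). Using uncertainty rather than training loss as the difficulty score is the swap the abstract describes; how the threshold should grow is left open here.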
