
Knowledge-enhanced Multi-perspective Video Representation Learning for Scene Recognition

Published 9 Jan 2024 in cs.CV, cs.AI, and cs.MM (arXiv:2401.04354v1)

Abstract: With the explosive growth of video data in real-world applications, comprehensive video representations are becoming increasingly important. In this paper, we address video scene recognition, the goal of which is to learn a high-level video representation for classifying the scenes in videos. Owing to the diversity and complexity of video content in realistic scenarios, this task remains challenging. Most existing works identify scenes in videos only from visual or textual information in a temporal perspective, ignoring the valuable information hidden in single frames, while several earlier studies recognize scenes only in separate images from a non-temporal perspective. We argue that these two perspectives are both meaningful for this task and complementary to each other, and that externally introduced knowledge can further promote the comprehension of videos. We propose a novel two-stream framework that models video representations from multiple perspectives, i.e., temporal and non-temporal, and integrates the two perspectives in an end-to-end manner via self-distillation. In addition, we design a knowledge-enhanced feature-fusion and label-prediction method that naturally introduces knowledge into the task of video scene recognition. Experiments on a real-world dataset demonstrate the effectiveness of the proposed method.
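The abstract does not spell out the self-distillation objective that couples the temporal and non-temporal streams. As a rough, hypothetical sketch only (the stream names, the temperature value, and the symmetric KL formulation are assumptions, not details from the paper), self-distillation between two classification streams is commonly realized as a KL divergence between temperature-softened predictions, so that each stream learns to mimic the other:

```python
import numpy as np

def softmax(logits, temperature=1.0):
    """Temperature-softened softmax over a 1-D logit vector."""
    z = np.asarray(logits, dtype=float) / temperature
    e = np.exp(z - z.max())  # subtract max for numerical stability
    return e / e.sum()

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions (both strictly positive)."""
    return float(np.sum(p * np.log(p / q)))

# Toy per-scene-class logits from two hypothetical streams.
temporal_logits = [2.0, 0.5, -1.0]   # temporal (video-level) stream
frame_logits    = [1.5, 1.0, -0.5]   # non-temporal (single-frame) stream

T = 2.0  # softening temperature (assumed hyperparameter)
p_temporal = softmax(temporal_logits, T)
p_frame = softmax(frame_logits, T)

# Symmetric self-distillation loss: each stream distills into the other.
distill_loss = 0.5 * (kl_divergence(p_temporal, p_frame)
                      + kl_divergence(p_frame, p_temporal))
```

In a full training setup this term would typically be added to the standard classification losses of both streams; the sketch above only illustrates the distillation coupling itself, not the paper's actual architecture or loss weighting.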

