
Unsupervised Modality-Transferable Video Highlight Detection with Representation Activation Sequence Learning (2403.09401v3)

Published 14 Mar 2024 in cs.CV

Abstract: Identifying highlight moments in raw video material is crucial for efficiently editing the videos that pervade internet platforms. However, the extensive effort required to manually label footage makes it hard to apply supervised methods to videos of unseen categories. Many videos also lack an audio modality, which carries valuable cues for highlight detection, making multimodal strategies difficult to use. In this paper, we propose a novel model with cross-modal perception for unsupervised highlight detection. The proposed model learns representations with visual-audio semantics from image-audio pair data via a self-reconstruction task. To achieve unsupervised highlight detection, we investigate the latent representations of the network and propose the representation activation sequence learning (RASL) module with k-point contrastive learning to learn significant representation activations. To connect the visual and audio modalities, we use the symmetric contrastive learning (SCL) module to learn paired visual and audio representations. Furthermore, an auxiliary task of masked feature vector sequence (FVS) reconstruction is conducted simultaneously during pretraining to enhance the representations. During inference, the cross-modal pretrained model can generate representations with paired visual-audio semantics given only the visual modality. The RASL module then outputs the highlight scores. The experimental results show that the proposed framework achieves superior performance compared to other state-of-the-art approaches.
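The abstract names two contrastive objectives: a symmetric contrastive learning (SCL) loss that aligns paired visual and audio representations, and a k-point contrastive loss in the RASL module that makes significant activations stand out. The PyTorch sketch below is a minimal, hedged illustration of how such losses are commonly formulated; the function names, tensor shapes, and the hyperparameters k, margin, and temperature are assumptions for illustration, not the authors' implementation.

```python
# Hedged sketch (not the authors' code): minimal PyTorch formulations of the
# two contrastive objectives described in the abstract, under assumed shapes.
import torch
import torch.nn.functional as F

def symmetric_contrastive_loss(vis, aud, temperature=0.07):
    """CLIP-style symmetric InfoNCE over paired visual/audio embeddings.

    vis, aud: (B, D) embeddings of paired clips (shapes assumed).
    """
    vis = F.normalize(vis, dim=-1)
    aud = F.normalize(aud, dim=-1)
    logits = vis @ aud.t() / temperature              # (B, B) similarity matrix
    targets = torch.arange(vis.size(0), device=vis.device)
    # Pull matched pairs together in both directions, push mismatches apart.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

def k_point_contrastive_loss(scores, k=4, margin=1.0):
    """One plausible reading of k-point contrastive learning in RASL:
    contrast the mean of the k largest activation scores in each sequence
    against the mean of the k smallest, so salient segments stand out.

    scores: (B, T) per-segment activation scores for each video (T >= k).
    """
    topk = scores.topk(k, dim=1).values.mean(dim=1)            # highlight-like
    botk = (-scores).topk(k, dim=1).values.neg().mean(dim=1)   # background
    return F.relu(margin - (topk - botk)).mean()

if __name__ == "__main__":
    vis = torch.randn(8, 128)        # 8 paired clips, 128-d embeddings (assumed)
    aud = torch.randn(8, 128)
    seg_scores = torch.randn(8, 32)  # 32 segments per video (assumed)
    print(symmetric_contrastive_loss(vis, aud).item())
    print(k_point_contrastive_loss(seg_scores).item())
```

In this reading, pretraining would minimise the sum of these losses alongside the reconstruction terms the abstract mentions; at inference only the visual branch is needed, and the per-segment activation scores from the RASL head serve directly as highlight scores.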

Authors (3)
  1. Tingtian Li
  2. Zixun Sun
  3. Xinyu Xiao