CSTA: CNN-based Spatiotemporal Attention for Video Summarization (2405.11905v2)
Abstract: Video summarization aims to generate a concise representation of a video, capturing its essential content and key moments while reducing its overall length. Although several methods employ attention mechanisms to handle long-term dependencies, they often fail to capture the visual significance inherent in frames. To address this limitation, we propose a CNN-based SpatioTemporal Attention (CSTA) method that stacks the frame features of a single video to form an image-like frame representation and applies a 2D CNN to it. Our method relies on the CNN to capture inter- and intra-frame relations and to find crucial attributes in videos by exploiting its ability to learn absolute positions within images. Unlike previous work that sacrifices efficiency by designing additional modules to focus on spatial importance, CSTA requires minimal computational overhead because it uses the CNN as a sliding window. Extensive experiments on two benchmark datasets (SumMe and TVSum) demonstrate that the proposed approach achieves state-of-the-art performance with fewer MACs than previous methods. Code is available at https://github.com/thswodnjs3/CSTA.
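The core idea in the abstract, stacking per-frame features into an image-like grid and sliding a 2D kernel over it, can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: the function names, the fixed 3x3 averaging kernel (a trained model would learn its kernels), the softmax over the frame axis, and the final averaging into one score per frame are all assumptions made for illustration.

```python
import math


def stack_frames(frame_features):
    """Stack T per-frame feature vectors (each of length D) into a T x D grid."""
    dim = len(frame_features[0])
    if any(len(f) != dim for f in frame_features):
        raise ValueError("all frame features must have the same length")
    return [list(f) for f in frame_features]


def conv2d_same(image, kernel):
    """Slide a k x k kernel over a 2D grid with zero padding.

    Uses cross-correlation (no kernel flip), as deep learning frameworks do,
    and keeps the output the same size as the input.
    """
    rows, cols = len(image), len(image[0])
    k = len(kernel)
    pad = k // 2
    out = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            acc = 0.0
            for i in range(k):
                for j in range(k):
                    rr, cc = r + i - pad, c + j - pad
                    if 0 <= rr < rows and 0 <= cc < cols:
                        acc += image[rr][cc] * kernel[i][j]
            out[r][c] = acc
    return out


def softmax_over_frames(grid):
    """Normalize each feature column across the frame (row) axis."""
    rows, cols = len(grid), len(grid[0])
    out = [[0.0] * cols for _ in range(rows)]
    for c in range(cols):
        column = [grid[r][c] for r in range(rows)]
        peak = max(column)  # subtract the max for numerical stability
        exps = [math.exp(v - peak) for v in column]
        total = sum(exps)
        for r in range(rows):
            out[r][c] = exps[r] / total
    return out


def frame_scores(frame_features, kernel):
    """Return one illustrative importance score per frame (an assumed readout)."""
    image = stack_frames(frame_features)
    attention = softmax_over_frames(conv2d_same(image, kernel))
    return [
        sum(a * x for a, x in zip(att_row, img_row)) / len(img_row)
        for att_row, img_row in zip(attention, image)
    ]


# Hypothetical toy input: 4 frames, each with a 3-dimensional feature vector.
features = [
    [0.1, 0.2, 0.3],
    [0.9, 0.8, 0.7],
    [0.2, 0.1, 0.0],
    [0.5, 0.5, 0.5],
]
# Hypothetical fixed 3x3 averaging kernel standing in for learned weights.
kernel = [[1.0 / 9.0] * 3 for _ in range(3)]
scores = frame_scores(features, kernel)
```

Because each 3x3 window covers neighbouring frames (rows) and neighbouring feature dimensions (columns) at once, a single convolution mixes inter-frame and intra-frame information. That is the intuition behind treating the stacked features as an image.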