Efficient Movie Scene Detection using State-Space Transformers (2212.14427v2)
Abstract: The ability to distinguish between different movie scenes is critical for understanding the storyline of a movie. However, accurately detecting movie scenes is often challenging, as it requires reasoning over very long movie segments. This is in contrast to most existing video recognition models, which are typically designed for short-range video analysis. This work proposes a State-Space Transformer model that can efficiently capture dependencies in long movie videos for accurate movie scene detection. Our model, dubbed TranS4mer, is built using a novel S4A building block, which combines the strengths of structured state-space sequence (S4) and self-attention (A) layers. Given a sequence of frames divided into movie shots (uninterrupted periods during which the camera position does not change), the S4A block first applies self-attention to capture short-range intra-shot dependencies. Afterward, the state-space operation in the S4A block aggregates long-range inter-shot cues. The final TranS4mer model, which can be trained end-to-end, is obtained by stacking multiple S4A blocks one after the other. Our proposed TranS4mer outperforms all prior methods on three movie scene detection datasets (MovieNet, BBC, and OVSD), while also being $2\times$ faster and requiring $3\times$ less GPU memory than standard Transformer models. We will release our code and models.
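The abstract describes a two-stage mixing pattern inside each S4A block: self-attention restricted to tokens within a shot, followed by a state-space pass over the whole sequence. Below is a minimal PyTorch sketch of that structure, not the authors' implementation: the `DiagonalSSM` here is a toy sequential state-space recurrence standing in for the structured S4 layer, and all names (`S4ABlock`, `DiagonalSSM`, `shot_len`) are illustrative assumptions.

```python
import torch
import torch.nn as nn


class DiagonalSSM(nn.Module):
    """Toy diagonal state-space layer (sequential scan):
    h_t = lam * h_{t-1} + x_t @ B,   y_t = h_t @ C.
    A simplified stand-in for the structured S4 layer used by TranS4mer."""

    def __init__(self, dim: int, state_dim: int = 16):
        super().__init__()
        self.log_neg_a = nn.Parameter(torch.zeros(state_dim))
        self.B = nn.Parameter(torch.randn(dim, state_dim) / dim ** 0.5)
        self.C = nn.Parameter(torch.randn(state_dim, dim) / state_dim ** 0.5)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (batch, seq, dim)
        lam = torch.exp(-torch.exp(self.log_neg_a))  # decay in (0, 1): stable
        h = x.new_zeros(x.size(0), self.log_neg_a.size(0))
        ys = []
        for t in range(x.size(1)):
            h = lam * h + x[:, t] @ self.B  # update the hidden state
            ys.append(h @ self.C)           # project back to model dim
        return torch.stack(ys, dim=1)


class S4ABlock(nn.Module):
    """One S4A block: intra-shot self-attention, then inter-shot
    state-space mixing, each with a pre-norm residual connection."""

    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.ssm = DiagonalSSM(dim)

    def forward(self, x: torch.Tensor, shot_len: int) -> torch.Tensor:
        b, n, d = x.shape  # n = num_shots * shot_len
        # Short-range: fold shots into the batch dimension so attention
        # only sees tokens from the same shot.
        xs = self.norm1(x).reshape(b * n // shot_len, shot_len, d)
        attn_out, _ = self.attn(xs, xs, xs, need_weights=False)
        x = x + attn_out.reshape(b, n, d)
        # Long-range: state-space pass over the full shot sequence.
        return x + self.ssm(self.norm2(x))


# Example: 2 clips, 8 shots of 16 frame tokens each, 128-dim features,
# with two stacked S4A blocks (the paper stacks many more).
tokens = torch.randn(2, 8 * 16, 128)
blocks = nn.ModuleList([S4ABlock(dim=128) for _ in range(2)])
for blk in blocks:
    tokens = blk(tokens, shot_len=16)
print(tokens.shape)  # torch.Size([2, 128, 128])
```

Folding shots into the batch dimension is one simple way to keep attention strictly intra-shot; the state-space pass then runs once over the flattened shot sequence, which is where the claimed speed and memory savings over full-sequence self-attention would come from.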