View while Moving: Efficient Video Recognition in Long-untrimmed Videos (2308.04834v2)
Abstract: Recent adaptive methods for efficient video recognition mostly follow the two-stage paradigm of "preview-then-recognition" and have achieved great success on multiple video benchmarks. However, this two-stage paradigm requires visiting the raw frames twice during inference, from coarse-grained to fine-grained, which cannot be parallelized; moreover, the spatiotemporal features captured in the first stage cannot be reused in the second stage due to their differing granularity. Both properties hinder efficiency and computation optimization. To this end, inspired by human cognition, we propose a novel recognition paradigm of "View while Moving" for efficient long-untrimmed video recognition. In contrast to the two-stage paradigm, our paradigm accesses the raw frames only once: the two phases of coarse-grained sampling and fine-grained recognition are combined into unified spatiotemporal modeling, yielding strong performance. Moreover, we investigate the properties of semantic units in video and propose a hierarchical mechanism to efficiently capture and reason about unit-level and video-level temporal semantics in long-untrimmed videos, respectively. Extensive experiments on both long-untrimmed and short-trimmed videos demonstrate that our approach outperforms state-of-the-art methods in both accuracy and efficiency, yielding new efficiency-accuracy trade-offs for video spatiotemporal modeling.
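The core difference between the two paradigms can be illustrated with a minimal sketch. The code below is purely conceptual and not from the paper's implementation: all function names (`preview`, `recognize`, `backbone`, `keep`) are hypothetical stand-ins, and features are modeled as plain numbers. It shows why the two-stage approach visits frames twice and recomputes features, while a single-pass loop reuses each frame's features for both the relevance decision and recognition.

```python
# Hypothetical contrast between "preview-then-recognize" and a
# single-pass "view while moving" loop. All names are illustrative;
# features are modeled as floats for simplicity.

def two_stage(frames, preview, recognize, select_k=8):
    # Stage 1: a cheap coarse pass over ALL frames to score saliency.
    scores = [preview(f) for f in frames]                 # first visit
    top = sorted(range(len(frames)), key=lambda i: scores[i])[-select_k:]
    # Stage 2: an expensive fine pass over the selected frames.
    # Coarse features from stage 1 cannot be reused here.
    feats = [recognize(frames[i]) for i in sorted(top)]   # second visit
    return sum(feats) / len(feats)

def view_while_moving(frames, backbone, keep):
    # Single pass: each frame is visited exactly once; its features
    # serve both the coarse keep/skip decision and fine recognition.
    state, n = 0.0, 0
    for f in frames:
        feat = backbone(f)        # the only visit to this frame
        if keep(feat):            # coarse decision reuses feat
            state += feat         # fine recognition reuses feat
            n += 1
    return state / max(n, 1)
```

As a toy usage, `two_stage([1, 2, 3, 4], preview=lambda f: f, recognize=lambda f: f * 2, select_k=2)` scores all four frames and then recomputes features for the top two, whereas `view_while_moving([1, 2, 3, 4], backbone=float, keep=lambda x: x > 2)` touches each frame once and aggregates in the same sweep.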
Authors: Ye Tian, Mengyu Yang, Lanshan Zhang, Zhizhen Zhang, Yang Liu, Xiaohui Xie, Xirong Que, Wendong Wang