Representation Recycling for Streaming Video Analysis (2204.13492v4)
Abstract: We present StreamDEQ, a method that infers frame-wise representations on videos with minimal per-frame computation. In the absence of ad-hoc solutions, conventional deep networks extract features from scratch at every frame. We instead aim to build streaming recognition models that natively exploit the temporal smoothness between consecutive video frames. We observe that the recently emerging implicit layer models provide a convenient foundation for such models, as they define representations as the fixed points of shallow networks, which must be estimated with iterative methods. Our main insight is to distribute these inference iterations over the temporal axis by using the most recent representation as the starting point at each frame. This scheme effectively recycles recent inference computations and greatly reduces the required processing time. Through extensive experimental analysis, we show that StreamDEQ recovers near-optimal representations within a few frames and maintains an up-to-date representation throughout the video. Our experiments on video semantic segmentation, video object detection, and human pose estimation in videos show that StreamDEQ achieves on-par accuracy with the baseline while being more than 2-4x faster.
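The sketch below illustrates the recycling idea described above: each frame's representation is obtained by running a few fixed-point iterations of an implicit layer, initialized from the previous frame's representation rather than from scratch. The toy layer, the plain fixed-point solver, and the synthetic frame stream are illustrative assumptions only; they do not reproduce the paper's multiscale DEQ architecture or its Broyden-based solver.

```python
# Minimal PyTorch sketch of warm-started fixed-point inference over a video stream.
# All module names and solver settings here are illustrative assumptions.
import torch
import torch.nn as nn


class ToyImplicitLayer(nn.Module):
    """A shallow layer f(z, x) whose fixed point z* = f(z*, x) defines the representation."""

    def __init__(self, channels: int):
        super().__init__()
        self.conv_z = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv_x = nn.Conv2d(channels, channels, 3, padding=1)
        self.norm = nn.GroupNorm(8, channels)

    def forward(self, z: torch.Tensor, x: torch.Tensor) -> torch.Tensor:
        return self.norm(torch.tanh(self.conv_z(z) + self.conv_x(x)))


@torch.no_grad()
def fixed_point_iterations(f, z, x, num_iters: int) -> torch.Tensor:
    """Run a fixed budget of solver steps z <- f(z, x) (simple iteration for illustration)."""
    for _ in range(num_iters):
        z = f(z, x)
    return z


@torch.no_grad()
def stream_inference(f, frames, iters_per_frame: int = 4):
    """Recycle the previous frame's representation as the initial point for the next frame."""
    z = None
    outputs = []
    for x in frames:
        if z is None:
            z = torch.zeros_like(x)  # first frame: no previous representation to recycle
        z = fixed_point_iterations(f, z, x, iters_per_frame)
        outputs.append(z)
    return outputs


if __name__ == "__main__":
    channels, height, width = 32, 16, 16
    f = ToyImplicitLayer(channels)
    # A toy stream of slowly varying frames to mimic temporal smoothness.
    base = torch.randn(1, channels, height, width)
    frames = [base + 0.05 * t * torch.randn_like(base) for t in range(8)]
    reps = stream_inference(f, frames, iters_per_frame=4)
    print([r.shape for r in reps])
```

Because consecutive frames change slowly, the previous fixed point is usually close to the current one, so a small per-frame iteration budget suffices instead of a full solve from a cold start.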