Papers
Topics
Authors
Recent
Gemini 2.5 Flash
Gemini 2.5 Flash
119 tokens/sec
GPT-4o
56 tokens/sec
Gemini 2.5 Pro Pro
43 tokens/sec
o3 Pro
6 tokens/sec
GPT-4.1 Pro
47 tokens/sec
DeepSeek R1 via Azure Pro
28 tokens/sec
2000 character limit reached

Learning Efficient Video Representation with Video Shuffle Networks (1911.11319v1)

Published 26 Nov 2019 in cs.CV

Abstract: 3D CNN shows its strong ability in learning spatiotemporal representation in recent video recognition tasks. However, inflating 2D convolution to 3D inevitably introduces additional computational costs, making it cumbersome in practical deployment. We consider whether there is a way to equip the conventional 2D convolution with temporal vision no requiring expanding its kernel. To this end, we propose the video shuffle, a parameter-free plug-in component that efficiently reallocates the inputs of 2D convolution so that its receptive field can be extended to the temporal dimension. In practical, video shuffle firstly divides each frame feature into multiple groups and then aggregate the grouped features via temporal shuffle operation. This allows the following 2D convolution aggregate the global spatiotemporal features. The proposed video shuffle can be flexibly inserted into popular 2D CNNs, forming the Video Shuffle Networks (VSN). With a simple yet efficient implementation, VSN performs surprisingly well on temporal modeling benchmarks. In experiments, VSN not only gains non-trivial improvements on Kinetics and Moments in Time, but also achieves state-of-the-art performance on Something-Something-V1, Something-Something-V2 datasets.

User Edit Pencil Streamline Icon: https://streamlinehq.com
Authors (4)
  1. Pingchuan Ma (91 papers)
  2. Yao Zhou (72 papers)
  3. Yu Lu (146 papers)
  4. Wei Zhang (1489 papers)
Citations (6)

Summary

We haven't generated a summary for this paper yet.