Learning Video Representations without Natural Videos (2410.24213v2)
Abstract: We show that useful video representations can be learned from synthetic videos and natural images, without incorporating natural videos in the training. We propose a progression of video datasets synthesized by simple generative processes that model a growing set of natural video properties (e.g., motion, acceleration, and shape transformations). The downstream performance of video models pre-trained on these generated datasets gradually increases with the dataset progression. A VideoMAE model pre-trained on our synthetic videos closes 97.2% of the performance gap on UCF101 action classification between training from scratch and self-supervised pre-training from natural videos, and outperforms the pre-trained model on HMDB51. Introducing crops of static images into the pre-training stage yields performance similar to UCF101 pre-training and outperforms the UCF101 pre-trained model on 11 out of 14 out-of-distribution datasets of UCF101-P. Analyzing the low-level properties of the datasets, we identify correlations between frame diversity, frame similarity to natural data, and downstream performance. Our approach provides a more controllable and transparent alternative to video data curation processes for pre-training.
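To make the idea of "simple generative processes" concrete, here is a minimal sketch (assuming only NumPy; it is not the authors' dataset code) of a synthetic clip generator in the spirit the abstract describes: geometric shapes whose positions evolve under random velocity and acceleration, producing short videos with motion statistics but no natural content. Frame size, shape count, and dynamics ranges are illustrative assumptions.

```python
# Minimal sketch of a synthetic-video generative process: circles moving
# with constant acceleration. Illustrative only; not the paper's pipeline.
import numpy as np

def synth_clip(num_frames=16, size=64, num_shapes=3, seed=0):
    """Return a (num_frames, size, size) float32 clip of moving circles."""
    rng = np.random.default_rng(seed)
    pos = rng.uniform(10, size - 10, (num_shapes, 2))   # initial positions
    vel = rng.uniform(-2.0, 2.0, (num_shapes, 2))       # initial velocities
    acc = rng.uniform(-0.3, 0.3, (num_shapes, 2))       # constant accelerations
    radius = rng.uniform(4, 10, num_shapes)
    yy, xx = np.mgrid[0:size, 0:size]
    clip = np.zeros((num_frames, size, size), dtype=np.float32)
    for t in range(num_frames):
        frame = np.zeros((size, size), dtype=np.float32)
        for s in range(num_shapes):
            dist2 = (yy - pos[s, 0]) ** 2 + (xx - pos[s, 1]) ** 2
            frame = np.maximum(frame, (dist2 <= radius[s] ** 2).astype(np.float32))
        clip[t] = frame
        vel += acc   # acceleration updates velocity ...
        pos += vel   # ... which moves the shapes each frame
    return clip

clip = synth_clip()
print(clip.shape)  # (16, 64, 64)
```

A clip generator like this could be plugged into a standard masked-autoencoder pre-training loop (e.g., VideoMAE) in place of natural-video clips; the paper's progression would correspond to enabling richer properties (textures, shape transformations, image crops) on top of such a base process.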