
ViC-MAE: Self-Supervised Representation Learning from Images and Video with Contrastive Masked Autoencoders (2303.12001v3)

Published 21 Mar 2023 in cs.CV

Abstract: We propose ViC-MAE, a model that combines Masked AutoEncoders (MAE) and contrastive learning. ViC-MAE is trained using a global feature obtained by pooling the local representations learned under an MAE reconstruction loss, and leveraging this representation under a contrastive objective across images and video frames. We show that visual representations learned under ViC-MAE generalize well to both video and image classification tasks. In particular, ViC-MAE obtains state-of-the-art transfer-learning performance from video to images on ImageNet-1k compared to the recently proposed OmniMAE, achieving a top-1 accuracy of 86% (+1.3% absolute improvement) when trained on the same data and 87.1% (+2.4% absolute improvement) when trained on extra data. At the same time, ViC-MAE outperforms most other methods on video benchmarks, obtaining 75.9% top-1 accuracy on the challenging Something-Something-v2 video benchmark. When training on videos and images from a diverse combination of datasets, our method maintains a balanced transfer-learning performance between video and image classification benchmarks, coming in a close second only to the best supervised method.
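
The training recipe described above (mask and reconstruct local patches, pool them into a global feature, then contrast that feature across frames of the same clip) can be illustrated with a short sketch. The PyTorch snippet below is a hypothetical illustration, not the authors' implementation: `encoder`, `decoder`, and `predictor` are assumed interfaces, mean pooling stands in for the paper's pooling operator, and a standard InfoNCE loss stands in for the contrastive objective.

```python
# Hypothetical sketch of a ViC-MAE-style training step (illustrative only).
# `encoder` is assumed to be an MAE backbone that masks patches and returns
# (visible-token features, mask info, pixel targets); `decoder` reconstructs
# the masked patches; `predictor` is a small projection head.
import torch
import torch.nn.functional as F

def vic_mae_step(encoder, decoder, predictor, frame_a, frame_b,
                 mask_ratio=0.75, tau=0.1):
    # Encode two frames sampled from the same clip (for a still image,
    # frame_a and frame_b can be two augmented views of that image).
    tokens_a, mask_a, targets_a = encoder(frame_a, mask_ratio)  # (B, N, D)
    tokens_b, mask_b, targets_b = encoder(frame_b, mask_ratio)

    # MAE branch: reconstruct masked patches from the visible tokens.
    loss_recon = (F.mse_loss(decoder(tokens_a, mask_a), targets_a) +
                  F.mse_loss(decoder(tokens_b, mask_b), targets_b))

    # Contrastive branch: pool local tokens into one global feature per view.
    z_a = F.normalize(predictor(tokens_a.mean(dim=1)), dim=-1)  # (B, D')
    z_b = F.normalize(predictor(tokens_b.mean(dim=1)), dim=-1)

    # InfoNCE over the batch: feature pairs from the same clip are positives,
    # all other clips in the batch serve as negatives.
    logits = z_a @ z_b.t() / tau                      # (B, B) similarities
    labels = torch.arange(z_a.size(0), device=z_a.device)
    loss_contrastive = F.cross_entropy(logits, labels)

    return loss_recon + loss_contrastive
```

For image-only batches, the two "frames" reduce to two augmented views of the same image, which recovers a standard image-level contrastive setup on top of the MAE reconstruction objective.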

References (82)
  1. Learning to see by moving. In Proceedings of the IEEE international conference on computer vision, pages 37–45, 2015.
  2. ViViT: A video vision transformer. In Proceedings of the IEEE/CVF international conference on computer vision, pages 6836–6846, 2021.
  3. Masked siamese networks for label-efficient learning. In European Conference on Computer Vision, pages 456–473. Springer, 2022.
  4. BEiT: BERT pre-training of image transformers. In International Conference on Learning Representations, 2021.
  5. VICReg: Variance-invariance-covariance regularization for self-supervised learning. In International Conference on Learning Representations, 2022.
  6. Birdsnap: Large-scale fine-grained visual categorization of birds. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2011–2018, 2014.
  7. Is space-time attention all you need for video understanding? In International Conference on Machine Learning, pages 813–824. PMLR, 2021.
  8. Food-101–mining discriminative components with random forests. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part VI 13, pages 446–461. Springer, 2014.
  9. Emerging properties in self-supervised vision transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9650–9660, 2021.
  10. Generative pretraining from pixels. In International conference on machine learning, pages 1691–1703. PMLR, 2020a.
  11. A simple framework for contrastive learning of visual representations. In International conference on machine learning, pages 1597–1607. PMLR, 2020b.
  12. Big self-supervised models are strong semi-supervised learners. Advances in neural information processing systems, 33:22243–22255, 2020c.
  13. Exploring simple siamese representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15750–15758, 2021.
  14. Improved baselines with momentum contrastive learning. arXiv preprint arXiv:2003.04297, 2020d.
  15. An empirical study of training self-supervised vision transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9640–9649, 2021.
  16. Describing textures in the wild. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3606–3613, 2014.
  17. ELECTRA: Pre-training text encoders as discriminators rather than generators. In International Conference on Learning Representations, 2019.
  18. RandAugment: Practical automated data augmentation with a reduced search space. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition workshops, pages 702–703, 2020.
  19. Scaling vision transformers to 22 billion parameters. In International Conference on Machine Learning, pages 7480–7512. PMLR, 2023.
  20. ImageNet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pages 248–255. IEEE, 2009.
  21. DynamoNet: Dynamic action and motion network. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 6192–6201, 2019.
  22. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations, 2020.
  23. Multiscale vision transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 6824–6835, 2021.
  24. Learning generative visual models from few training examples: An incremental bayesian approach tested on 101 object categories. In 2004 conference on computer vision and pattern recognition workshop, pages 178–178. IEEE, 2004.
  25. A large-scale study on unsupervised spatiotemporal representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3299–3309, 2021.
  26. Masked autoencoders as spatiotemporal learners. Neural Information Processing Systems (NeurIPS), 2022.
  27. Omnivore: A single model for many visual modalities. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16102–16112, 2022.
  28. Imagebind: One embedding space to bind them all. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15180–15190, 2023a.
  29. OmniMAE: Single model masked pretraining on images and videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10406–10417, 2023b.
  30. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the thirteenth international conference on artificial intelligence and statistics, pages 249–256. JMLR Workshop and Conference Proceedings, 2010.
  31. Watching the world go by: Representation learning from unlabeled videos. arXiv preprint arXiv:2003.07990, 2020.
  32. Accurate, large minibatch sgd: Training imagenet in 1 hour. arXiv preprint arXiv:1706.02677, 2017a.
  33. The "Something Something" video database for learning and evaluating visual common sense. In Proceedings of the IEEE international conference on computer vision, pages 5842–5850, 2017b.
  34. Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9729–9738, 2020.
  35. Masked autoencoders are scalable vision learners. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16000–16009, 2022.
  36. Deep networks with stochastic depth. In Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11–14, 2016, Proceedings, Part IV 14, pages 646–661. Springer, 2016.
  37. Contrastive masked autoencoders are stronger vision learners. arXiv preprint arXiv:2207.13532, 2022.
  38. The Kinetics human action video dataset. arXiv preprint arXiv:1705.06950, 2017.
  39. 3d object representations for fine-grained categorization. In Proceedings of the IEEE international conference on computer vision workshops, pages 554–561, 2013.
  40. Learning multiple layers of features from tiny images. Technical report, University of Toronto, Toronto, Ontario, 2009.
  41. FFCV: Accelerating training by removing data bottlenecks. In Computer Vision and Pattern Recognition (CVPR), 2023. https://github.com/libffcv/ffcv/. commit 45f1274.
  42. Contrastive tuning: A little help to make masked autoencoders forget. arXiv preprint arXiv:2304.10520, 2023.
  43. UniFormerV2: Spatiotemporal learning by arming image ViTs with video uniformer. arXiv preprint arXiv:2211.09552, 2022a.
  44. Unmasked teacher: Towards training-efficient video foundation models. arXiv preprint arXiv:2303.16058, 2023.
  45. MViTv2: Improved multiscale vision transformers for classification and detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4804–4814, 2022b.
  46. PolyViT: Co-training vision transformers on images, videos and audio. arXiv preprint arXiv:2111.12993, 2021.
  47. Swin transformer v2: Scaling up capacity and resolution. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 12009–12019, 2022a.
  48. Video swin transformer. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 3202–3211, 2022b.
  49. SGDR: Stochastic gradient descent with warm restarts. In International Conference on Learning Representations, 2016.
  50. Decoupled weight decay regularization. In International Conference on Learning Representations, 2018.
  51. Deep predictive coding networks for video prediction and unsupervised learning. In International Conference on Learning Representations, 2016.
  52. CMAE-V: Contrastive masked autoencoders for video action recognition. arXiv preprint arXiv:2301.06018, 2023.
  53. Fine-grained visual classification of aircraft. arXiv preprint arXiv:1306.5151, 2013.
  54. Deep multi-scale video prediction beyond mean square error. In 4th International Conference on Learning Representations, ICLR 2016, 2016.
  55. A simple, efficient and scalable contrastive masked autoencoder for learning visual representations. arXiv preprint arXiv:2210.16870, 2022.
  56. Moments in time dataset: One million videos for event understanding. IEEE Transactions on Pattern Analysis and Machine Intelligence, 42(2):502–508, 2020.
  57. Automated flower classification over a large number of classes. In 2008 Sixth Indian Conference on Computer Vision, Graphics & Image Processing, pages 722–729. IEEE, 2008.
  58. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018.
  59. DINOv2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023.
  60. Cats and dogs. In 2012 IEEE conference on computer vision and pattern recognition, pages 3498–3505. IEEE, 2012.
  61. Self-supervised video pretraining yields strong image representations. arXiv preprint arXiv:2210.06433, 2022.
  62. Learning features by watching objects move. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2701–2710, 2017.
  63. Rethinking video ViTs: Sparse video tubes for joint image and video learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2214–2224, 2023.
  64. Spatiotemporal contrastive video representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6964–6974, 2021.
  65. Fine-tuning cnn image retrieval with no human annotation. IEEE transactions on pattern analysis and machine intelligence, 41(7):1655–1668, 2018.
  66. Unsupervised learning of video representations using lstms. In International conference on machine learning, pages 843–852. PMLR, 2015.
  67. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2818–2826, 2016.
  68. VideoMAE: Masked autoencoders are data-efficient learners for self-supervised video pre-training. In Neural Information Processing Systems (NeurIPS), 2022.
  69. DeiT III: Revenge of the ViT. In European Conference on Computer Vision, pages 516–533. Springer, 2022.
  70. Anticipating visual representations from unlabeled video. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 98–106, 2016.
  71. An uncertain future: Forecasting from static images using variational autoencoders. In Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11–14, 2016, Proceedings, Part VII 14, pages 835–851. Springer, 2016.
  72. BEVT: BERT pretraining of video transformers. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 14733–14743, 2022.
  73. Masked video distillation: Rethinking masked feature modeling for self-supervised video representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6312–6322, 2023.
  74. Unsupervised learning of visual representations using videos. In International Conference on Computer Vision (ICCV), 2015.
  75. Learning correspondence from the cycle-consistency of time. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2566–2576, 2019.
  76. Masked feature prediction for self-supervised visual pre-training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14668–14678, 2022.
  77. Contrastive learning of image representations with cross-video cycle-consistency. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 10149–10159, 2021.
  78. Sun database: Large-scale scene recognition from abbey to zoo. In 2010 IEEE computer society conference on computer vision and pattern recognition, pages 3485–3492. IEEE, 2010.
  79. Rethinking self-supervised correspondence learning: A video frame-level similarity perspective. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 10075–10085, 2021.
  80. Multiview transformers for video recognition. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 3333–3343, 2022.
  81. mixup: Beyond empirical risk minimization. In International Conference on Learning Representations, 2018.
  82. Places: A 10 million image database for scene recognition. IEEE transactions on pattern analysis and machine intelligence, 40(6):1452–1464, 2017.