VideoMAC: Video Masked Autoencoders Meet ConvNets (2402.19082v1)

Published 29 Feb 2024 in cs.CV

Abstract: Recently, the advancement of self-supervised learning techniques, such as masked autoencoders (MAE), has greatly influenced visual representation learning for images and videos. Nevertheless, the predominant approaches to masked image/video modeling rely excessively on resource-intensive vision transformers (ViTs) as the feature encoder. In this paper, we propose a new approach, termed VideoMAC, which combines video masked autoencoders with resource-friendly ConvNets. Specifically, VideoMAC employs symmetric masking on randomly sampled pairs of video frames. To prevent the issue of mask pattern dissipation, we use ConvNets implemented with sparse convolutional operators as encoders. Simultaneously, we present a simple yet effective masked video modeling (MVM) approach: a dual-encoder architecture comprising an online encoder and an exponential moving average (EMA) target encoder, aimed at enforcing inter-frame reconstruction consistency in videos. Additionally, we demonstrate that VideoMAC, which empowers classical (ResNet) and modern (ConvNeXt) convolutional encoders to harness the benefits of MVM, outperforms ViT-based approaches on downstream tasks, including video object segmentation (+5.2% / 6.4% J&F), body part propagation (+6.3% / 3.1% mIoU), and human pose tracking (+10.2% / 11.1% PCK@0.1).
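The abstract outlines the core training recipe: a shared (symmetric) mask over a sampled frame pair, a mask-aware ConvNet encoder, and an online/EMA-target encoder pair whose reconstructions are pulled toward each other. The sketch below illustrates that loop in PyTorch under stated assumptions: all module names, the patch size, masking ratio, momentum, and loss weighting are hypothetical, and dense convolutions applied to zeroed-out masked regions stand in for the paper's sparse convolutional operators. It is a minimal illustration of the idea, not the authors' implementation.

```python
# Hypothetical sketch of VideoMAC-style dual-encoder training; names and
# hyperparameters are assumptions, not the authors' code. Sparse convolutions
# are approximated by zeroing masked regions before a dense ConvNet encoder.
import torch
import torch.nn as nn
import torch.nn.functional as F

PATCH = 32  # mask granularity in pixels (assumed)

def symmetric_mask(b, h, w, ratio=0.75, device="cpu"):
    """One random patch mask shared by both frames of a pair (1 = visible)."""
    gh, gw = h // PATCH, w // PATCH
    keep = (torch.rand(b, 1, gh, gw, device=device) > ratio).float()
    return F.interpolate(keep, size=(h, w), mode="nearest")

class TinyConvEncoderDecoder(nn.Module):
    """Stand-in for the ConvNet encoder plus a lightweight pixel decoder."""
    def __init__(self, dim=64):
        super().__init__()
        self.enc = nn.Sequential(
            nn.Conv2d(3, dim, 4, stride=4), nn.GELU(),
            nn.Conv2d(dim, dim, 3, padding=1), nn.GELU(),
        )
        self.dec = nn.Sequential(
            nn.Conv2d(dim, dim, 3, padding=1), nn.GELU(),
            nn.ConvTranspose2d(dim, 3, 4, stride=4),
        )
    def forward(self, x):
        return self.dec(self.enc(x))

@torch.no_grad()
def ema_update(target, online, m=0.996):
    """Exponential moving average of online weights into the target encoder."""
    for pt, po in zip(target.parameters(), online.parameters()):
        pt.mul_(m).add_(po, alpha=1 - m)

online = TinyConvEncoderDecoder()
target = TinyConvEncoderDecoder()
target.load_state_dict(online.state_dict())
for p in target.parameters():
    p.requires_grad_(False)

frame_a = torch.randn(2, 3, 224, 224)  # randomly sampled frame pair
frame_b = torch.randn(2, 3, 224, 224)
mask = symmetric_mask(2, 224, 224)     # same mask applied to both frames

rec_a = online(frame_a * mask)         # online branch reconstructs frame a
with torch.no_grad():
    rec_b = target(frame_b * mask)     # EMA branch reconstructs frame b

masked = 1 - mask                      # score only the masked pixels
denom = masked.sum() * rec_a.shape[1] + 1e-8
loss_rec = ((rec_a - frame_a) ** 2 * masked).sum() / denom   # reconstruction
loss_cons = ((rec_a - rec_b) ** 2 * masked).sum() / denom    # inter-frame consistency
loss = loss_rec + loss_cons
loss.backward()
ema_update(target, online)
```

Sharing one mask across both frames means the two branches must reconstruct corresponding spatial regions, so the consistency term compares like-for-like content across time rather than arbitrary patches.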
