Spatial-then-Temporal Self-Supervised Learning for Video Correspondence (2209.07778v5)

Published 16 Sep 2022 in cs.CV

Abstract: In low-level video analysis, effective representations are important for deriving correspondences between video frames. Recent studies have learned such representations in a self-supervised fashion from unlabeled images or videos, using carefully designed pretext tasks. However, previous work concentrates on either spatial-discriminative features or temporal-repetitive features, paying little attention to the synergy between spatial and temporal cues. To address this issue, we propose a spatial-then-temporal self-supervised learning method. Specifically, we first extract spatial features from unlabeled images via contrastive learning, and then enhance these features by exploiting temporal cues in unlabeled videos via reconstructive learning. In the second step, we design a global correlation distillation loss that keeps the learning from forgetting the spatial cues, and a local correlation distillation loss that combats the temporal discontinuity harming the reconstruction. The proposed method outperforms state-of-the-art self-supervised methods, as established by experimental results on a series of correspondence-based video analysis tasks. We also perform ablation studies to verify the effectiveness of the two-step design and the distillation losses.
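
As a rough illustration of the global correlation distillation idea described in the abstract, the following PyTorch-style sketch (not the authors' code; all function and variable names are hypothetical) distills the inter-frame affinity of a student encoder, fine-tuned on temporal reconstruction, towards that of a frozen, spatially pre-trained teacher:

```python
# Minimal sketch of a global correlation distillation loss, assuming PyTorch.
# The teacher features come from a frozen copy of the spatially pre-trained
# encoder; the student features come from the encoder being fine-tuned on
# temporal reconstruction.
import torch
import torch.nn.functional as F

def global_correlation_distillation(feat_s_a, feat_s_b, feat_t_a, feat_t_b, tau=0.07):
    """KL-distill the student's frame-to-frame affinity towards the teacher's.

    Each feature map has shape (B, C, H, W); the affinity relates every
    spatial position of frame a to every spatial position of frame b.
    """
    def affinity(fa, fb):
        fa = F.normalize(fa.flatten(2), dim=1)              # (B, C, HW)
        fb = F.normalize(fb.flatten(2), dim=1)              # (B, C, HW)
        return torch.einsum('bci,bcj->bij', fa, fb) / tau   # (B, HW, HW)

    with torch.no_grad():
        target = F.softmax(affinity(feat_t_a, feat_t_b), dim=-1)
    log_pred = F.log_softmax(affinity(feat_s_a, feat_s_b), dim=-1)
    return F.kl_div(log_pred, target, reduction='batchmean')
```

The actual losses in the paper may differ in normalization, temperature, or how local (windowed) correlations are handled; this sketch only conveys the idea of matching correlation maps between the two encoders.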

Authors (2)
  1. Rui Li (384 papers)
  2. Dong Liu (267 papers)
Citations (11)
