Refining Pre-Trained Motion Models (2401.00850v2)

Published 1 Jan 2024 in cs.CV and cs.AI

Abstract: Given the difficulty of manually annotating motion in video, the current best motion estimation methods are trained with synthetic data, and therefore struggle somewhat due to a train/test gap. Self-supervised methods hold the promise of training directly on real video, but typically perform worse. These include methods trained with warp error (i.e., color constancy) combined with smoothness terms, and methods that encourage cycle-consistency in the estimates (i.e., tracking backwards should yield the opposite trajectory as tracking forwards). In this work, we take on the challenge of improving state-of-the-art supervised models with self-supervised training. We find that when the initialization is supervised weights, most existing self-supervision techniques actually make performance worse instead of better, which suggests that the benefit of seeing the new data is overshadowed by the noise in the training signal. Focusing on obtaining a "clean" training signal from real-world unlabelled video, we propose to separate label-making and training into two distinct stages. In the first stage, we use the pre-trained model to estimate motion in a video, and then select the subset of motion estimates which we can verify with cycle-consistency. This produces a sparse but accurate pseudo-labelling of the video. In the second stage, we fine-tune the model to reproduce these outputs, while also applying augmentations on the input. We complement this boot-strapping method with simple techniques that densify and re-balance the pseudo-labels, ensuring that we do not merely train on "easy" tracks. We show that our method yields reliable gains over fully-supervised methods in real videos, for both short-term (flow-based) and long-range (multi-frame) pixel tracking.


Summary

  • The paper introduces a two-phase refinement strategy that generates reliable pseudo-labels using cycle-consistency checks.
  • It leverages pre-trained models and self-supervised techniques to fine-tune motion estimation on challenging real video data.
  • Experiments on optical flow and multi-frame tracking reveal consistent accuracy improvements over fully supervised methods.

Overview of the Paper

The paper introduces a framework for refining pre-trained motion models so that they better fit real-world video. The need arises because the strongest current motion models are trained on synthetic data and therefore suffer from a train/test domain gap when applied to real videos. While self-supervised techniques show potential for training directly on real video, they usually underperform their supervised counterparts.

Self-Supervised Challenges in Motion Estimation

Self-supervised motion estimation models have historically performed worse than supervised models because they are trained on less precise signals. These models typically rely on color constancy (the assumption that a pixel's color remains consistent between frames) and on smoothness terms that penalize large differences between the motion of neighbouring pixels. Many also encourage cycle-consistency: tracking a point backwards in time should retrace the trajectory obtained by tracking it forwards.
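As a concrete illustration, the minimal PyTorch sketch below (not code from the paper) shows how these three classic self-supervised signals are typically written down: a photometric warp loss, a first-order smoothness loss, and a forward-backward cycle-consistency loss. The warp helper and all tensor conventions are illustrative assumptions.

```python
import torch
import torch.nn.functional as F


def warp(img, flow):
    """Backward-warp img (B,C,H,W) by sampling it at pixel + flow; flow is (B,2,H,W) in pixels."""
    _, _, h, w = img.shape
    ys, xs = torch.meshgrid(
        torch.arange(h, device=img.device, dtype=img.dtype),
        torch.arange(w, device=img.device, dtype=img.dtype),
        indexing="ij",
    )
    grid_x = xs.unsqueeze(0) + flow[:, 0]  # where each pixel lands horizontally
    grid_y = ys.unsqueeze(0) + flow[:, 1]  # where each pixel lands vertically
    # grid_sample expects coordinates normalized to [-1, 1].
    grid = torch.stack(
        (2.0 * grid_x / (w - 1) - 1.0, 2.0 * grid_y / (h - 1) - 1.0), dim=-1
    )
    return F.grid_sample(img, grid, align_corners=True)


def photometric_loss(img1, img2, flow_fw):
    # Color constancy: img1 should match img2 warped back by the forward flow.
    return (img1 - warp(img2, flow_fw)).abs().mean()


def smoothness_loss(flow):
    # Penalize large differences between the flow of neighbouring pixels.
    dx = (flow[:, :, :, 1:] - flow[:, :, :, :-1]).abs().mean()
    dy = (flow[:, :, 1:, :] - flow[:, :, :-1, :]).abs().mean()
    return dx + dy


def cycle_consistency_loss(flow_fw, flow_bw):
    # Following the forward flow and then the backward flow (sampled at the
    # forward target) should return to the start wherever the match is correct.
    bw_at_target = warp(flow_bw, flow_fw)
    return (flow_fw + bw_at_target).abs().mean()
```

All three losses are noisy on real video (occlusions break color constancy, smoothness over-penalizes motion boundaries), which is precisely why the paper finds they can degrade a well-initialized supervised model.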

A Two-Stage Refinement Strategy

The authors propose a two-stage process that begins with pseudo-label generation. They run the pre-trained model on real videos and keep only the subset of motion estimates that can be verified with cycle-consistency. This produces sparse but accurate pseudo-labels, which serve as estimated ground truth. In the second stage, the model is fine-tuned to reproduce these pseudo-labels while augmentations are applied to the input, so it must replicate its own most reliable estimates under harder viewing conditions. Simple additional techniques densify and re-balance the pseudo-labels so that training is not dominated by "easy" tracks.
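Below is a hedged sketch of the two stages, assuming a generic model(frame_a, frame_b) that returns a dense flow field and reusing the warp helper from the sketch above. The cycle_thresh threshold, the augment callable, and the loss weighting are illustrative assumptions rather than details taken from the paper.

```python
import torch


@torch.no_grad()
def make_pseudo_labels(model, frame_a, frame_b, cycle_thresh=1.0):
    """Stage 1: run the frozen pre-trained model forwards and backwards, and
    keep only the flow vectors that pass a cycle-consistency check."""
    flow_fw = model(frame_a, frame_b)                  # (B,2,H,W)
    flow_bw = model(frame_b, frame_a)
    bw_at_target = warp(flow_bw, flow_fw)              # warp() from the sketch above
    cycle_err = (flow_fw + bw_at_target).norm(dim=1)   # round-trip error, (B,H,W)
    valid = cycle_err < cycle_thresh                   # sparse mask of "clean" labels
    return flow_fw, valid


def finetune_step(model, optimizer, frame_a, frame_b, pseudo_flow, valid, augment):
    """Stage 2: fine-tune the model to reproduce its own verified estimates on
    augmented inputs (photometric jitter here; spatial augmentations would also
    require transforming the pseudo-labels accordingly)."""
    pred = model(augment(frame_a), augment(frame_b))
    per_pixel = (pred - pseudo_flow).abs().sum(dim=1)          # L1 error per pixel
    loss = (per_pixel * valid).sum() / valid.sum().clamp(min=1)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

The paper additionally densifies and re-balances the verified labels before fine-tuning; that step is omitted here for brevity.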

Results and Potential Impact

The methodology yields reliable gains in motion prediction accuracy on real videos. Experiments on both an optical flow model and a multi-frame point tracking model show consistent improvements over the fully-supervised baselines. The authors hope their work will stimulate further exploration of pre-trained motion model refinement.

The paper's findings have clear implications for advancing the state of video motion analysis, potentially contributing to numerous applications in surveillance, autonomous systems, and filmmaking, where accurate motion tracking is essential.
