Depth-aware Test-Time Training for Zero-shot Video Object Segmentation (2403.04258v1)

Published 7 Mar 2024 in cs.CV

Abstract: Zero-shot Video Object Segmentation (ZSVOS) aims to segment the primary moving object without any human annotations. Mainstream solutions mainly learn a single model on large-scale video datasets, and such models struggle to generalize to unseen videos. In this work, we introduce a test-time training (TTT) strategy to address the problem. Our key insight is to require the model to predict consistent depth during the TTT process. In detail, we first train a single network to perform both segmentation and depth prediction. This can be learned effectively with our specifically designed depth modulation layer. Then, during TTT, the model is updated by predicting consistent depth maps for the same frame under different data augmentations. In addition, we explore different TTT weight updating strategies. Our empirical results suggest that momentum-based weight initialization and a looping-based training scheme lead to more stable improvements. Experiments show that the proposed method achieves clear improvements on ZSVOS and that our video TTT strategy significantly outperforms state-of-the-art TTT methods. Our code is available at: https://nifangbaage.github.io/DATTT.
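The TTT loop described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation. It assumes a two-parameter linear "depth head", a brightness-gain augmentation, plain gradient descent, and one plausible reading of "momentum-based weight initialization" (each frame starts from a moving average of previously adapted weights) and of "looping-based training" (several passes over the video). All function names and hyperparameters here are hypothetical.

```python
import random

def predict_depth(w, frame):
    """Toy per-pixel depth head (assumption): depth_i = w[0] * x_i + w[1]."""
    return [w[0] * x + w[1] for x in frame]

def augment(frame, gain):
    """Toy photometric augmentation (assumption): scale intensities by `gain`."""
    return [gain * x for x in frame]

def consistency_loss_and_grad(w, frame, gain_a, gain_b):
    """MSE between depth maps predicted from two augmented views of one frame."""
    da = predict_depth(w, augment(frame, gain_a))
    db = predict_depth(w, augment(frame, gain_b))
    n = len(frame)
    loss = sum((p - q) ** 2 for p, q in zip(da, db)) / n
    # da_i - db_i = w[0] * (gain_a - gain_b) * x_i, so the bias term cancels.
    g0 = sum(2 * (p - q) * (gain_a - gain_b) * x
             for p, q, x in zip(da, db, frame)) / n
    return loss, [g0, 0.0]

def adapt_video(w_trained, video, loops=2, steps=3, lr=0.05, momentum=0.9, seed=0):
    """Return per-frame adapted weights from the final pass over the video."""
    rng = random.Random(seed)
    w_init = list(w_trained)          # starting point carried across frames
    adapted = []
    for _ in range(loops):            # looping: revisit the whole video
        adapted = []
        for frame in video:
            w = list(w_init)
            for _ in range(steps):    # a few gradient steps per frame
                ga, gb = rng.uniform(0.8, 1.2), rng.uniform(0.8, 1.2)
                _, grad = consistency_loss_and_grad(w, frame, ga, gb)
                w = [wi - lr * gi for wi, gi in zip(w, grad)]
            adapted.append(w)
            # Momentum initialization (assumed form): blend the adapted
            # weights slowly into the next frame's starting point.
            w_init = [momentum * a + (1 - momentum) * b
                      for a, b in zip(w_init, w)]
    return adapted

if __name__ == "__main__":
    video = [[0.1, 0.5, 0.9], [0.2, 0.4, 0.8]]   # two tiny "frames"
    weights = adapt_video([1.0, 0.3], video)
    print(weights)
```

In this toy setting the consistency loss alone shrinks the gain-sensitive weight toward zero, which is a degenerate solution. The abstract's joint training of segmentation and depth in one network is what gives the adapted features a use beyond satisfying the consistency objective.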
