Self-supervised Video Object Segmentation with Distillation Learning of Deformable Attention (2401.13937v2)
Abstract: Video object segmentation is a fundamental research problem in computer vision. Recent techniques often apply attention mechanisms to learn object representations from video sequences. However, because video data changes over time, attention maps may not align well with the objects of interest across frames, causing errors to accumulate in long-term video processing. In addition, existing techniques rely on complex architectures with high computational cost, which limits the deployment of video object segmentation on low-powered devices. To address these issues, we propose a new method for self-supervised video object segmentation based on distillation learning of deformable attention. Specifically, we devise a lightweight architecture for video object segmentation that adapts effectively to temporal changes. This is enabled by a deformable attention mechanism, in which the keys and values that capture the memory of a video sequence in the attention module have flexible locations that are updated across frames. The learnt object representations are thus adaptive in both the spatial and temporal dimensions. We train the proposed architecture in a self-supervised fashion through a new knowledge distillation paradigm in which deformable attention maps are integrated into the distillation loss. We evaluate our method qualitatively and quantitatively and compare it with existing methods on benchmark datasets including DAVIS 2016/2017 and YouTube-VOS 2018/2019. Experimental results show that our method achieves state-of-the-art performance and optimal memory usage.
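The two ideas named in the abstract, attention over memory features sampled at shifted locations and a distillation loss that includes attention maps, can be illustrated with a small sketch. This is a toy illustration under stated assumptions, not the paper's architecture or loss. The function names, the identity key/value projections, the hand-set offsets (which a model would predict), the temperature `tau`, the weight `lam`, and the KL-plus-MSE form are all hypothetical choices made for this example.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def bilinear_sample(feat, y, x):
    """Sample feat[H][W][C] at a fractional (y, x) location, clamped to the grid."""
    H, W = len(feat), len(feat[0])
    y = min(max(y, 0.0), H - 1.0)
    x = min(max(x, 0.0), W - 1.0)
    y0, x0 = int(math.floor(y)), int(math.floor(x))
    y1, x1 = min(y0 + 1, H - 1), min(x0 + 1, W - 1)
    wy, wx = y - y0, x - x0
    return [
        (1 - wy) * ((1 - wx) * a + wx * b) + wy * ((1 - wx) * c + wx * d)
        for a, b, c, d in zip(feat[y0][x0], feat[y0][x1], feat[y1][x0], feat[y1][x1])
    ]

def deformable_attention(query, memory, ref_pts, offsets):
    """Attend from one query vector to memory features sampled at shifted points.

    Assumption: keys and values are the sampled features themselves
    (no learned projections), and offsets are supplied by the caller.
    """
    kv = [bilinear_sample(memory, ry + dy, rx + dx)
          for (ry, rx), (dy, dx) in zip(ref_pts, offsets)]
    scale = math.sqrt(len(query))
    attn = softmax([sum(q * k for q, k in zip(query, vec)) / scale for vec in kv])
    out = [sum(a * vec[c] for a, vec in zip(attn, kv)) for c in range(len(query))]
    return out, attn

def distillation_loss(s_logits, t_logits, s_attn, t_attn, tau=2.0, lam=1.0):
    """Hypothetical loss: KL(teacher || student) on softened outputs
    plus mean squared error between the two attention maps."""
    p_t = softmax([z / tau for z in t_logits])
    p_s = softmax([z / tau for z in s_logits])
    kl = sum(pt * (math.log(pt) - math.log(ps)) for pt, ps in zip(p_t, p_s))
    mse = sum((a - b) ** 2 for a, b in zip(s_attn, t_attn)) / len(s_attn)
    return kl + lam * mse

# Toy data: a 3x3 memory grid with 2 channels holding (row, column) coordinates.
memory = [[[float(r), float(c)] for c in range(3)] for r in range(3)]
query = [1.0, 0.5]
ref_pts = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)]

# "Teacher" samples at the reference points; "student" uses shifted locations.
teacher_out, teacher_attn = deformable_attention(query, memory, ref_pts, [(0.0, 0.0)] * 3)
student_out, student_attn = deformable_attention(
    query, memory, ref_pts, [(0.5, 0.0), (0.0, 0.5), (-0.5, 0.0)])

# The attended outputs stand in for prediction logits in this toy setup.
loss = distillation_loss(student_out, teacher_out, student_attn, teacher_attn)
print(loss)
```

In this sketch the loss is zero when the student reproduces the teacher's outputs and attention map, and it grows as the sampled locations diverge. That is the intuition behind putting attention maps into the distillation objective.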
Authors: Quang-Trung Truong, Duc Thanh Nguyen, Binh-Son Hua, Sai-Kit Yeung