SAMFlow: Eliminating Any Fragmentation in Optical Flow with Segment Anything Model (2307.16586v4)
Abstract: Optical Flow Estimation aims to find the 2D dense motion field between two frames. Due to the limitations of model structures and training datasets, existing methods often rely too heavily on local clues and ignore the integrity of objects, resulting in fragmented motion estimation. Through theoretical analysis, we find that pre-trained large vision models are helpful for optical flow estimation, and we notice that the recently popular Segment Anything Model (SAM) demonstrates a strong ability to segment complete objects, making it well suited to addressing the fragmentation problem. We therefore propose embedding the frozen SAM image encoder into FlowFormer to enhance object perception. To address the challenge of utilizing SAM in depth for non-segmentation tasks such as optical flow estimation, we propose an Optical Flow Task-Specific Adaption scheme, consisting of a Context Fusion Module that fuses the SAM encoder with the optical flow context encoder, and a Context Adaption Module that adapts the SAM features to the optical flow task with a Learned Task-Specific Embedding. Our proposed SAMFlow model reaches 0.86/2.10 clean/final EPE and 3.55/12.32 EPE/F1-all on the Sintel and KITTI-15 training sets, surpassing FlowFormer by 8.5%/9.9% and 13.2%/16.3%. Furthermore, our model achieves state-of-the-art performance on the Sintel and KITTI-15 benchmarks, ranking #1 among all two-frame methods on the Sintel clean pass.
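The abstract names the Context Fusion Module, the Context Adaption Module, and the Learned Task-Specific Embedding but does not specify their internals. As a rough illustration of the overall wiring, the following is a minimal PyTorch sketch of how a frozen SAM encoder's features might be fused into a flow context encoder; the module bodies, feature dimensions (sam_dim, ctx_dim), and the fusion-by-addition choice are assumptions for illustration, not the paper's implementation.

```python
import torch
import torch.nn as nn

class ContextFusionModule(nn.Module):
    """Hypothetical fusion block: projects frozen-SAM features and adds
    them to the optical-flow context features (details assumed)."""
    def __init__(self, sam_dim=256, ctx_dim=128):
        super().__init__()
        self.proj = nn.Conv2d(sam_dim, ctx_dim, kernel_size=1)

    def forward(self, sam_feat, ctx_feat):
        # Resize SAM features to the context resolution, project, and fuse.
        sam_feat = nn.functional.interpolate(
            sam_feat, size=ctx_feat.shape[-2:],
            mode="bilinear", align_corners=False)
        return ctx_feat + self.proj(sam_feat)

class ContextAdaptionModule(nn.Module):
    """Hypothetical adaption block: a learned task-specific embedding is
    added to the fused features, followed by a light conv refinement."""
    def __init__(self, ctx_dim=128):
        super().__init__()
        self.task_embedding = nn.Parameter(torch.zeros(1, ctx_dim, 1, 1))
        self.refine = nn.Sequential(
            nn.Conv2d(ctx_dim, ctx_dim, 3, padding=1), nn.GELU(),
            nn.Conv2d(ctx_dim, ctx_dim, 3, padding=1))

    def forward(self, fused):
        return fused + self.refine(fused + self.task_embedding)

# Usage sketch: the SAM image encoder stays frozen; only the fusion/adaption
# modules (and the flow network) receive gradients.
# sam_encoder.requires_grad_(False)
# ctx = context_encoder(frame1)        # FlowFormer context features
# sam_feat = sam_encoder(frame1)       # frozen SAM ViT features
# ctx = ContextAdaptionModule()(ContextFusionModule()(sam_feat, ctx))
```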
- A naturalistic open source movie for optical flow evaluation. In Computer Vision–ECCV 2012: 12th European Conference on Computer Vision, Florence, Italy, October 7-13, 2012, Proceedings, Part VI 12, 611–625. Springer.
- A simple framework for contrastive learning of visual representations. In International Conference on Machine Learning, 1597–1607. PMLR.
- Context-Aware Iteration Policy Network for Efficient Optical Flow Estimation. arXiv:2312.07180.
- Rethinking Optical Flow from Geometric Matching Consistent Perspective. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 1337–1347.
- An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929.
- FlowNet: Learning optical flow with convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision, 2758–2766.
- Virtual Worlds as Proxy for Multi-Object Tracking Analysis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
- Flow-edge guided video completion. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XII 16, 713–729. Springer.
- Vision meets robotics: The KITTI dataset. The International Journal of Robotics Research, 32(11): 1231–1237.
- Masked autoencoders are scalable vision learners. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 16000–16009.
- Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 770–778.
- Determining optical flow. Artificial Intelligence, 17(1-3): 185–203.
- FlowFormer: A transformer architecture for optical flow. In European Conference on Computer Vision, 668–685. Springer.
- Real-time intermediate flow estimation for video frame interpolation. In European Conference on Computer Vision, 624–642. Springer.
- LiteFlowNet: A lightweight convolutional neural network for optical flow estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 8981–8989.
- A lightweight optical flow CNN—revisiting data fidelity and regularization. IEEE Transactions on Pattern Analysis and Machine Intelligence, 43(8): 2555–2569.
- Perceiver IO: A General Architecture for Structured Inputs & Outputs. In International Conference on Learning Representations.
- Learning to estimate hidden motions with global motion aggregation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 9772–9781.
- Learning optical flow from a few matches. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 16592–16600.
- Segment Anything. arXiv preprint arXiv:2304.02643.
- The HCI benchmark suite: Stereo and flow ground truth with uncertainties for urban autonomous driving. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 19–28.
- ImageNet classification with deep convolutional neural networks. Advances in Neural Information Processing Systems, 25.
- An iterative image registration technique with an application to stereo vision. In IJCAI’81: 7th International Joint Conference on Artificial Intelligence, volume 2, 674–679.
- A large dataset to train convolutional networks for disparity, optical flow, and scene flow estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 4040–4048.
- Spring: A High-Resolution High-Detail Dataset and Benchmark for Scene Flow, Optical Flow and Stereo. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 4981–4991.
- VideoFlow: Exploiting temporal cues for multi-frame optical flow estimation. arXiv preprint arXiv:2303.08340.
- FlowFormer++: Masked cost volume autoencoding for pretraining optical flow estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 1599–1610.
- Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
- CRAFT: Cross-attentional flow transformer for robust optical flow. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 17602–17611.
- AutoFlow: Learning a better training set for optical flow. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 10093–10102.
- PWC-Net: CNNs for optical flow using pyramid, warping, and cost volume. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 8934–8943.
- Models matter, so does training: An empirical study of CNNs for optical flow estimation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 42(6): 1408–1423.
- SKFlow: Learning optical flow with super kernels. Advances in Neural Information Processing Systems, 35: 11313–11326.
- Optical flow guided feature: A fast and robust motion representation for video action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 1390–1399.
- RAFT: Recurrent all-pairs field transforms for optical flow. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part II 16, 402–419. Springer.
- Extracting and composing robust features with denoising autoencoders. In Proceedings of the 25th International Conference on Machine Learning, 1096–1103.
- GMFlow: Learning optical flow via global matching. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 8121–8130.
- Hierarchical discrete distribution decomposition for match density estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 6044–6053.
- Faster Segment Anything: Towards Lightweight SAM for Mobile Applications. arXiv preprint arXiv:2306.14289.
- Separable flow: Learning motion cost volumes for optical flow estimation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 10807–10817.
- DINO: DETR with Improved DeNoising Anchor Boxes for End-to-End Object Detection. In The Eleventh International Conference on Learning Representations.
- Global matching with overlapping attention for optical flow estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 17592–17601.