Rethinking Amodal Video Segmentation from Learning Supervised Signals with Object-centric Representation (2309.13248v1)

Published 23 Sep 2023 in cs.CV

Abstract: Video amodal segmentation is a particularly challenging task in computer vision that requires deducing the full shape of an object from its visible parts. Recently, some studies have achieved promising performance by using motion flow to integrate information across frames in a self-supervised setting. However, motion flow is clearly limited by two factors: moving cameras and object deformation. This paper rethinks previous work. In particular, we leverage supervised signals with object-centric representation in real-world scenarios. The underlying idea is that the supervision signal of a specific object and the features from different views can mutually benefit the deduction of the full mask in any specific frame. We thus propose Efficient object-centric Representation amodal Segmentation (EoRaS). Specifically, beyond relying solely on supervision signals, we design a translation module that projects image features into the Bird's-Eye View (BEV), which introduces 3D information to improve the quality of the current features. Furthermore, we propose a temporal module based on a multi-view fusion layer, which is equipped with a set of object slots and interacts with features from different views through an attention mechanism to complete the object representation. As a result, the full mask of the object can be decoded from image features updated by the object slots. Extensive experiments on both real-world and synthetic benchmarks demonstrate the superiority of our proposed method, which achieves state-of-the-art performance. Our code will be released at https://github.com/kfan21/EoRaS.
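
The interaction between object slots and multi-view features described in the abstract can be illustrated with a minimal sketch. This is not the EoRaS implementation: it assumes plain scaled dot-product cross-attention with no learned projections, and the function name `slot_cross_attention`, the toy dimensions, and the random inputs are all hypothetical. Slots act as queries over the concatenated image-view and BEV-view features, and each slot is updated residually with the features it attends to.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def slot_cross_attention(slots, view_features):
    """Update each slot by attending over features from all views.

    slots: list of K vectors, each of dimension D (hypothetical object slots).
    view_features: list of views; each view is a list of D-dim feature vectors.
    Returns (updated_slots, attention_weights).
    """
    # Concatenate tokens from every view into one key/value set.
    tokens = [f for view in view_features for f in view]
    dim = len(slots[0])
    scale = 1.0 / math.sqrt(dim)
    updated, weights = [], []
    for slot in slots:
        # Attention of this slot over all tokens (sums to 1).
        attn = softmax([dot(slot, t) * scale for t in tokens])
        attended = [sum(a * t[d] for a, t in zip(attn, tokens)) for d in range(dim)]
        # Residual update: the slot keeps its state and adds attended features.
        updated.append([s + v for s, v in zip(slot, attended)])
        weights.append(attn)
    return updated, weights


# Toy inputs: 2 slots, 6 image-view tokens, 5 BEV-view tokens, dimension 4.
random.seed(0)
D = 4
slots = [[random.gauss(0, 1) for _ in range(D)] for _ in range(2)]
image_view = [[random.gauss(0, 1) for _ in range(D)] for _ in range(6)]
bev_view = [[random.gauss(0, 1) for _ in range(D)] for _ in range(5)]
updated, weights = slot_cross_attention(slots, [image_view, bev_view])
```

In this sketch the softmax is taken over feature tokens, as in standard cross-attention; a real layer would typically add learned query, key, and value projections and normalization, which are omitted here for brevity.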
