Cycle Consistency Driven Object Discovery (2306.02204v2)
Abstract: Developing deep learning models that effectively learn object-centric representations, akin to human cognition, remains a challenging task. Existing approaches facilitate object discovery by representing objects as fixed-size vectors, called "slots" or "object files". While these approaches have shown promise in certain scenarios, they still exhibit certain limitations. First, they rely on architectural priors which can be unreliable and usually require meticulous engineering to identify the correct objects. Second, there has been a notable gap in investigating the practical utility of these representations in downstream tasks. To address the first limitation, we introduce a method that explicitly optimizes the constraint that each object in a scene should be associated with a distinct slot. We formalize this constraint by introducing consistency objectives which are cyclic in nature. By integrating these consistency objectives into various existing slot-based object-centric methods, we showcase substantial improvements in object-discovery performance. These enhancements consistently hold true across both synthetic and real-world scenes, underscoring the effectiveness and adaptability of the proposed approach. To tackle the second limitation, we apply the learned object-centric representations from the proposed method to two downstream reinforcement learning tasks, demonstrating considerable performance enhancements compared to conventional slot-based and monolithic representation learning methods. Our results suggest that the proposed approach not only improves object discovery, but also provides richer features for downstream tasks.
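The abstract does not spell out the form of the cyclic consistency objectives. As a purely illustrative sketch (an assumption, not the paper's formulation), one way to encourage each slot to bind to a distinct part of the scene is to walk from slots to image features and back, and to penalize round trips that do not return to the starting slot. The function names, the dot-product similarity, and the `temperature` parameter below are all hypothetical choices made for this example.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def round_trip_matrix(slots, features, temperature=0.1):
    """Slot -> feature -> slot transition probabilities, shape (K, K).

    slots:    (K, D) array, one row per slot (hypothetical input).
    features: (N, D) array, one row per image feature (hypothetical input).
    """
    sim = slots @ features.T / temperature      # (K, N) scaled similarities
    slot_to_feat = softmax(sim, axis=1)         # each slot distributes over features
    feat_to_slot = softmax(sim.T, axis=1)       # each feature distributes over slots
    return slot_to_feat @ feat_to_slot          # rows are probability distributions


def slot_cycle_loss(slots, features, temperature=0.1):
    """Cross-entropy between the round-trip matrix and the identity.

    The loss is small when every slot returns to itself after the walk,
    which is only possible if slots attach to different features.
    """
    rt = round_trip_matrix(slots, features, temperature)
    return float(-np.mean(np.log(np.diag(rt) + 1e-12)))


if __name__ == "__main__":
    features = np.repeat(np.eye(3), 2, axis=0)  # 6 features in 3 distinct groups
    distinct = np.eye(3)                        # one slot per group
    collapsed = np.ones((3, 3))                 # all slots identical
    print("distinct slots :", slot_cycle_loss(distinct, features))
    print("collapsed slots:", slot_cycle_loss(collapsed, features))
```

In this toy setup, slots that each match a different feature group give a loss near zero, while identical slots cannot be told apart on the return trip and give a loss of log K. A term of this kind would be added to the existing training loss of a slot-based model; how the paper weights or combines its objectives is not stated in the abstract.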
Authors: Aniket Didolkar, Anirudh Goyal, Yoshua Bengio