Explicitly Disentangled Representations in Object-Centric Learning (2401.10148v1)
Abstract: Extracting structured representations from raw visual data is an important and long-standing challenge in machine learning. Recently, techniques for unsupervised learning of object-centric representations have attracted growing interest. In this context, making the latent features more robust can improve the efficiency and effectiveness of training on downstream tasks. A promising step in this direction is to disentangle the factors that cause variation in the data. Previously, Invariant Slot Attention disentangled position, scale, and orientation from the remaining features. Extending this approach, we focus on separating the shape and texture components. In particular, we propose a novel architecture that biases object-centric models toward disentangling shape and texture components into two non-overlapping subsets of the latent space dimensions. These subsets are fixed a priori, that is, before training begins. Experiments on a range of object-centric benchmarks show that our approach achieves the desired disentanglement while also numerically improving baseline performance in most cases. In addition, we show that our method can generate novel textures for a specific object or transfer textures between objects with distinct shapes.
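The idea of two non-overlapping, predetermined latent subsets can be illustrated with a toy sketch. This is not the paper's architecture: the slot size, the choice of which dimensions hold shape versus texture, and the helper names `split_slot` and `transfer_texture` are all assumptions made for illustration. It only shows what it means for the partition to be fixed before training, and how texture transfer reduces to swapping one subset between two object latents.

```python
# Toy illustration only: slot size and index ranges are assumed, not from the paper.
SLOT_DIM = 8
SHAPE_DIMS = slice(0, 4)     # assumed: first half of the latent encodes shape
TEXTURE_DIMS = slice(4, 8)   # assumed: second half encodes texture


def split_slot(slot):
    """Return the (shape, texture) sub-vectors of one object latent."""
    return slot[SHAPE_DIMS], slot[TEXTURE_DIMS]


def transfer_texture(target, source):
    """Keep the target's shape dimensions and take the source's texture dimensions."""
    target_shape, _ = split_slot(target)
    _, source_texture = split_slot(source)
    return list(target_shape) + list(source_texture)


# Two hypothetical object latents (hand-written values, not model outputs).
slot_a = [0.1, 0.2, 0.3, 0.4, 1.0, 1.1, 1.2, 1.3]
slot_b = [0.5, 0.6, 0.7, 0.8, 2.0, 2.1, 2.2, 2.3]

mixed = transfer_texture(slot_a, slot_b)
```

In a trained model, decoding `mixed` would be expected to render the shape of the first object with the texture of the second, provided the two subsets are in fact disentangled.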