Leveraging Image Augmentation for Object Manipulation: Towards Interpretable Controllability in Object-Centric Learning (2310.08929v4)
Abstract: The binding problem in artificial neural networks is actively explored with the goal of achieving human-level recognition skills by comprehending the world in terms of symbol-like entities. In computer vision in particular, object-centric learning (OCL) is extensively researched as a way to better understand complex scenes by acquiring object representations, or slots. While recent OCL studies have made strides with complex images and videos, the interpretability of object representations and the ability to interact with them remain largely uncharted, though promising, directions. In this paper, we introduce a novel method, Slot Attention with Image Augmentation (SlotAug), to explore the possibility of learning interpretable controllability over slots in a self-supervised manner by utilizing an image augmentation strategy. We also devise the concept of sustainability in controllable slots by introducing iterative and reversible controls over slots with two proposed submethods: Auxiliary Identity Manipulation and Slot Consistency Loss. Extensive empirical studies and theoretical validation confirm the effectiveness of our approach, offering a novel capability for interpretable and sustainable control of object representations.