SPOT: Self-Training with Patch-Order Permutation for Object-Centric Learning with Autoregressive Transformers (2312.00648v3)
Abstract: Unsupervised object-centric learning aims to decompose scenes into interpretable object entities, termed slots. Slot-based auto-encoders stand out as a prominent method for this task. Within them, crucial aspects include guiding the encoder to generate object-specific slots and ensuring the decoder utilizes them during reconstruction. This work introduces two novel techniques, (i) an attention-based self-training approach, which distills superior slot-based attention masks from the decoder to the encoder, enhancing object segmentation, and (ii) an innovative patch-order permutation strategy for autoregressive transformers that strengthens the role of slot vectors in reconstruction. The effectiveness of these strategies is showcased experimentally. The combined approach significantly surpasses prior slot-based autoencoder methods in unsupervised object segmentation, especially with complex real-world images. We provide the implementation code at https://github.com/gkakogeorgiou/spot.
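The patch-order permutation idea can be sketched in a few lines. This is a minimal illustrative sketch, not the paper's implementation: the helper `permuted_patch_order` is a hypothetical name, and the sketch only shows the core idea that the autoregressive decoder predicts patches in a randomized rather than fixed raster order, so it cannot rely on a stable left-to-right pixel context and must draw more information from the slot vectors.

```python
import random

def permuted_patch_order(num_patches, seed=None):
    # Hypothetical helper: produce a random ordering of patch indices
    # for autoregressive reconstruction, replacing the fixed raster
    # (top-left to bottom-right) sequence.
    rng = random.Random(seed)
    order = list(range(num_patches))
    rng.shuffle(order)
    return order

# Example: a 4x4 grid of patches (16 tokens). The decoder would then
# predict patch order[1] conditioned on patch order[0] and the slots,
# patch order[2] conditioned on the first two and the slots, and so on.
order = permuted_patch_order(16, seed=0)
assert sorted(order) == list(range(16))  # a true permutation: every patch appears once
```

In training, a fresh permutation per sample (or per step) would prevent the decoder from memorizing positional shortcuts, which is what strengthens the slots' role in reconstruction.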