Boosting Object Representation Learning via Motion and Object Continuity
Abstract: Recent unsupervised multi-object detection models have shown impressive performance improvements, largely attributed to novel architectural inductive biases. Unfortunately, they may produce suboptimal object encodings for downstream tasks. To overcome this, we propose to exploit object motion and continuity, i.e., the fact that objects do not pop in and out of existence. This is accomplished through two mechanisms: (i) providing priors on the location of objects by integrating optical flow, and (ii) a contrastive object continuity loss across consecutive image frames. Rather than developing an explicit deep architecture, the resulting Motion and Object Continuity (MOC) scheme can be instantiated using any baseline object detection model. Our results show large improvements in the performance of a state-of-the-art model in terms of object discovery, convergence speed, and overall latent object representations, particularly for playing Atari games. Overall, we show clear benefits of integrating motion and object continuity for downstream tasks, moving beyond object representation learning based only on reconstruction.
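To make mechanism (ii) more concrete, the following is a minimal sketch of what a contrastive object continuity loss across two consecutive frames could look like. It is not the loss defined in the paper: the function names `cosine` and `continuity_loss`, the nearest-position matching rule, the InfoNCE-style form, and the `temperature` value are all assumptions chosen for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two encoding vectors (lists of floats)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-8)

def continuity_loss(enc_t, pos_t, enc_next, pos_next, temperature=0.1):
    """Hypothetical InfoNCE-style continuity loss (illustrative assumption).

    For each object in frame t, the spatially nearest object in frame t+1
    is treated as the positive; all other objects in frame t+1 act as
    negatives. The loss is low when matched objects have similar encodings.
    """
    if not enc_t or not enc_next:
        return 0.0
    total = 0.0
    for z, p in zip(enc_t, pos_t):
        # Assumed matching rule: nearest detected position in the next frame.
        pos_idx = min(range(len(pos_next)),
                      key=lambda j: math.dist(p, pos_next[j]))
        logits = [cosine(z, z2) / temperature for z2 in enc_next]
        # Numerically stable log-sum-exp over all candidates.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denom - logits[pos_idx]
    return total / len(enc_t)

# Two objects that move slightly between frames.
enc_t = [[1.0, 0.0], [0.0, 1.0]]
pos_t = [(0.0, 0.0), (10.0, 10.0)]
pos_next = [(1.0, 0.0), (10.0, 11.0)]

consistent = continuity_loss(enc_t, pos_t, [[1.0, 0.0], [0.0, 1.0]], pos_next)
swapped = continuity_loss(enc_t, pos_t, [[0.0, 1.0], [1.0, 0.0]], pos_next)
print(consistent < swapped)
```

In this sketch, encodings that stay stable for the same object across frames yield a lower loss than encodings that swap between objects, which is the behavior an object continuity objective is meant to reward.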