Unsupervised Object-Centric Learning from Multiple Unspecified Viewpoints (2401.01922v1)

Published 3 Jan 2024 in cs.CV and cs.LG

Abstract: Visual scenes are extremely diverse, not only because there are infinite possible combinations of objects and backgrounds, but also because observations of the same scene may vary greatly as the viewpoint changes. When observing a multi-object visual scene from multiple viewpoints, humans can perceive the scene compositionally from each viewpoint while achieving so-called "object constancy" across different viewpoints, even though the exact viewpoints are untold. This ability is essential for humans to identify the same object while moving and to learn from vision efficiently. It is intriguing to design models with a similar ability. In this paper, we consider the novel problem of learning compositional scene representations from multiple unspecified (i.e., unknown and unrelated) viewpoints without any supervision, and propose a deep generative model that separates latent representations into a viewpoint-independent part and a viewpoint-dependent part to solve this problem. During inference, latent representations are randomly initialized and iteratively updated by integrating information from the different viewpoints with neural networks. Experiments on several specifically designed synthetic datasets show that the proposed method can effectively learn from multiple unspecified viewpoints.
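
The inference procedure described in the abstract admits a compact sketch. Below is a minimal illustration of the idea only, not the authors' actual architecture: viewpoint-independent object latents are shared across all views, viewpoint-dependent latents are kept per view, and both are randomly initialized and then refined over a few steps by a network that pools evidence across viewpoints. All names, dimensions, and the update rule (OBJ_DIM, RefineNet, additive refinement, mean pooling) are assumptions made for this sketch.

```python
# Minimal sketch of the inference idea from the abstract -- NOT the authors'
# actual architecture. OBJ_DIM, VIEW_DIM, NUM_OBJECTS, RefineNet, and the
# additive mean-pooled update rule are all illustrative assumptions.
import torch
import torch.nn as nn

OBJ_DIM, VIEW_DIM, NUM_OBJECTS = 32, 8, 4


class RefineNet(nn.Module):
    """Maps per-view features plus current latents to latent updates."""

    def __init__(self, feat_dim: int):
        super().__init__()
        in_dim = feat_dim + OBJ_DIM + VIEW_DIM
        self.mlp = nn.Sequential(nn.Linear(in_dim, 128), nn.ReLU(),
                                 nn.Linear(128, OBJ_DIM + VIEW_DIM))

    def forward(self, feats, z_obj, z_view):
        # feats: [V, feat_dim], z_obj: [K, OBJ_DIM], z_view: [V, VIEW_DIM]
        V, K = feats.shape[0], z_obj.shape[0]
        x = torch.cat([
            feats[:, None].expand(V, K, -1),      # per-view evidence
            z_obj[None].expand(V, K, -1),         # shared object latents
            z_view[:, None, :].expand(V, K, -1),  # per-view latents
        ], dim=-1)
        delta = self.mlp(x)  # [V, K, OBJ_DIM + VIEW_DIM]
        d_obj, d_view = delta.split([OBJ_DIM, VIEW_DIM], dim=-1)
        # Object updates are averaged over viewpoints, so the object latents
        # stay viewpoint-independent; view updates are averaged over objects.
        return d_obj.mean(dim=0), d_view.mean(dim=1)


def infer(feats, refine, num_iters=5):
    """Randomly initialize latents, then iteratively refine them."""
    z_obj = torch.randn(NUM_OBJECTS, OBJ_DIM)
    z_view = torch.randn(feats.shape[0], VIEW_DIM)
    for _ in range(num_iters):
        d_obj, d_view = refine(feats, z_obj, z_view)
        z_obj, z_view = z_obj + d_obj, z_view + d_view
    return z_obj, z_view


# Usage with features from any per-view image encoder (random stand-ins here):
z_obj, z_view = infer(torch.randn(6, 64), RefineNet(feat_dim=64))
```

Averaging the object-latent updates over viewpoints is what keeps the object latents viewpoint-independent in this sketch (a stand-in for "object constancy"), while the per-view latents are free to absorb whatever varies with the camera.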

Authors (5)
  1. Jinyang Yuan (5 papers)
  2. Tonglin Chen (5 papers)
  3. Zhimeng Shen (4 papers)
  4. Bin Li (514 papers)
  5. Xiangyang Xue (169 papers)
Citations (2)
