Exploring the Effectiveness of Object-Centric Representations in Visual Question Answering: Comparative Insights with Foundation Models (arXiv:2407.15589v5)

Published 22 Jul 2024 in cs.CV and cs.LG

Abstract: Object-centric (OC) representations, which model visual scenes as compositions of discrete objects, have the potential to be used in various downstream tasks to achieve systematic compositional generalization and facilitate reasoning. However, these claims have yet to be thoroughly validated empirically. Recently, foundation models have demonstrated unparalleled capabilities across diverse domains, from language to computer vision, positioning them as a potential cornerstone of future research for a wide range of computational tasks. In this paper, we conduct an extensive empirical study on representation learning for downstream Visual Question Answering (VQA), which requires an accurate compositional understanding of the scene. We thoroughly investigate the benefits and trade-offs of OC models and alternative approaches including large pre-trained foundation models on both synthetic and real-world data, ultimately identifying a promising path to leverage the strengths of both paradigms. The extensiveness of our study, encompassing over 600 downstream VQA models and 15 different types of upstream representations, also provides several additional insights that we believe will be of interest to the community at large.
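
To make the setup concrete, here is a minimal, hypothetical sketch of the kind of downstream VQA probe such a study trains on top of frozen upstream representations (e.g., slots from an object-centric model or patch tokens from a foundation model like DINOv2). The abstract does not specify the downstream architecture, so the transformer-head design, dimensions, and names below are illustrative assumptions, not the authors' method.

```python
# Hypothetical downstream VQA probe over frozen upstream features (illustrative only).
# Assumption: a small transformer encoder attends jointly over visual tokens
# (slots or patches) and question token embeddings, then classifies over a
# fixed answer vocabulary. Only this probe is trained; the upstream model is frozen.
import torch
import torch.nn as nn

class VQAProbe(nn.Module):
    def __init__(self, vis_dim, txt_dim, d_model=256, n_heads=4, n_layers=4, n_answers=28):
        super().__init__()
        self.vis_proj = nn.Linear(vis_dim, d_model)   # project slots / patch tokens
        self.txt_proj = nn.Linear(txt_dim, d_model)   # project question token embeddings
        self.cls = nn.Parameter(torch.zeros(1, 1, d_model))  # learnable [CLS] token
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, n_answers)     # classifier over answer vocabulary

    def forward(self, vis_tokens, txt_tokens):
        # vis_tokens: (B, K, vis_dim) frozen upstream features (e.g., K slots)
        # txt_tokens: (B, T, txt_dim) question token embeddings
        B = vis_tokens.size(0)
        tokens = torch.cat([self.cls.expand(B, -1, -1),
                            self.vis_proj(vis_tokens),
                            self.txt_proj(txt_tokens)], dim=1)
        return self.head(self.encoder(tokens)[:, 0])  # answer logits from [CLS]

# Usage sketch: train on (image, question, answer) triples with cross-entropy,
# swapping in different frozen upstream encoders to compare representations.
probe = VQAProbe(vis_dim=64, txt_dim=300)
logits = probe(torch.randn(2, 7, 64), torch.randn(2, 12, 300))
```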
