Neural Slot Interpreters: Grounding Object Semantics in Emergent Slot Representations (2403.07887v2)

Published 2 Feb 2024 in cs.CV and cs.AI

Abstract: Several accounts of human cognition posit that our intelligence is rooted in our ability to form abstract composable concepts, ground them in our environment, and reason over these grounded entities. This trifecta of human thought has remained elusive in modern intelligent machines. In this work, we investigate whether slot representations extracted from visual scenes serve as appropriate compositional abstractions for grounding and reasoning. We present the Neural Slot Interpreter (NSI), which learns to ground object semantics in slots. At the core of NSI is an XML-like schema that uses simple syntax rules to organize the object semantics of a scene into object-centric schema primitives. Then, the NSI metric learns to ground primitives into slots through a structured objective that reasons over the intermodal alignment. We show that the grounded slots surpass unsupervised slots in real-world object discovery and scale with scene complexity. Experiments with a bi-modal object-property and scene retrieval task demonstrate the grounding efficacy and interpretability of correspondences learned by NSI. Finally, we investigate the reasoning abilities of the grounded slots. Vision Transformers trained on grounding-aware NSI tokenizers using as few as ten tokens outperform patch-based tokens on challenging few-shot classification tasks.
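
Below is a minimal, illustrative sketch of the two ingredients the abstract describes: an XML-like scene schema whose `<object>` elements act as object-centric primitives, and a structured score that grounds those primitives in slots via intermodal similarity. Everything here is an assumption inferred from the abstract alone; the schema fields, the `PrimitiveEncoder`, and the `alignment_score` function are hypothetical stand-ins, not the authors' NSI implementation or training objective.

```python
# Hedged sketch of the ideas in the abstract, NOT the authors' code.
# Schema fields, class/function names, and the alignment rule are assumptions.
import torch
import torch.nn as nn
import xml.etree.ElementTree as ET

# An XML-like scene schema: each <object> is a schema primitive holding object semantics.
SCENE_XML = """
<scene>
  <object shape="cube" color="red" material="rubber" size="small"/>
  <object shape="sphere" color="blue" material="metal" size="large"/>
</scene>
"""

def parse_primitives(xml_str):
    """Parse the XML-like schema into a list of attribute dicts (one per object primitive)."""
    root = ET.fromstring(xml_str)
    return [obj.attrib for obj in root.findall("object")]

class PrimitiveEncoder(nn.Module):
    """Embed each primitive's attribute tokens and mean-pool them into one vector (an assumption)."""
    def __init__(self, vocab, dim=64):
        super().__init__()
        self.vocab = {tok: i for i, tok in enumerate(vocab)}
        self.emb = nn.Embedding(len(vocab), dim)

    def forward(self, primitives):
        vecs = []
        for p in primitives:
            ids = torch.tensor([self.vocab[v] for v in p.values()])
            vecs.append(self.emb(ids).mean(dim=0))
        return torch.stack(vecs)                      # (num_primitives, dim)

def alignment_score(slots, prims):
    """Structured intermodal score: each primitive is matched to its best-fitting slot.
    One plausible instantiation of 'reasoning over the intermodal alignment'."""
    sim = prims @ slots.T                             # (num_primitives, num_slots)
    return sim.max(dim=1).values.sum()                # sum of best slot match per primitive

if __name__ == "__main__":
    prims_attrs = parse_primitives(SCENE_XML)
    vocab = sorted({v for p in prims_attrs for v in p.values()})
    prim_enc = PrimitiveEncoder(vocab)
    slots = torch.randn(7, 64)                        # placeholder for slots from a slot encoder
    score = alignment_score(slots, prim_enc(prims_attrs))
    print(float(score))                               # higher = better slot-primitive grounding
```

In the paper's setting, the slots would come from a slot-attention-style visual encoder and the grounding would be learned with a structured objective over many scenes; the random `slots` tensor above only stands in for those representations.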

