Reasoning-Enhanced Object-Centric Learning for Videos (2403.15245v1)

Published 22 Mar 2024 in cs.CV, cs.AI, and cs.LG

Abstract: Object-centric learning aims to break down complex visual scenes into more manageable object representations, enhancing the understanding and reasoning abilities of machine learning systems toward the physical world. Recently, slot-based video models have demonstrated remarkable proficiency in segmenting and tracking objects, but they overlook the importance of an effective reasoning module. In the real world, reasoning and predictive abilities play a crucial role in human perception and object tracking; in particular, these abilities are closely related to human intuitive physics. Inspired by this, we designed a novel reasoning module called the Slot-based Time-Space Transformer with Memory buffer (STATM) to enhance the model's perception ability in complex scenes. The memory buffer primarily serves as storage for slot information from upstream modules, while the Slot-based Time-Space Transformer makes predictions through slot-based spatiotemporal attention computation and fusion. Our experimental results on various datasets show that STATM can significantly enhance the object-centric learning capabilities of slot-based video models.
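To make the described design more concrete, the following is a minimal, hypothetical sketch of an STATM-style reasoning module: a FIFO memory buffer that stores per-frame slots from an upstream slot-based video model (e.g., a SAVi-like corrector), and a time-space transformer that combines temporal attention over each slot's buffered history with spatial attention among the slots of the current frame. All class names, argument names, and dimensions below are illustrative assumptions, not the authors' reference implementation.

```python
# Illustrative sketch only; assumes an upstream module that emits per-frame
# slots of shape [batch, num_slots, slot_dim].
import torch
import torch.nn as nn


class SlotTimeSpaceTransformer(nn.Module):
    """Fuses temporal attention (each slot attends to its own buffered history)
    with spatial attention (slots of the current frame attend to each other)."""

    def __init__(self, slot_dim: int = 128, num_heads: int = 4):
        super().__init__()
        self.temporal_attn = nn.MultiheadAttention(slot_dim, num_heads, batch_first=True)
        self.spatial_attn = nn.MultiheadAttention(slot_dim, num_heads, batch_first=True)
        self.fuse = nn.Sequential(nn.LayerNorm(slot_dim), nn.Linear(slot_dim, slot_dim))

    def forward(self, slots: torch.Tensor, memory: torch.Tensor) -> torch.Tensor:
        # slots:  [B, S, D]    current-frame slots
        # memory: [B, T, S, D] buffered slots from the previous T frames
        B, T, S, D = memory.shape

        # Temporal attention: each slot index is treated as an independent
        # sequence over time; the current slot queries its own history.
        q = slots.reshape(B * S, 1, D)
        kv = memory.permute(0, 2, 1, 3).reshape(B * S, T, D)
        temporal, _ = self.temporal_attn(q, kv, kv)
        temporal = temporal.reshape(B, S, D)

        # Spatial attention: slots within the current frame attend to each other.
        spatial, _ = self.spatial_attn(slots, slots, slots)

        # Fuse the two attention streams into the predicted slots.
        return self.fuse(temporal + spatial)


class MemoryBuffer:
    """Fixed-length FIFO buffer holding slot states from upstream modules."""

    def __init__(self, max_len: int = 8):
        self.max_len = max_len
        self.frames = []

    def append(self, slots: torch.Tensor) -> None:
        self.frames.append(slots.detach())
        if len(self.frames) > self.max_len:
            self.frames.pop(0)

    def as_tensor(self) -> torch.Tensor:
        # Stack buffered frames into [B, T, S, D].
        return torch.stack(self.frames, dim=1)
```

In a per-frame loop, one would append the latest slots to the buffer and then call the transformer with the current slots and the buffered history (e.g., `buffer.append(slots_t); predicted = statm(slots_t, buffer.as_tensor())`); how the fused output feeds back into the slot corrector is a design choice this sketch leaves open.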
