Relational Neural Expectation Maximization: Unsupervised Discovery of Objects and their Interactions (1802.10353v1)

Published 28 Feb 2018 in cs.LG, cs.AI, and cs.NE

Abstract: Common-sense physical reasoning is an essential ingredient for any intelligent agent operating in the real-world. For example, it can be used to simulate the environment, or to infer the state of parts of the world that are currently unobserved. In order to match real-world conditions this causal knowledge must be learned without access to supervised data. To address this problem we present a novel method that learns to discover objects and model their physical interactions from raw visual images in a purely unsupervised fashion. It incorporates prior knowledge about the compositional nature of human perception to factor interactions between object-pairs and learn efficiently. On videos of bouncing balls we show the superior modelling capabilities of our method compared to other unsupervised neural approaches that do not incorporate such prior knowledge. We demonstrate its ability to handle occlusion and show that it can extrapolate learned knowledge to scenes with different numbers of objects.

Authors (4)

Sjoerd van Steenkiste (33 papers)
Michael Chang (18 papers)
Klaus Greff (32 papers)
Jürgen Schmidhuber (124 papers)

Citations (285)

Summary

The paper discusses an approach to common-sense physical reasoning in machine learning, emphasizing the unsupervised discovery of objects and their interactions. The research introduces the Relational Neural Expectation Maximization (R-NEM) model, an extension of the Neural Expectation Maximization (N-EM) framework, designed to learn compositional object representations and their relational dynamics from raw visual images without supervision.

Modeling common-sense physical reasoning matters for AI because it emulates a core human cognitive capability: predicting and understanding how physical interactions in the environment unfold. Human cognition relies heavily on an intuitive understanding of objects and their behaviors, an insight supported by converging evidence from cognitive science and developmental psychology. The paper offers a machine learning method that similarly integrates object perception and interaction modeling, aiming to narrow the gap to more human-like understanding in artificial systems.

Methodology

R-NEM extends the N-EM framework with a relational mechanism that models how objects affect one another. This mechanism lets the system factorize interactions into object pairs, which improves the model's learning efficiency and its ability to generalize. The relational approach contrasts with other unsupervised neural methods that lack compositionality at the level of object representations, which prevents those models from effectively learning and generalizing interactions.

The core innovation lies in the formulation of the interaction function within R-NEM, which iteratively refines object representations through a recurrent neural network (RNN). The interaction function, denoted $\Upsilon^{\text{R-NEM}}$, acts as a message-passing neural network that processes pairwise effects between object representations. Because the same pairwise computation is applied to every pair, the function accommodates a varying number of objects per scene, which facilitates extrapolation beyond the specific configurations present in the training data.
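The pairwise message-passing idea described above can be illustrated with a minimal NumPy sketch. It is not the paper's architecture: the function name `pairwise_interaction`, the weights `w_pair` and `w_att`, and the choice of a single tanh layer with a sigmoid attention score are all assumptions made for illustration. The sketch only shows how one set of shared pairwise weights can serve any number of objects.

```python
import numpy as np

def pairwise_interaction(theta, w_pair, w_att):
    """Aggregate pairwise effects for each of K object representations.

    theta:  (K, D) array, one row per object representation.
    w_pair: (2*D, E) hypothetical weights mapping a concatenated pair to an effect.
    w_att:  (E,) hypothetical weights scoring how much each effect matters.
    Returns a (K, E) array: the summed, attention-weighted effect on each object.
    """
    K, _ = theta.shape
    # Build all ordered pairs (k, i): receiver k, sender i.
    receivers = np.repeat(theta[:, None, :], K, axis=1)    # (K, K, D)
    senders = np.repeat(theta[None, :, :], K, axis=0)      # (K, K, D)
    pairs = np.concatenate([receivers, senders], axis=-1)  # (K, K, 2D)

    # Shared pairwise computation (assumed single layer, for illustration).
    effects = np.tanh(pairs @ w_pair)                      # (K, K, E)
    attention = 1.0 / (1.0 + np.exp(-(effects @ w_att)))   # (K, K)

    # An object receives no effect from itself.
    attention = attention * (1.0 - np.eye(K))
    return (attention[..., None] * effects).sum(axis=1)    # (K, E)

rng = np.random.default_rng(0)
D, E = 4, 3
w_pair = rng.normal(size=(2 * D, E))
w_att = rng.normal(size=E)

# The same weights handle scenes with 3 objects and with 5 objects.
print(pairwise_interaction(rng.normal(size=(3, D)), w_pair, w_att).shape)
print(pairwise_interaction(rng.normal(size=(5, D)), w_pair, w_att).shape)
```

Because the weights act on pairs rather than on the whole scene, the parameter count does not depend on the number of objects, and reordering the objects simply reorders the output rows.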

Experimental Evaluation

The experiments assess R-NEM on three physical reasoning tasks: bouncing balls with variable mass, bouncing balls that pass behind an occluding curtain, and Space Invaders from the Arcade Learning Environment. These tasks vary in their dynamical and visual complexity. The results show that R-NEM consistently outperforms methods that lack structured priors reflecting real-world dynamics, such as LSTM and RNN baselines, in terms of Binomial Cross-Entropy (BCE) and relational BCE losses.
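For readers unfamiliar with the metric, a per-pixel binomial cross-entropy between a predicted frame and a binary target frame can be sketched as follows. The function name, the clipping constant `eps`, and the choice to sum over pixels are assumptions; this summary does not specify the paper's exact reduction or how the relational BCE variant is computed, so the latter is not shown.

```python
import numpy as np

def binomial_cross_entropy(pred, target, eps=1e-6):
    """Sum of per-pixel binomial cross-entropy.

    pred:   array of predicted pixel probabilities in [0, 1].
    target: array of the same shape with binary pixel values.
    eps:    assumed clipping constant that keeps log() finite.
    """
    pred = np.clip(pred, eps, 1.0 - eps)
    per_pixel = -(target * np.log(pred) + (1.0 - target) * np.log(1.0 - pred))
    return per_pixel.sum()

target = np.array([[1.0, 0.0], [0.0, 1.0]])
uninformed = np.full_like(target, 0.5)  # predicts 0.5 for every pixel

# An uninformed prediction costs ln(2) per pixel; a perfect one costs almost nothing.
print(binomial_cross_entropy(uninformed, target))
print(binomial_cross_entropy(target, target))
```

Lower values indicate that the predicted frame assigns higher probability to the observed pixels.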

A notable strength of R-NEM is demonstrated in its ability to generalize learned physical interactions to environments with varying numbers of objects. This extrapolation is particularly evident when tested on sequences containing a greater number of objects than seen during training. R-NEM's simulation capabilities are further exhibited through the recursive application of its dynamics equations for each object's learned state over time, showcasing accurate predictions even when objects are occluded.
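The recursive simulation described above amounts to feeding each predicted state back in as the next input. The sketch below shows only that control flow; `rollout` and the toy `step` function are hypothetical stand-ins, not R-NEM's learned dynamics.

```python
import numpy as np

def rollout(step, theta, n_steps):
    """Apply a one-step dynamics function recursively.

    step:    callable mapping a (K, D) state array to the next (K, D) state.
    theta:   initial (K, D) array of per-object states.
    n_steps: number of recursive applications.
    Returns an (n_steps + 1, K, D) array including the initial state.
    """
    states = [theta]
    for _ in range(n_steps):
        theta = step(theta)  # the prediction becomes the next input
        states.append(theta)
    return np.stack(states)

# Toy stand-in for learned dynamics: every state component halves each step.
trajectory = rollout(lambda t: 0.5 * t, np.ones((3, 2)), n_steps=3)
print(trajectory.shape)
print(trajectory[-1, 0, 0])
```

Since no observation is consumed inside the loop, the same procedure applies while an object is hidden from view, provided the one-step function itself is accurate.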

Implications and Future Directions

The research suggests that integrating inductive biases about the existence of objects and their interactions is crucial for achieving efficient, human-like generalization in AI systems that deal with physical environments. The compositional nature of R-NEM's object representations aligns with cognitive theories suggesting innate biases towards object-focused cognition in humans.

Future work may explore incorporating task-specific top-down feedback to facilitate dynamic groupings of objects within R-NEM, potentially enhancing its application in reinforcement learning. This approach could enable an agent to learn modular policies that generalize across novel configurations of known objects. Beyond these theoretical implications, practical applications may extend to robotics and interactive AI systems where understanding and anticipating physical interactions are paramount.

Overall, the R-NEM model provides a step towards an unsupervised basis for acquiring common-sense physical reasoning capabilities, fostering further advancements in AI's ability to model and predict complex physical realities.
