
Sequential Attend, Infer, Repeat: Generative Modelling of Moving Objects (1806.01794v2)

Published 5 Jun 2018 in cs.LG, cs.CV, and stat.ML

Abstract: We present Sequential Attend, Infer, Repeat (SQAIR), an interpretable deep generative model for videos of moving objects. It can reliably discover and track objects throughout the sequence of frames, and can also generate future frames conditioned on the current frame, thereby simulating expected motion of objects. This is achieved by explicitly encoding object presence, locations and appearances in the latent variables of the model. SQAIR retains all strengths of its predecessor, Attend, Infer, Repeat (AIR, Eslami et al., 2016), including learning in an unsupervised manner, and addresses its shortcomings. We use a moving multi-MNIST dataset to show limitations of AIR in detecting overlapping or partially occluded objects, and show how SQAIR overcomes them by leveraging temporal consistency of objects. Finally, we also apply SQAIR to real-world pedestrian CCTV data, where it learns to reliably detect, track and generate walking pedestrians with no supervision.

Citations (249)

Summary

  • The paper extends AIR by integrating temporal state-space modeling to robustly track objects over sequential video frames.
  • It employs an unsupervised dual-stage process of propagation and discovery to effectively manage occlusions and overlapping objects.
  • It demonstrates improved object counting accuracy and reconstruction error on synthetic and real-world datasets, showcasing its practical applicability.

Sequential Attend, Infer, Repeat: A Deep Generative Model for Moving Objects

The paper "Sequential Attend, Infer, Repeat: Generative Modelling of Moving Objects" introduces SQAIR, a novel deep generative model specifically designed for sequential data, such as videos, which involve moving objects. At its core, SQAIR extends the capabilities of its predecessor, AIR, by incorporating a state-space model that leverages temporal information to improve the discovery and tracking of objects across video frames. This extension enables SQAIR to maintain temporal consistency in object representation across sequential frames, solving some of the inherent limitations of the AIR model, such as the inability to detect overlapping and partially occluded objects consistently.

Key Contributions

  1. Extension of AIR: SQAIR builds on the Attend, Infer, Repeat (AIR) framework with a sequential extension that factors in temporal information, allowing more robust object tracking over time. This is accomplished through a spatio-temporal state-space model that lets the generative model capture not only static representations but also dynamic changes in object appearance and location (see the latent-structure sketch after this list).
  2. Unsupervised Learning: Similar to AIR, SQAIR is trained in an unsupervised manner. This feature is particularly notable as it implies that the model does not require labeled datasets for training. Instead, it learns to decompose a video into its constituent objects and tracks their motion using solely the visual input.
  3. Handling Occlusions and Overlaps: By integrating temporal consistency into the generative and inference processes, SQAIR overcomes the challenge of identifying overlapping and partially occluded objects, a critical limitation observed in the AIR framework.
  4. Application to Real-World Data: Beyond synthetic datasets like moving MNIST, SQAIR's capabilities are extended to real-world pedestrian tracking using CCTV data. This highlights the model's versatility and its potential applicability in various practical scenarios where video data is prevalent.
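To make the per-object latent structure concrete, here is a minimal PyTorch sketch of the object state and of how present objects could be composed into a frame. The names z_pres, z_where, and z_what follow the AIR convention, but ObjectLatent, render_frame, and the decoder/stn modules are illustrative assumptions, not the authors' code.

```python
from dataclasses import dataclass
import torch

@dataclass
class ObjectLatent:
    """Per-object latent state in the AIR/SQAIR style (shapes are illustrative)."""
    z_pres: torch.Tensor   # presence indicator in {0, 1}
    z_where: torch.Tensor  # pose/scale parameters for a spatial transformer
    z_what: torch.Tensor   # appearance code, decoded into an image patch

def render_frame(objects, decoder, stn, canvas_shape):
    """Compose a frame: decode each present object's appearance and paste it
    onto an empty canvas at the pose given by z_where."""
    canvas = torch.zeros(canvas_shape)
    for obj in objects:
        if obj.z_pres.item() > 0.5:                    # skip absent objects
            patch = decoder(obj.z_what)                # appearance code -> glyph
            canvas = canvas + stn(patch, obj.z_where)  # place glyph on canvas
    return canvas
```

In the generative direction, z_pres controls how many objects exist at each step; in inference, the same variables are read off the observed frames.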

Methodology

  • Generative Model: The SQAIR model combines two processes: Propagation and Discovery. Propagation leverages past knowledge about an object to infer its current state, while Discovery adds new latent variables for objects appearing in the sequence, enabling detection of objects entering the scene (see the inference-loop sketch after this list).
  • Inference Process: The inference process of SQAIR employs multiple RNNs to manage temporal dynamics and relations between objects, thereby ensuring accurate object tracking. In particular, the model uses an inference network that accommodates changes in object presence, appearance, and location over time.
  • Evaluation Metrics: Performance is evaluated using several metrics, including log marginal likelihood, object counting accuracy, and interpretability of latent variables. SQAIR demonstrates superior performance over baseline models like AIR and VRNN, particularly in terms of temporal consistency and reconstruction error.
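The propagate-then-discover structure described above can be summarized in a few lines. In this sketch, `propagate` and `discover` stand in for the paper's RNN-based inference networks; their signatures, and the 0.5 presence threshold, are assumptions made for illustration.

```python
def infer_video(frames, propagate, discover):
    """Two-stage, SQAIR-style inference over a frame sequence (schematic)."""
    objects = []           # nothing is tracked before the first frame
    trajectories = []
    for x_t in frames:
        # Propagation: update each tracked object's presence, pose, and
        # appearance given its previous state; drop objects that disappear.
        propagated = []
        for z_prev in objects:
            z_new = propagate(x_t, z_prev)   # re-infer state from frame + history
            if z_new.z_pres.item() > 0.5:    # keep only objects still present
                propagated.append(z_new)
        # Discovery: an AIR-like pass over the parts of the frame not yet
        # explained by propagated objects, proposing latents for new entrants.
        new_objects = discover(x_t, propagated)
        objects = propagated + new_objects
        trajectories.append(list(objects))
    return trajectories
```

Training maximizes a variational lower bound on log p(x_{1:T}), which also yields the log marginal likelihood estimates reported in the evaluation; the counting accuracy reported above can be read directly off the inferred z_pres indicators.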

Implications and Future Work

SQAIR has significant implications for applications requiring robust, unsupervised tracking of multiple objects across video sequences, including autonomous driving, surveillance, and human-computer interaction, where understanding the dynamics and interactions of moving entities in a scene is essential.

Future research could focus on scaling the model to handle more complex scenes with dynamic backgrounds, improving efficiency for real-time applications, and exploring adversarial training methods to further enhance the generative capabilities. Additionally, integrating SQAIR into reinforcement learning frameworks could bolster the development of intelligent agents capable of sophisticated decision-making in temporally dynamic environments.

In conclusion, the SQAIR model represents a significant advancement in deep generative modeling for sequential data, providing a foundation for future exploration and development in the domain of unsupervised video analysis.
