Learning Motion Patterns in Videos: A CNN-Based Framework
The paper under review introduces a novel approach for learning motion patterns in videos, targeting the segmentation of independently moving objects in video sequences. The proposed framework is a fully convolutional network (FCN) trained on synthetic video sequences with accompanying ground-truth optical flow and motion segmentation. The architecture follows an encoder-decoder design: the encoder builds a coarse representation of the motion field, and the decoder progressively refines it into high-resolution motion labels. The paper presents strong empirical evidence for the efficacy of the framework, backed by comparisons on benchmark datasets.
Summary of Methods and Architecture
The central component of the proposed method is the motion pattern network (MP-Net), a convolutional neural network modeled after encoder-decoder architectures. The network is trained to discern motion patterns from optical flow input through a two-label classification task that separates independent object motion from camera motion. It takes the optical flow field between consecutive video frames, processes it through encoding and decoding layers, and produces per-pixel motion labels, initially at a lower resolution and then refined back to the original resolution.
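To make the data flow concrete, the following is a minimal PyTorch sketch of such an encoder-decoder network, mapping a 2-channel optical flow field to per-pixel motion probabilities; the layer counts and channel widths are illustrative assumptions rather than the exact MP-Net configuration.

    import torch
    import torch.nn as nn

    class MotionSegNet(nn.Module):
        """Illustrative encoder-decoder for motion labelling from optical flow
        (an assumption-level sketch, not the authors' exact MP-Net)."""
        def __init__(self):
            super().__init__()
            # Encoder: downsample the 2-channel flow field (dx, dy) into a coarse representation.
            self.enc1 = nn.Sequential(nn.Conv2d(2, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2))
            self.enc2 = nn.Sequential(nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2))
            # Decoder: upsample back to the input resolution.
            self.dec1 = nn.Sequential(nn.ConvTranspose2d(64, 32, 2, stride=2), nn.ReLU())
            self.dec2 = nn.ConvTranspose2d(32, 1, 2, stride=2)

        def forward(self, flow):
            # flow: (batch, 2, H, W) optical flow between two consecutive frames.
            x = self.enc2(self.enc1(flow))
            logits = self.dec2(self.dec1(x))      # (batch, 1, H, W) per-pixel motion logits
            return torch.sigmoid(logits)          # probability of "independently moving"

    # Example: a dummy 64x64 flow field.
    net = MotionSegNet()
    print(net(torch.randn(1, 2, 64, 64)).shape)   # torch.Size([1, 1, 64, 64])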
The training phase employs the FlyingThings3D dataset, whose synthetic scenes with labeled moving objects provide the data needed for learning. Through this training regime, the network learns to differentiate motion cues effectively, capturing both object and scene context, which is crucial for resolving motion ambiguities. Although trained only on synthetic data, the network generalizes to real-world video, achieving strong results on benchmark datasets such as DAVIS and the Berkeley motion segmentation dataset.
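Under this regime, a training step reduces to per-pixel binary classification against the synthetic ground-truth masks. The sketch below reuses the MotionSegNet sketch above and assumes a hypothetical data loader yielding (flow, mask) pairs; it is not the paper's actual training code.

    import torch
    import torch.nn as nn

    net = MotionSegNet()                                  # sketch network from above
    optimizer = torch.optim.Adam(net.parameters(), lr=1e-4)
    criterion = nn.BCELoss()                              # per-pixel binary cross-entropy

    def train_epoch(synthetic_loader):
        # `synthetic_loader` is hypothetical: it yields ground-truth flow (B, 2, H, W)
        # and binary motion masks (B, 1, H, W) from a synthetic dataset such as FlyingThings3D.
        net.train()
        for flow, mask in synthetic_loader:
            optimizer.zero_grad()
            pred = net(flow)                              # predicted motion probabilities
            loss = criterion(pred, mask.float())
            loss.backward()
            optimizer.step()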
Experimental Results and Indicative Performance
The paper presents compelling results, demonstrating the model's strong performance on video object segmentation. Notably, the approach outperforms the leading methods on the DAVIS benchmark, improving the intersection over union (IoU) score by 5.6%. It also delivers state-of-the-art results on the Berkeley motion segmentation dataset.
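For reference, the IoU metric underlying these numbers compares predicted and ground-truth binary masks; a small self-contained utility (not tied to the benchmark's evaluation code) is shown below.

    import numpy as np

    def iou(pred_mask, gt_mask):
        """Intersection over union of two binary masks of shape (H, W)."""
        pred, gt = pred_mask.astype(bool), gt_mask.astype(bool)
        union = np.logical_or(pred, gt).sum()
        if union == 0:
            return 1.0                                    # both masks empty: treat as perfect agreement
        return np.logical_and(pred, gt).sum() / union

    # Example: two 4x4 masks overlapping on one row.
    a = np.zeros((4, 4), dtype=np.uint8); a[:2, :] = 1
    b = np.zeros((4, 4), dtype=np.uint8); b[1:3, :] = 1
    print(round(iou(a, b), 3))                            # 4 / 12 -> 0.333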
To attain these results, various architectural and training choices were explored, including the use of optical flow versus RGB data as input, the effect of training on synthetic versus real-world data, and the addition of auxiliary components such as object proposals for refining motion labels. Combining the raw optical flow with objectness maps and a conditional random field (CRF) model further improves segmentation accuracy and robustness to errors in optical flow estimation.
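One simple way to picture this combination is an element-wise fusion of the motion probabilities with an objectness map before thresholding and CRF refinement; the sketch below illustrates the general idea under that assumption, not the paper's exact formulation.

    import numpy as np

    def fuse_motion_objectness(motion_prob, objectness, threshold=0.5):
        """Fuse per-pixel motion and objectness probabilities (both in [0, 1]) into a
        binary segmentation. The product fusion and the threshold are illustrative
        assumptions; in the paper the result is further refined with a CRF."""
        fused = motion_prob * objectness   # suppress "moving" pixels unlikely to belong to objects
        return (fused >= threshold).astype(np.uint8)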
Implications and Future Directions
The paper proposes a promising direction for video analysis, suggesting advancements in both theoretical and application-focused aspects. It sets the stage for future research on end-to-end models that integrate motion cues with semantic segmentation, strengthening the understanding of scene dynamics in an unsupervised or semi-supervised manner. Additionally, temporal integration mechanisms, such as memory modules, could further improve the temporal consistency of the segmentations.
Applying the network to annotate large datasets automatically would be of significant value for advancing research in video understanding, since manual labeling is tedious and expensive. Furthermore, incorporating learned motion patterns into higher-level vision tasks, such as action recognition or object tracking, could lead to novel insights and improvements within these domains.