Online Adaptation of Convolutional Neural Networks for Video Object Segmentation (1706.09364v2)

Published 28 Jun 2017 in cs.CV

Abstract: We tackle the task of semi-supervised video object segmentation, i.e. segmenting the pixels belonging to an object in the video using the ground truth pixel mask for the first frame. We build on the recently introduced one-shot video object segmentation (OSVOS) approach which uses a pretrained network and fine-tunes it on the first frame. While achieving impressive performance, at test time OSVOS uses the fine-tuned network in unchanged form and is not able to adapt to large changes in object appearance. To overcome this limitation, we propose Online Adaptive Video Object Segmentation (OnAVOS) which updates the network online using training examples selected based on the confidence of the network and the spatial configuration. Additionally, we add a pretraining step based on objectness, which is learned on PASCAL. Our experiments show that both extensions are highly effective and improve the state of the art on DAVIS to an intersection-over-union score of 85.7%.

Authors (2)
  1. Paul Voigtlaender (24 papers)
  2. Bastian Leibe (94 papers)
Citations (385)

Summary

  • The paper introduces OnAVOS, an online adaptation strategy that continuously updates CNN weights to handle changing object appearances in video sequences.
  • It leverages a pretraining stage on PASCAL followed by fine-tuning on DAVIS, resulting in state-of-the-art segmentation performance with an 85.7% IoU score.
  • The approach employs robust online training example selection and dynamic parameter adjustment to mitigate model drift and enhance segmentation stability.

Online Adaptation of Convolutional Neural Networks for Video Object Segmentation

Voigtlaender and Leibe introduce an approach to semi-supervised video object segmentation that extends the one-shot video object segmentation (OSVOS) method with online adaptation. OSVOS fine-tunes a pretrained network on the first frame and then keeps it fixed at test time, so it cannot follow large changes in object appearance later in the sequence. The proposed Online Adaptive Video Object Segmentation (OnAVOS) addresses this limitation with an online update mechanism that keeps adapting the network as the video progresses.

Key Contributions

  1. Online Adaptation for VOS: Unlike OSVOS, which operates with a static model at test time, OnAVOS updates the network continuously as new frames are processed, using dynamically selected training examples. This adaptation is critical for maintaining segmentation quality when object appearance varies noticeably across frames.
  2. Pretraining on Objectness: OnAVOS adds a pretraining step that learns a general notion of objectness from the PASCAL dataset. This step precedes domain-specific fine-tuning on DAVIS, so the network starts from a foreground/background prior well suited to video object segmentation.
  3. Network Architecture: OnAVOS builds on a more recent, higher-capacity architecture than OSVOS, using a wide ResNet variant adapted for segmentation. The added capacity helps capture the contextual information needed for precise masks.
  4. Robust Online Training Example Selection: To prevent model drift, OnAVOS takes pixels predicted with very high confidence as additional positive training examples and uses the spatial configuration of the last predicted mask to choose negatives, while dynamically adjusting the online learning rate and loss weights to prioritize stability (see the sketch after this list).
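
To make the selection-and-update cycle concrete, the following is a minimal sketch of one online-adaptation round per frame, assuming a PyTorch binary-segmentation network. All names and hyperparameter values here (`CONF_THRESH`, `DIST_THRESH`, `ONLINE_LR`, `NEG_WEIGHT`, `select_online_examples`, `adapt_online`) are illustrative assumptions, not the authors' code or their tuned settings.

```python
import numpy as np
import torch
import torch.nn.functional as F
from scipy.ndimage import distance_transform_edt

# Illustrative hyperparameters; the paper tunes its own values.
CONF_THRESH = 0.97   # pixels above this foreground probability become positives (assumed)
DIST_THRESH = 50     # pixels farther than this from the last mask become negatives (assumed)
ONLINE_LR = 1e-5     # online learning rate, smaller than first-frame fine-tuning (assumed)
NEG_WEIGHT = 0.1     # down-weight negatives to favor stability (assumed)

def select_online_examples(fg_prob, last_mask):
    """Derive per-pixel training labels from the network's own prediction.

    fg_prob:   (H, W) foreground probability for the current frame
    last_mask: (H, W) binary mask predicted for the previous frame
    Returns labels in {1: positive, 0: negative, -1: ignored} and per-pixel weights.
    """
    labels = -np.ones_like(fg_prob, dtype=np.int64)   # ignore by default
    labels[fg_prob > CONF_THRESH] = 1                 # confident foreground -> positive
    dist = distance_transform_edt(last_mask == 0)     # distance of each pixel to the last mask
    labels[dist > DIST_THRESH] = 0                    # far from the object -> negative
    weights = np.where(labels == 0, NEG_WEIGHT, 1.0)
    return labels, weights

def adapt_online(net, frame, last_mask, steps=3):
    """Run a few gradient steps on self-selected examples for one frame."""
    opt = torch.optim.SGD(net.parameters(), lr=ONLINE_LR)
    with torch.no_grad():
        fg_prob = torch.sigmoid(net(frame))[0, 0].cpu().numpy()
    labels, weights = select_online_examples(fg_prob, last_mask)
    labels_t = torch.from_numpy(labels)
    weights_t = torch.from_numpy(weights).float()
    valid = labels_t >= 0                              # train only on selected pixels
    for _ in range(steps):
        logits = net(frame)[0, 0]
        loss = F.binary_cross_entropy_with_logits(
            logits[valid], labels_t[valid].float(), weight=weights_t[valid])
        opt.zero_grad()
        loss.backward()
        opt.step()
    return fg_prob > 0.5                               # mask carried to the next frame
```

The paper additionally keeps the ground-truth first frame in the online training mix to anchor the model against drift; the sketch omits this for brevity.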

Experimental Evaluation and Results

On the DAVIS benchmark, OnAVOS achieves a state-of-the-art intersection-over-union (IoU) score of 85.7%, surpassing prior methods such as OSVOS and MaskTrack. Experiments on YouTube-Objects further show that the approach transfers to a different data domain without extensive parameter retuning.
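
For reference, the IoU score used above (also called the Jaccard index) compares a predicted binary mask with the ground-truth mask; a minimal sketch of the per-frame computation, with DAVIS reporting the mean over frames and sequences:

```python
import numpy as np

def iou(pred, gt):
    """Intersection-over-union (Jaccard index) of two binary masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 1.0  # convention assumed here: two empty masks match perfectly
    return np.logical_and(pred, gt).sum() / union
```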

Implications and Future Directions

The OnAVOS approach exemplifies the potential of incorporating online adaptability into convolutional neural network frameworks for video object segmentation. Its strong performance suggests the method is a good fit for settings where object appearance changes rapidly, such as autonomous driving or interactive video editing.

Looking forward, this research argues for integrating temporal context more explicitly into deep segmentation models. While OnAVOS holds substantial promise, further refinement could incorporate motion prediction and long-term temporal dependencies, for example through hybrid models that combine convolutional networks with recurrent architectures or attention mechanisms.

In conclusion, OnAVOS marks a significant step forward in video object segmentation by introducing dynamic adaptability through online learning. Its broad applicability and strong numerical results open numerous directions for advancing computer vision systems.