Refining Pre-Trained Motion Models (2401.00850v2)

Published 1 Jan 2024 in cs.CV and cs.AI

Abstract: Given the difficulty of manually annotating motion in video, the current best motion estimation methods are trained with synthetic data, and therefore struggle somewhat due to a train/test gap. Self-supervised methods hold the promise of training directly on real video, but typically perform worse. These include methods trained with warp error (i.e., color constancy) combined with smoothness terms, and methods that encourage cycle-consistency in the estimates (i.e., tracking backwards should yield the opposite trajectory as tracking forwards). In this work, we take on the challenge of improving state-of-the-art supervised models with self-supervised training. We find that when the initialization is supervised weights, most existing self-supervision techniques actually make performance worse instead of better, which suggests that the benefit of seeing the new data is overshadowed by the noise in the training signal. Focusing on obtaining a "clean" training signal from real-world unlabelled video, we propose to separate label-making and training into two distinct stages. In the first stage, we use the pre-trained model to estimate motion in a video, and then select the subset of motion estimates which we can verify with cycle-consistency. This produces a sparse but accurate pseudo-labelling of the video. In the second stage, we fine-tune the model to reproduce these outputs, while also applying augmentations on the input. We complement this boot-strapping method with simple techniques that densify and re-balance the pseudo-labels, ensuring that we do not merely train on "easy" tracks. We show that our method yields reliable gains over fully-supervised methods in real videos, for both short-term (flow-based) and long-range (multi-frame) pixel tracking.


Summary

  • The paper introduces a two-phase refinement strategy that generates reliable pseudo-labels using cycle-consistency checks.
  • It leverages pre-trained models and self-supervised techniques to fine-tune motion estimation on challenging real video data.
  • Experiments on optical flow and multi-frame tracking reveal consistent accuracy improvements over fully supervised methods.

Overview of the Paper

The paper introduces a framework for refining pre-trained motion models so that they better fit real-world video. The need arises because the strongest current motion models are trained on synthetic data and therefore suffer from a train/test domain gap when applied to real videos. While self-supervised techniques show potential for training directly on real video, they usually underperform their supervised counterparts.

Self-Supervised Challenges in Motion Estimation

Self-supervised motion estimation models have historically performed worse than supervised models because they are trained on less precise signals. These models typically rely on color constancy (the assumption that a pixel's color remains consistent between frames) and on smoothness terms that penalize large differences between the motion of neighbouring pixels. Many also encourage cycle-consistency: tracking a point backwards in time should retrace the trajectory obtained by tracking it forwards.
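As a concrete illustration, the minimal PyTorch sketch below (not code from the paper) shows how these three classic self-supervised signals are typically written down: a photometric warp loss, a first-order smoothness loss, and a forward-backward cycle-consistency loss. The warp helper and all tensor conventions are illustrative assumptions.

```python
import torch
import torch.nn.functional as F


def warp(img, flow):
    """Backward-warp img (B,C,H,W) by sampling it at pixel + flow; flow is (B,2,H,W) in pixels."""
    _, _, h, w = img.shape
    ys, xs = torch.meshgrid(
        torch.arange(h, device=img.device, dtype=img.dtype),
        torch.arange(w, device=img.device, dtype=img.dtype),
        indexing="ij",
    )
    grid_x = xs.unsqueeze(0) + flow[:, 0]  # where each pixel lands horizontally
    grid_y = ys.unsqueeze(0) + flow[:, 1]  # where each pixel lands vertically
    # grid_sample expects coordinates normalized to [-1, 1].
    grid = torch.stack(
        (2.0 * grid_x / (w - 1) - 1.0, 2.0 * grid_y / (h - 1) - 1.0), dim=-1
    )
    return F.grid_sample(img, grid, align_corners=True)


def photometric_loss(img1, img2, flow_fw):
    # Color constancy: img1 should match img2 warped back by the forward flow.
    return (img1 - warp(img2, flow_fw)).abs().mean()


def smoothness_loss(flow):
    # Penalize large differences between the flow of neighbouring pixels.
    dx = (flow[:, :, :, 1:] - flow[:, :, :, :-1]).abs().mean()
    dy = (flow[:, :, 1:, :] - flow[:, :, :-1, :]).abs().mean()
    return dx + dy


def cycle_consistency_loss(flow_fw, flow_bw):
    # Following the forward flow and then the backward flow (sampled at the
    # forward target) should return to the start wherever the match is correct.
    bw_at_target = warp(flow_bw, flow_fw)
    return (flow_fw + bw_at_target).abs().mean()
```

All three losses are noisy on real video (occlusions break color constancy, smoothness over-penalizes motion boundaries), which is precisely why the paper finds they can degrade a well-initialized supervised model.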

A Two-Stage Refinement Strategy

The authors propose a two-stage process that begins with pseudo-label generation. They run the pre-trained model on real videos and keep only the subset of motion estimates that can be verified with cycle-consistency. This produces sparse but accurate pseudo-labels, which serve as estimated ground truth. In the second stage, the model is fine-tuned to reproduce these pseudo-labels while augmentations are applied to the input, so it must replicate its own most reliable estimates under harder viewing conditions. Simple additional techniques densify and re-balance the pseudo-labels so that training is not dominated by "easy" tracks.
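Below is a hedged sketch of the two stages, assuming a generic model(frame_a, frame_b) that returns a dense flow field and reusing the warp helper from the sketch above. The cycle_thresh threshold, the augment callable, and the loss weighting are illustrative assumptions rather than details taken from the paper.

```python
import torch


@torch.no_grad()
def make_pseudo_labels(model, frame_a, frame_b, cycle_thresh=1.0):
    """Stage 1: run the frozen pre-trained model forwards and backwards, and
    keep only the flow vectors that pass a cycle-consistency check."""
    flow_fw = model(frame_a, frame_b)                  # (B,2,H,W)
    flow_bw = model(frame_b, frame_a)
    bw_at_target = warp(flow_bw, flow_fw)              # warp() from the sketch above
    cycle_err = (flow_fw + bw_at_target).norm(dim=1)   # round-trip error, (B,H,W)
    valid = cycle_err < cycle_thresh                   # sparse mask of "clean" labels
    return flow_fw, valid


def finetune_step(model, optimizer, frame_a, frame_b, pseudo_flow, valid, augment):
    """Stage 2: fine-tune the model to reproduce its own verified estimates on
    augmented inputs (photometric jitter here; spatial augmentations would also
    require transforming the pseudo-labels accordingly)."""
    pred = model(augment(frame_a), augment(frame_b))
    per_pixel = (pred - pseudo_flow).abs().sum(dim=1)          # L1 error per pixel
    loss = (per_pixel * valid).sum() / valid.sum().clamp(min=1)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

The paper additionally densifies and re-balances the verified labels before fine-tuning; that step is omitted here for brevity.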

Results and Potential Impact

The methodology yields reliable gains in motion prediction accuracy on real videos. Experiments on both an optical flow model and a multi-frame point tracking model show consistent improvements over the fully-supervised baselines. The authors hope their work will stimulate further exploration of pre-trained motion model refinement.

The paper's findings have clear implications for advancing the state of video motion analysis, potentially contributing to numerous applications in surveillance, autonomous systems, and filmmaking, where accurate motion tracking is essential.
