Space-Time Correspondence as a Contrastive Random Walk: An Overview
The paper "Space-Time Correspondence as a Contrastive Random Walk" presents an approach to self-supervised learning of visual representations from raw video. The authors cast temporal correspondence as link prediction on a space-time graph whose nodes are image patches drawn from the frames of a video and whose edges carry affinities computed from learned features. The main innovation is a contrastive random walk that discovers and reinforces temporal correspondences without requiring labeled data.
Core Methodology
The proposed method models a video as a directed graph in which nodes are image patches and edges connect nodes in temporally adjacent frames. The goal is to learn representations under which a random walk on this graph is likely to follow long-range temporal correspondences.
- Graph Construction: Each frame contributes a set of nodes obtained by sampling image patches. Node representations are produced by a convolutional encoder as l2-normalized d-dimensional vectors. Pairwise similarities between nodes in consecutive frames, normalized with a softmax, yield a stochastic affinity matrix that defines the directed edges of the graph.
- Learning Objective: The task is to maximize the probability that a random walker, after traversing a palindrome sequence (a clip concatenated with its temporal reverse), returns to its starting node. This cycle-consistency provides implicit supervision, yielding a strong self-supervised learning signal without labels.
- Contrastive Loss and Adaptation: A contrastive loss rewards walks that return to their starting node, implicitly supervising the intermediate correspondences along the path. Edge dropout and test-time adaptation are also explored to improve the robustness of the learned representation.
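The core training loop described above can be sketched in a few lines. This is a minimal NumPy illustration, not the authors' implementation: the temperature value, toy dimensions, and function names are assumptions made for the example, and a real system would encode patches with a convolutional network and backpropagate through the loss.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def transition(q, k, temperature=0.07):
    """Stochastic affinity from frame t's nodes (q) to frame t+1's nodes (k).

    q, k: (n, d) l2-normalized node embeddings. Each row of the result is a
    softmax over similarities, so rows sum to 1 (a stochastic matrix)."""
    return softmax(q @ k.T / temperature, axis=1)

def palindrome_walk_loss(frames, temperature=0.07):
    """Cycle-consistency loss: a walker steps through the frames and then
    back again (palindrome), and should end where it began.

    frames: list of (n, d) l2-normalized embeddings, one per frame."""
    seq = frames + frames[-2::-1]          # forward pass, then its reverse
    walk = np.eye(seq[0].shape[0])         # start one walker at each node
    for a, b in zip(seq[:-1], seq[1:]):
        walk = walk @ transition(a, b, temperature)
    # Target: walker i returns to node i, i.e. the diagonal of `walk`.
    return -np.log(np.diag(walk) + 1e-12).mean()
```

Note that only the endpoints are supervised; the intermediate transition matrices are shaped indirectly, which is how the method learns correspondences at every step without labels.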
Experimental Evaluation
The learned representation outperforms prior self-supervised methods on a range of label propagation tasks, including video object segmentation, human pose tracking, and semantic part segmentation. Results are particularly strong for video object segmentation on the DAVIS 2017 dataset, where the model is competitive even with some fully-supervised approaches.
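These label propagation evaluations use the learned features directly: labels from an annotated frame are pushed to the next frame via feature affinity used as attention. The sketch below is a simplified, hypothetical version of that protocol; the `topk` restriction mirrors the common practice of attending only to the most similar source nodes, and the parameter values are assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def propagate_labels(feat_src, feat_tgt, labels_src, temperature=0.07, topk=3):
    """Push per-node label distributions from a source frame to a target frame.

    feat_src, feat_tgt: (n, d) l2-normalized node embeddings.
    labels_src: (n, n_classes) label distributions for the source nodes."""
    sim = feat_tgt @ feat_src.T / temperature       # (n_tgt, n_src)
    # Keep only the top-k most similar source nodes per target node.
    if topk < sim.shape[1]:
        thresh = np.sort(sim, axis=1)[:, -topk][:, None]
        sim = np.where(sim >= thresh, sim, -np.inf)
    attn = softmax(sim, axis=1)                      # rows sum to 1
    return attn @ labels_src                         # (n_tgt, n_classes)
```

Because no task-specific head is trained, performance on these benchmarks directly measures the quality of the learned correspondences.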
Implications and Future Prospects
The implications of this work extend to several domains where temporal correspondence matters, such as object tracking, dynamic scene understanding, and action recognition. By formulating temporal correspondence as pathfinding on a graph, the research establishes a generic framework for learning robust visual features without reliance on explicit labels.
Looking ahead, further exploration into multi-hop or long-range temporal semantic relationships could be beneficial. Incorporating advancements in graph attention mechanisms or integrating additional modalities like audio or depth could extend the capabilities of the model. Additionally, refining the pretext task to capture more granular temporal dynamics may offer further enhancements in performance.
Conclusion
"Space-Time Correspondence as a Contrastive Random Walk" contributes a methodologically sound, self-supervised technique for learning visual representations, addressing the complexities inherent in temporal correspondence. By leveraging contrastive learning within a graph-theoretic framework, it opens new avenues for efficiently handling video data and extracting semantically meaningful features. Future work could build on this foundation to tackle even more intricate problems in dynamic visual environments.