Motion-similarity retrieval in the wild

Establish methods and video representations that achieve high-accuracy motion-similarity retrieval on unconstrained, real-world videos with unsynchronized motion. The goal is to retrieve clips that share the same underlying motion regardless of appearance, scene context, or timing variation.

Background

The paper introduces SemanticMoments, a training-free representation that summarizes per-frame semantic features with temporal statistics (mean, variance, skewness) so that videos can be retrieved by motion rather than appearance. While the approach performs strongly on controlled synthetic data, the authors emphasize the importance of real-world evaluation.
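The core idea can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes per-frame features of shape (T, D) from some frozen semantic encoder, and pools them into a time-invariant descriptor via the first three temporal moments.

```python
import numpy as np

def moment_descriptor(features: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Pool per-frame semantic features (T, D) into a single descriptor
    by taking the mean, variance, and skewness over the time axis.
    The result has size 3*D and is invariant to frame ordering/timing."""
    mean = features.mean(axis=0)
    var = features.var(axis=0)
    std = np.sqrt(var) + eps  # eps guards against constant dimensions
    skew = (((features - mean) / std) ** 3).mean(axis=0)
    return np.concatenate([mean, var, skew])

# Toy example: 16 frames of 8-dimensional features (hypothetical encoder output).
rng = np.random.default_rng(0)
feats = rng.normal(size=(16, 8))
desc = moment_descriptor(feats)
print(desc.shape)  # (24,)
```

Because the moments are computed over time and then concatenated, the descriptor discards frame-level appearance dynamics' exact timing while retaining their distributional shape, which is what makes the representation robust to unsynchronized motion.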

SimMotion-Real is a human-annotated benchmark composed of unconstrained, in-the-wild videos where positive pairs share perceptually similar motion despite differences in appearance and timing. On this benchmark, all evaluated methods—including SemanticMoments—show relatively low absolute retrieval accuracy, indicating that robust motion-similarity retrieval in real-world settings remains unresolved.
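A retrieval benchmark of this kind is typically scored by nearest-neighbor accuracy over clip descriptors. The sketch below is a generic top-1 retrieval metric under assumed inputs (descriptor vectors plus motion-class labels), not the benchmark's official protocol:

```python
import numpy as np

def top1_retrieval_accuracy(descriptors: np.ndarray, labels: np.ndarray) -> float:
    """For each clip, retrieve its nearest *other* clip by cosine similarity
    and count a hit when the retrieved clip shares the query's motion label."""
    normed = descriptors / np.linalg.norm(descriptors, axis=1, keepdims=True)
    sims = normed @ normed.T
    np.fill_diagonal(sims, -np.inf)  # exclude trivial self-retrieval
    nearest = sims.argmax(axis=1)
    return float((labels[nearest] == labels).mean())

# Toy example: two motion classes, two clips each (hypothetical descriptors).
desc = np.array([[1.0, 0.0], [1.0, 0.1], [0.0, 1.0], [0.1, 1.0]])
labels = np.array([0, 0, 1, 1])
print(top1_retrieval_accuracy(desc, labels))  # 1.0
```

The relatively low absolute numbers reported on SimMotion-Real mean that, under metrics like this one, even the best methods frequently retrieve a clip with a different underlying motion.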

References

Despite these gains, the absolute numbers indicate that motion-similarity retrieval in the wild remains a challenging open problem.

SemanticMoments: Training-Free Motion Similarity via Third Moment Features  (2602.09146 - Huberman et al., 9 Feb 2026) in Subsection: Evaluation on SimMotion-Real (Experiments)