
Autogenic Language Embedding for Coherent Point Tracking (2407.20730v1)

Published 30 Jul 2024 in cs.CV

Abstract: Point tracking is a challenging task in computer vision, aiming to establish point-wise correspondence across long video sequences. Recent advancements have primarily focused on temporal modeling techniques to improve local feature similarity, often overlooking the valuable semantic consistency inherent in tracked points. In this paper, we introduce a novel approach leveraging language embeddings to enhance the coherence of frame-wise visual features related to the same object. Our proposed method, termed autogenic language embedding for visual feature enhancement, strengthens point correspondence in long-term sequences. Unlike existing visual-language schemes, our approach learns text embeddings from visual features through a dedicated mapping network, enabling seamless adaptation to various tracking tasks without explicit text annotations. Additionally, we introduce a consistency decoder that efficiently integrates text tokens into visual features with minimal computational overhead. Through enhanced visual consistency, our approach significantly improves tracking trajectories in lengthy videos with substantial appearance variations. Extensive experiments on widely-used tracking benchmarks demonstrate the superior performance of our method, showcasing notable enhancements compared to trackers relying solely on visual cues.
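
The abstract names two components: a mapping network that derives text embeddings directly from visual features (no text annotations required), and a consistency decoder that fuses those tokens back into the visual stream at low cost. The sketch below shows one plausible reading of that design in PyTorch; all module names, dimensions, and the cross-attention fusion scheme are assumptions for illustration, not the paper's actual architecture.

```python
import torch
import torch.nn as nn

class AutogenicTextMapper(nn.Module):
    """Hypothetical mapping network: projects a pooled frame feature into a
    small set of pseudo text tokens, so no explicit text annotation is needed."""
    def __init__(self, vis_dim: int = 256, txt_dim: int = 512, num_tokens: int = 8):
        super().__init__()
        self.num_tokens, self.txt_dim = num_tokens, txt_dim
        self.proj = nn.Sequential(
            nn.Linear(vis_dim, txt_dim),
            nn.GELU(),
            nn.Linear(txt_dim, num_tokens * txt_dim),
        )

    def forward(self, vis_feat: torch.Tensor) -> torch.Tensor:
        # vis_feat: (B, vis_dim) pooled visual feature for one frame
        tokens = self.proj(vis_feat)  # (B, num_tokens * txt_dim)
        return tokens.view(-1, self.num_tokens, self.txt_dim)

class ConsistencyDecoder(nn.Module):
    """Hypothetical consistency decoder: dense visual tokens cross-attend to
    the generated text tokens, adding semantic context with one attention layer."""
    def __init__(self, vis_dim: int = 256, txt_dim: int = 512, heads: int = 4):
        super().__init__()
        self.txt_proj = nn.Linear(txt_dim, vis_dim)
        self.attn = nn.MultiheadAttention(vis_dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(vis_dim)

    def forward(self, vis_tokens: torch.Tensor, txt_tokens: torch.Tensor) -> torch.Tensor:
        # vis_tokens: (B, N, vis_dim); txt_tokens: (B, T, txt_dim)
        ctx = self.txt_proj(txt_tokens)                # (B, T, vis_dim)
        attended, _ = self.attn(vis_tokens, ctx, ctx)  # queries = visual tokens
        return self.norm(vis_tokens + attended)        # residual fusion

# Usage sketch: enhance per-frame features before computing point correlations.
mapper, decoder = AutogenicTextMapper(), ConsistencyDecoder()
vis_global = torch.randn(2, 256)       # pooled frame feature (B, vis_dim)
vis_tokens = torch.randn(2, 100, 256)  # dense visual tokens (B, N, vis_dim)
enhanced = decoder(vis_tokens, mapper(vis_global))  # (2, 100, 256)
```

The intent, as the abstract describes it, is that the same semantic tokens condition every frame's features, so points on the same object stay close in feature space even under large appearance changes across a long sequence.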
