DELTA: Dense Efficient Long-range 3D Tracking for any video (2410.24211v3)

Published 31 Oct 2024 in cs.CV

Abstract: Tracking dense 3D motion from monocular videos remains challenging, particularly when aiming for pixel-level precision over long sequences. We introduce DELTA, a novel method that efficiently tracks every pixel in 3D space, enabling accurate motion estimation across entire videos. Our approach leverages a joint global-local attention mechanism for reduced-resolution tracking, followed by a transformer-based upsampler to achieve high-resolution predictions. Unlike existing methods, which are limited by computational inefficiency or sparse tracking, DELTA delivers dense 3D tracking at scale, running over 8x faster than previous methods while achieving state-of-the-art accuracy. Furthermore, we explore the impact of depth representation on tracking performance and identify log-depth as the optimal choice. Extensive experiments demonstrate the superiority of DELTA on multiple benchmarks, achieving new state-of-the-art results in both 2D and 3D dense tracking tasks. Our method provides a robust solution for applications requiring fine-grained, long-term motion tracking in 3D space.


Summary

  • The paper introduces DELTA, a novel method for dense, pixel-level 3D tracking in monocular videos that enhances both efficiency and accuracy.
  • It uses a coarse-to-fine strategy with global-local spatio-temporal attention and an attention-based transformer upsampler for refined predictions.
  • Experiments demonstrate over a 10% accuracy improvement and an 8x speed increase on benchmarks like CVO and Kubric3D compared to existing methods.

DELTA: Dense Efficient Long-range 3D Tracking for Monocular Videos

The paper introduces DELTA, a novel approach to dense 3D tracking in monocular video sequences. The method establishes pixel-level correspondences in 3D space while maintaining both accuracy and computational efficiency over long sequences. DELTA addresses the inherent limitations of previous approaches by combining global-local spatio-temporal attention mechanisms with a carefully chosen depth representation.

Overview and Methodology

DELTA is designed to track every pixel in a video efficiently, relying on a coarse-to-fine strategy for dense motion prediction. The process begins with a global-local attention mechanism operating at reduced resolution, followed by a transformer-based upsampler that refines predictions to high resolution. Much of DELTA's efficiency comes from its spatial attention design, which attends to a sparse set of anchor tracks to capture global structure more economically than existing methods. This allows dense tracking to be trained and deployed efficiently without sacrificing the granularity of the spatial representation.
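
To make the coarse-to-fine design concrete, the sketch below outlines the overall computation in PyTorch. All module and argument names (CoarseToFineTracker, anchor_feats, fine_queries, and so on) are illustrative assumptions rather than the authors' implementation; the intent is only to show reduced-resolution tracking with local attention plus global attention over sparse anchor tracks, followed by an attention-based upsampler.

```python
# Structural sketch only: hypothetical names and shapes, not the authors' code.
import torch
import torch.nn as nn

class CoarseToFineTracker(nn.Module):
    def __init__(self, dim=256, heads=8):
        super().__init__()
        # local attention within a spatial neighborhood of each coarse track
        self.local_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        # global attention against a sparse set of anchor tracks
        self.global_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        # attention-based upsampler in place of convolutional upsampling
        self.upsample_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.head = nn.Linear(dim, 3)  # per-track update: (u, v, log-depth)

    def forward(self, coarse_feats, anchor_feats, fine_queries):
        # coarse_feats:  (B, N_coarse, C) features of all tracks at reduced resolution
        # anchor_feats:  (B, N_anchor, C) features of sparse anchor tracks
        # fine_queries:  (B, N_fine,   C) per-pixel queries at full resolution
        x, _ = self.local_attn(coarse_feats, coarse_feats, coarse_feats)
        x, _ = self.global_attn(x, anchor_feats, anchor_feats)
        coarse_tracks = self.head(x)                    # dense tracks at low resolution
        up, _ = self.upsample_attn(fine_queries, x, x)  # each pixel attends to coarse tokens
        fine_tracks = self.head(up)                     # dense tracks at full resolution
        return coarse_tracks, fine_tracks


# Toy usage: a 32x24 coarse grid, 64 anchor tracks, and a 64x48 fine grid.
model = CoarseToFineTracker()
coarse = torch.randn(1, 32 * 24, 256)
anchors = torch.randn(1, 64, 256)
fine = torch.randn(1, 64 * 48, 256)
coarse_tracks, fine_tracks = model(coarse, anchors, fine)
```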

The upsampler employs an attention-based design rather than traditional convolutional upsampling, yielding better temporal coherence and accuracy. Another pivotal contribution is the empirical study of how depth representation affects tracking, which identifies log-depth as the optimal choice. This representation allocates greater precision to nearby regions, matching the behavior of monocular depth estimation methods, which are generally more reliable at close range.
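
As a rough illustration of why log-depth helps, the snippet below shows how the parameterization allocates more resolution to nearby depths. This is an assumed, simplified form; the exact normalization used in the paper may differ.

```python
# Illustrative log-depth parameterization (assumed form, not the paper's exact one).
import torch

def depth_to_log(depth, eps=1e-6):
    # Compresses far depths so equal steps in log space correspond to smaller
    # metric steps nearby, where monocular depth estimates are most reliable.
    return torch.log(depth.clamp(min=eps))

def log_to_depth(log_depth):
    return torch.exp(log_depth)

d = torch.tensor([0.5, 5.0, 50.0])   # depths in meters
print(depth_to_log(d))               # tensor([-0.6931,  1.6094,  3.9120])
# A fixed error of 0.1 in log space maps to roughly 0.05 m at 0.5 m but about
# 5 m at 50 m, so losses defined on log-depth weight nearby structure more heavily.
```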

Results and Contributions

The evaluation of DELTA on benchmarks for both 2D and 3D dense tracking highlights its superiority over previous state-of-the-art methods. The experiments show accuracy improvements of more than 10% on metrics such as Average Jaccard (AJ) and APD_3D, on datasets like CVO and Kubric3D. Furthermore, DELTA runs more than eight times faster than existing methods in equivalent settings, processing a 100-frame video in roughly two minutes.
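
For readers unfamiliar with these metrics, the sketch below computes a simplified threshold-based 3D accuracy in the spirit of APD_3D. The thresholds, scaling, and occlusion handling used by the actual benchmarks differ, so treat this as an assumed illustration rather than official evaluation code.

```python
# Simplified, assumed 3D tracking accuracy metric: fraction of visible points whose
# predicted 3D position falls within a distance threshold, averaged over thresholds.
import torch

def percent_within_3d(pred, gt, visible, thresholds=(0.01, 0.02, 0.04, 0.08, 0.16)):
    # pred, gt: (T, N, 3) predicted and ground-truth 3D point positions
    # visible:  (T, N) boolean visibility mask
    err = torch.linalg.norm(pred - gt, dim=-1)            # per-point Euclidean error
    scores = []
    for thr in thresholds:
        within = (err < thr) & visible
        scores.append(within.sum() / visible.sum().clamp(min=1))
    return torch.stack(scores).mean()                     # average over thresholds

# Toy usage with random tracks:
pred = torch.randn(24, 100, 3)
gt = pred + 0.01 * torch.randn(24, 100, 3)
vis = torch.ones(24, 100, dtype=torch.bool)
print(percent_within_3d(pred, gt, vis))
```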

These results suggest that DELTA robustly handles occlusions, camera motion, and dynamic scene changes, which are key challenges for video tracking methods. The paper's exploration of depth representation provides empirical support for the choice of log-depth, highlighting how strongly such design decisions influence tracking performance.

Implications and Future Directions

DELTA's advancements suggest significant potential impact on applications requiring detailed motion analysis, such as autonomous driving, augmented reality, and video content creation. Its ability to provide pixel-accurate 3D tracking over long sequences offers a valuable tool for these areas, where an accurate understanding of object motion and depth is integral.

The architectural innovations in DELTA, particularly the sparse anchor track attention and attention-based upsampling, present a promising direction for future studies in efficient 3D video tracking. As depth estimation technology continues to evolve, particularly with advances in monocular methods, integrating these innovations could further improve accuracy and efficiency.

Further research examining the robustness of DELTA across a broader spectrum of video environments, including varied lighting conditions and video styles, would provide additional insight into its adaptability and potential enhancements. The scalability of DELTA to real-time applications also warrants exploration, given the growing demand for seamless interaction in digital environments.

In conclusion, DELTA marks a substantial advancement in the pursuit of efficient, accurate long-range 3D tracking, setting the groundwork for future explorations in the field of computer vision.