DELTA: Dense Efficient Long-range 3D Tracking for any video (2410.24211v3)
Abstract: Tracking dense 3D motion from monocular videos remains challenging, particularly when aiming for pixel-level precision over long sequences. We introduce DELTA, a novel method that efficiently tracks every pixel in 3D space, enabling accurate motion estimation across entire videos. Our approach leverages a joint global-local attention mechanism for reduced-resolution tracking, followed by a transformer-based upsampler to achieve high-resolution predictions. Unlike existing methods, which are limited by computational inefficiency or sparse tracking, DELTA delivers dense 3D tracking at scale, running over 8x faster than previous methods while achieving state-of-the-art accuracy. Furthermore, we explore the impact of depth representation on tracking performance and identify log-depth as the optimal choice. Extensive experiments demonstrate the superiority of DELTA on multiple benchmarks, achieving new state-of-the-art results in both 2D and 3D dense tracking tasks. Our method provides a robust solution for applications requiring fine-grained, long-term motion tracking in 3D space.
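The abstract's finding that log-depth outperforms other depth representations can be illustrated with a minimal sketch. This is not DELTA's actual implementation (the paper's exact normalization is unknown here); it only shows the standard log mapping and why it helps: a fixed additive error in log-depth corresponds to a *relative* error in metric depth, balancing precision between near and far points.

```python
import numpy as np

def depth_to_log(depth, eps=1e-6):
    """Map metric depth to log-depth; compresses the far range."""
    return np.log(np.maximum(depth, eps))

def log_to_depth(log_depth):
    """Invert the log mapping back to metric depth."""
    return np.exp(log_depth)

# The same additive perturbation in log space (e.g. a network's
# prediction error) yields the same *percentage* error at every depth.
d = np.array([0.5, 2.0, 10.0, 50.0])          # depths in meters
perturbed = log_to_depth(depth_to_log(d) + 0.01)
relative_error = perturbed / d - 1.0           # identical for all entries
assert np.allclose(relative_error, relative_error[0])
```

Under a plain linear depth parameterization, the same additive error would instead cost far points a much larger relative error, which is one plausible reason a log representation suits long-range 3D tracking.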