ADA-Track++: End-to-End Multi-Camera 3D Multi-Object Tracking with Alternating Detection and Association (2405.08909v2)
Abstract: Many query-based approaches for 3D Multi-Object Tracking (MOT) adopt the tracking-by-attention paradigm, utilizing track queries for identity-consistent detection and object queries for identity-agnostic track spawning. Tracking-by-attention, however, entangles detection and tracking queries in one embedding for both the detection and tracking tasks, which is sub-optimal. Other approaches resemble the tracking-by-detection paradigm and detect objects using decoupled track and detection queries, followed by an association step. These methods, however, do not leverage synergies between the detection and association tasks. Combining the strengths of both paradigms, we introduce ADA-Track++, a novel end-to-end framework for 3D MOT from multi-view cameras. We introduce a learnable data association module based on edge-augmented cross-attention, leveraging appearance and geometric features. We also propose an auxiliary token in this attention-based association module, which helps mitigate disproportionately high attention to incorrect association targets caused by attention normalization. Furthermore, we integrate this association module into the decoder layer of a DETR-based 3D detector, enabling simultaneous DETR-like query-to-image cross-attention for detection and query-to-query cross-attention for data association. By stacking these decoder layers, queries are refined alternately for the detection and association tasks, effectively harnessing the task dependencies. We evaluate our method on the nuScenes dataset and demonstrate the advantage of our approach compared to the two previous paradigms.
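The effect of attention normalization that the auxiliary token addresses can be illustrated with a small NumPy sketch. This is not the ADA-Track++ implementation: the function names, the additive `edge_bias` term standing in for edge features, and the fixed `aux_logit` value are all assumptions made for illustration (in a trained model the auxiliary token would presumably be learned). The sketch only shows that a softmax over detections forces every track to distribute a total weight of 1 over the detections, even when none of them fits, and that one extra column gives this weight somewhere else to go.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def association_attention(track_q, det_k, edge_bias, aux_logit=None):
    """Hypothetical track-to-detection attention weights.

    track_q:   (T, d) track query features
    det_k:     (D, d) detection key features
    edge_bias: (T, D) additive logit term standing in for edge features
               (an illustrative simplification, not the paper's formulation)
    aux_logit: if given, one extra column with this logit is appended and
               acts as an auxiliary "no match" target (assumed constant here)
    """
    d = track_q.shape[-1]
    logits = track_q @ det_k.T / np.sqrt(d) + edge_bias  # shape (T, D)
    if aux_logit is not None:
        aux = np.full((logits.shape[0], 1), float(aux_logit))
        logits = np.concatenate([logits, aux], axis=1)   # shape (T, D + 1)
    return softmax(logits, axis=1)


# Toy data: track 0 matches detection 0; track 1 matches no detection.
track_q = np.array([[1.0, 0.0], [0.0, 1.0]])
det_k = np.array([[1.0, 0.0], [1.0, 0.0]])
edge_bias = np.array([[0.0, -5.0], [-5.0, -5.0]])

plain = association_attention(track_q, det_k, edge_bias)
with_aux = association_attention(track_q, det_k, edge_bias, aux_logit=0.0)

print(plain[1])     # unmatched track still splits all its weight over detections
print(with_aux[1])  # most of the weight now falls on the auxiliary column
```

In this toy case the unmatched track assigns equal weight to both detections without the auxiliary column, although both logits are strongly negative. With the column added, nearly all of its weight moves to the auxiliary entry, while the matched track still prefers its detection.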