ByteTrackV2: 2D & 3D Multi-Object Tracking
- ByteTrackV2 is a nonparametric multi-object tracking framework that unifies 2D and 3D tracking through hierarchical data association.
- It employs a two-stage matching process using both high- and low-confidence detections to reduce missed objects and maintain track continuity.
- The method reports 56.4% AMOTA with cameras and 70.1% AMOTA with LiDAR on the nuScenes 3D MOT benchmark.
ByteTrackV2 is a nonparametric multi-object tracking (MOT) framework that extends the original ByteTrack to both 2D and 3D object tracking by leveraging a generic hierarchical data association mechanism and a complementary motion prediction strategy for 3D scenarios. This tracker is characterized by its ability to associate every detection box—including those with low confidence scores—in a two-stage matching process, combined with a detector-velocity–aware Kalman filter, and is designed for seamless integration with arbitrary off-the-shelf detectors without the need for retraining. ByteTrackV2 has demonstrated state-of-the-art performance on the nuScenes 3D MOT leaderboard, achieving 56.4% AMOTA with cameras and 70.1% AMOTA with LiDAR (Zhang et al., 2023).
1. Algorithmic Framework and Pipeline
ByteTrackV2 operates under a unified architecture for both 2D and 3D MOT tasks. For each video frame, the tracker processes a set of detection boxes with associated scores (2D) or 3D coordinates and optional velocity outputs (3D). The pipeline contains the following high-level stages:
- Hierarchical Data Association: A two-stage process in which high-score detections are first matched to existing tracks, after which true objects are mined from the lower-confidence detections to reduce missed objects and fragmented trajectories.
- Track Management: Tracks are updated, created, or terminated based on the association results.
- Motion Prediction (3D): Augments the tracker with a Kalman filter that fuses detector-provided velocities with linear prediction, improving robustness to abrupt motion and short-term occlusions.
This approach is nonparametric, with all steps grounded in discrete association and classical filtering methodologies. Only two hyperparameters, the high- and low-score thresholds ($\tau_{\text{high}}$, $\tau_{\text{low}}$), are required, and they can be adapted per detector without network retraining (Zhang et al., 2023).
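The score-based split that feeds the two association stages can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the `Detection` class, the `split_by_score` helper, the boundary handling (`>=`), the discarding of boxes below the low threshold, and the threshold values in the usage lines are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Detection:
    # Hypothetical 2D layout (x1, y1, x2, y2); a 3D detector would carry
    # a 3D box and, optionally, a velocity instead.
    box: Tuple[float, float, float, float]
    score: float


def split_by_score(
    dets: List[Detection], tau_high: float, tau_low: float
) -> Tuple[List[Detection], List[Detection]]:
    """Partition detections into a high-score and a low-score set.

    Assumption: boxes scoring below tau_low are dropped as background.
    """
    high = [d for d in dets if d.score >= tau_high]
    low = [d for d in dets if tau_low <= d.score < tau_high]
    return high, low


# Usage with arbitrary illustrative thresholds (not values from the paper).
detections = [
    Detection((0, 0, 10, 10), 0.90),
    Detection((5, 5, 15, 15), 0.30),
    Detection((40, 40, 50, 50), 0.05),
]
high_dets, low_dets = split_by_score(detections, tau_high=0.5, tau_low=0.2)
```

Because the thresholds are plain arguments, swapping detectors only requires passing different values, which mirrors the point that no retraining is involved.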
2. Hierarchical Data Association
The central contribution of ByteTrackV2 is its hierarchical data association strategy, which mines both high- and low-score detection boxes to maximize tracking recall and maintain track integrity:
- Stage 1: Detections whose confidence $s$ reaches the high-score threshold ($s \geq \tau_{\text{high}}$) are associated to existing tracks using the Hungarian algorithm. Matching costs are based on box overlap (IoU) for 2D or on the distance in BEV coordinates for 3D.
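- Stage 2: The remaining unmatched tracks are matched against the low-score detections, those with confidence $s$ between the two thresholds ($\tau_{\text{low}} \leq s < \tau_{\text{high}}$), to recover additional objects.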
- Track Update and Management: Tracks are updated with their associated detections, new tracks are created for unmatched high-score detections, and tracks with no match in either stage are terminated.
The two thresholds, $\tau_{\text{high}}$ and $\tau_{\text{low}}$, bound the high- and low-score detection sets, respectively. The hierarchical mechanism has a significant impact: ablation studies show that removing Stage 2 decreases AMOTA by approximately 5 points (Zhang et al., 2023).
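The two-stage association described above can be sketched in a few lines for the 2D case. Several things here are assumptions made for illustration: the box layout `(x1, y1, x2, y2)`, the `1 - IoU` cost, the `max_cost` gate, and all function names. In particular, a greedy lowest-cost-first matcher is swapped in for the Hungarian algorithm so that the sketch needs no third-party solver; it is not guaranteed to find the globally optimal assignment.

```python
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # assumed layout: (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def greedy_match(tracks: List[Box], dets: List[Box], max_cost: float):
    """Greedy stand-in for Hungarian matching on a 1 - IoU cost.

    Returns (matches, unmatched_track_indices, unmatched_det_indices).
    """
    pairs = sorted(
        (1.0 - iou(t, d), ti, di)
        for ti, t in enumerate(tracks)
        for di, d in enumerate(dets)
    )
    matches, used_t, used_d = [], set(), set()
    for cost, ti, di in pairs:
        if cost > max_cost:  # hypothetical gate: reject poor overlaps
            break
        if ti in used_t or di in used_d:
            continue
        matches.append((ti, di))
        used_t.add(ti)
        used_d.add(di)
    un_t = [i for i in range(len(tracks)) if i not in used_t]
    un_d = [i for i in range(len(dets)) if i not in used_d]
    return matches, un_t, un_d


def associate(
    tracks: List[Box], high: List[Box], low: List[Box], max_cost: float = 0.7
) -> Dict[str, list]:
    # Stage 1: all tracks against high-score detections.
    m1, un_t, un_high = greedy_match(tracks, high, max_cost)
    # Stage 2: only the still-unmatched tracks against low-score detections.
    remaining = [tracks[i] for i in un_t]
    m2_local, un_t2, _ = greedy_match(remaining, low, max_cost)
    m2 = [(un_t[ti], di) for ti, di in m2_local]  # map back to track indices
    return {
        "stage1": m1,                              # (track, high-det) pairs
        "stage2": m2,                              # (track, low-det) pairs
        "new_tracks": un_high,                     # unmatched high-score dets
        "unmatched_tracks": [un_t[i] for i in un_t2],  # termination candidates
    }
```

Unmatched low-score detections are deliberately ignored in this sketch, since the text creates new tracks only from unmatched high-score detections.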
3. Complementary Motion Prediction for 3D Tracking
For 3D MOT, ByteTrackV2 incorporates a complementary motion prediction strategy in which the Kalman filter state includes both position and velocity in world coordinates.
- Prediction: The transition model integrates velocity into position over the time step $\Delta t$. The motion model is standard; the direct velocity input from the detector makes it possible to model process and measurement noise more accurately.
- Detector Velocity Fusion: Detected velocity is included in the measurement vector $\mathbf{z}$, allowing the tracker's state to be corrected by observed instantaneous motion. This improves recovery from abrupt direction changes and short-term occlusion relative to linear filtering alone.
- Measurement Update: Per standard Kalman filtering equations, track state updates optimally combine prediction and measured detector outputs.
Disabling velocity fusion leads to a ≈3 point drop in AMOTA (Zhang et al., 2023). This implementation is generic and can fuse velocities when provided by any 3D detector.
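The effect of placing the detector velocity in the measurement vector can be illustrated with a one-axis constant-velocity Kalman filter. This is a sketch under stated assumptions, not the paper's filter: the state is reduced to `[position, velocity]` for a single axis, the measurement matrix is the identity because both quantities are observed, and the noise values `q`, `r_pos`, and `r_vel` are arbitrary.

```python
from typing import List, Tuple

Mat = List[List[float]]


def matmul(a: Mat, b: Mat) -> Mat:
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def add(a: Mat, b: Mat) -> Mat:
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def inv2(m: Mat) -> Mat:
    """Inverse of a 2x2 matrix."""
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    return [[m[1][1] / det, -m[0][1] / det], [-m[1][0] / det, m[0][0] / det]]


def predict(x: List[float], P: Mat, dt: float, q: float) -> Tuple[List[float], Mat]:
    """Constant-velocity prediction: position advances by velocity * dt."""
    F = [[1.0, dt], [0.0, 1.0]]
    Ft = [[1.0, 0.0], [dt, 1.0]]
    x_pred = [x[0] + dt * x[1], x[1]]
    P_pred = add(matmul(matmul(F, P), Ft), [[q, 0.0], [0.0, q]])
    return x_pred, P_pred


def update(
    x: List[float], P: Mat, z: List[float], r_pos: float, r_vel: float
) -> Tuple[List[float], Mat]:
    """Measurement update with z = [detected position, detected velocity].

    Assumption: H is the identity, so S = P + R and K = P * S^-1.
    """
    S = add(P, [[r_pos, 0.0], [0.0, r_vel]])
    K = matmul(P, inv2(S))
    innov = [z[0] - x[0], z[1] - x[1]]
    x_new = [
        x[0] + K[0][0] * innov[0] + K[0][1] * innov[1],
        x[1] + K[1][0] * innov[0] + K[1][1] * innov[1],
    ]
    I_minus_K = [[1.0 - K[0][0], -K[0][1]], [-K[1][0], 1.0 - K[1][1]]]
    return x_new, matmul(I_minus_K, P)


# Usage: the object abruptly speeds up from 1.0 to 3.0 units per second.
x0, P0 = [0.0, 1.0], [[1.0, 0.0], [0.0, 1.0]]
x_pred, P_pred = predict(x0, P0, dt=0.5, q=0.01)
x_new, P_new = update(x_pred, P_pred, z=[0.5, 3.0], r_pos=0.1, r_vel=0.1)
```

In this example the position measurement agrees with the prediction, so a position-only filter would see nearly zero innovation and keep the stale velocity; with velocity in the measurement, the estimate moves most of the way toward the detected value in a single update.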
4. Empirical Performance and Comparative Evaluation
As reported by Zhang et al. (2023), ByteTrackV2 ranks first on the nuScenes 3D MOT leaderboard for both camera-based and LiDAR-based modalities:
| Modality | AMOTA (%) |
|---|---|
| Camera-based | 56.4 |
| LiDAR-based | 70.1 |
Relative to prior baselines such as AB3DMOT and GNN3DMOT, ByteTrackV2 consistently demonstrates superior performance when evaluated under identical detection inputs. The effect of hierarchical matching and the complementary fusion module is validated through ablation experiments, indicating the importance of each component in the pipeline (Zhang et al., 2023).
5. Design Properties and Practical Integration
ByteTrackV2 exhibits a set of distinct design properties:
- Nonparametric Tracking: The algorithm contains no learned parameters in its association or motion modules; all logic is deterministic and grounded in observed detection outputs.
- Detector-Agnostic: Compatible with any off-the-shelf 2D or 3D object detector, since tracker operations require only bounding boxes, scores, and (optionally) velocities.
- Minimal Hyperparameters: Only two thresholds ($\tau_{\text{high}}$, $\tau_{\text{low}}$) govern all association logic; hyperparameter tuning does not require retraining of any component.
- Efficiency: The matching process (Hungarian algorithm) and Kalman filter updates run efficiently in practice ($O(n^3)$ worst case for the matching step, where $n$ is the number of objects per MOT frame, which is typically modest).
A plausible implication is that ByteTrackV2 is well-suited for real-time applications and deployments where detectors may be swapped or retrained independently of the tracker (Zhang et al., 2023).
6. Synthesis and Significance
ByteTrackV2 unifies the task of 2D and 3D MOT through a simple yet effective combination of hierarchical detection association and velocity-aware motion prediction. Recovering missed objects via low-score mining reduces track fragmentation, while detector-velocity fusion in a Kalman filter framework ensures robust trajectory estimation in dynamic scenes. As a nonparametric, detector-agnostic method demonstrating top performance on standard benchmarks, ByteTrackV2 constitutes a practical reference architecture for multi-object tracking research and deployment (Zhang et al., 2023).