
MR2-ByteTrack: Multi-Rescored MOT Framework

Updated 8 January 2026
  • The paper introduces a novel multi-resolution rescored approach that enhances tracking accuracy using hierarchical two-stage association and explicit motion fusion.
  • It leverages a nonparametric, detector-agnostic design with minimal hyperparameter tuning to ensure efficient real-time performance in both 2D and 3D applications.
  • Empirical results on benchmarks like nuScenes show significant improvements in AMOTA and track recovery, underscoring its practical impact.

ByteTrackV2 is a nonparametric multi-object tracking (MOT) framework that generalizes the ByteTrack philosophy to both 2D and 3D tracking tasks by associating every detection box, regardless of detection score. Employing a hierarchical two-stage association and an explicit motion fusion module, ByteTrackV2 maximizes track recovery and identity consistency even in the presence of score fluctuations, fragmented trajectories, and occlusions. The tracker is designed to be detector-agnostic, requiring no retraining or learnable parameters, and demonstrates state-of-the-art performance on benchmarks such as nuScenes in both camera and LiDAR modalities (Zhang et al., 2023).

1. General Pipeline Overview

ByteTrackV2 processes input streams of detection boxes, relying on their bounding box coordinates, detection scores, and (in 3D) velocity vectors. For 2D MOT, it uses $\{d_i\}_{i=1}^N$ (boxes with scores $s_i$), while for 3D MOT it leverages $\{(c_i, s_i, v_i)\}$, with center coordinates $c_i=(x_i, y_i, z_i)$, score $s_i$, and detected velocity $v_i=(v_{x,i}, v_{y,i}, v_{z,i})$ when available. The pipeline comprises hierarchical data association followed by track management, with an additional complementary motion-prediction stage for 3D scenarios. Detector outputs are treated as black-box input; ByteTrackV2 requires only two hyperparameters (score thresholds $s_h$, $s_l$) and supports efficient runtime operation suitable for real-time applications.

2. Hierarchical Data Association Mechanism

ByteTrackV2 employs a two-stage association paradigm using score thresholds $s_h$ and $s_l$ ($s_h > s_l$). High-score detection boxes ($\mathcal{D}_h$) are first matched to existing tracks through assignment cost minimization:

  • In 2D, matching cost is $1 - \operatorname{IoU}(B_t, B_d)$, where $B_t$ and $B_d$ are track and detection boxes.
  • In 3D, cost is $\lVert c_t - c_d \rVert_2$ (Euclidean distance of box centers), computed either in bird’s-eye-view (BEV) or full 3D.
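The two costs above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names, the `(x1, y1, x2, y2)` box layout, and the choice to treat BEV as "drop the z coordinate" are all assumptions made for the example.

```python
import math

def iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2) (assumed layout)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def cost_2d(track_box, det_box):
    """2D association cost: 1 - IoU(B_t, B_d)."""
    return 1.0 - iou(track_box, det_box)

def cost_3d(track_center, det_center, bev=False):
    """3D association cost: Euclidean distance between box centers.
    With bev=True only (x, y) are compared (assumes z is the height axis)."""
    dims = 2 if bev else 3
    return math.sqrt(sum((t - d) ** 2
                         for t, d in zip(track_center[:dims], det_center[:dims])))
```

Lower cost means a better match in both cases, so the same assignment solver can be reused across the 2D and 3D settings.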

Tracks left unmatched after Stage 1 undergo a second round of Hungarian matching against low-score detections ($\mathcal{D}_l$), which recovers true positives that would otherwise be missed because of unstable detection scores. Track management modules then update, initiate, or terminate tracks based on the association results. This process is formalized in the hierarchical matching pseudocode:

Input: Tracks T, Detections D, thresholds s_h, s_l
Split D into D_h (s >= s_h), D_l (s_l <= s < s_h)
Stage 1: match_1 = Hungarian(T, D_h, cost)
Update tracks with match_1
Stage 2: match_2 = Hungarian(unmatched in T, D_l, cost)
Update tracks with match_2
Create new tracks for unmatched D_h
Terminate tracks matched in neither stage
Return updated tracks
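The pseudocode can be turned into a small runnable sketch. What follows is one possible reading, under stated assumptions: the dictionary layout of tracks and detections, the helper names `hungarian`, `assign`, and `associate`, the center-distance cost, and the `max_cost` gate that rejects implausible pairs are all hypothetical and do not come from the source. The solver is a textbook potentials-based Hungarian algorithm written in plain Python so the block has no third-party dependencies.

```python
import math

def hungarian(cost):
    """Min-cost assignment for an n x m cost matrix with n <= m.
    Returns (row, col) pairs covering every row."""
    n, m = len(cost), len(cost[0])
    INF = float("inf")
    u, v = [0.0] * (n + 1), [0.0] * (m + 1)   # row / column potentials
    p, way = [0] * (m + 1), [0] * (m + 1)     # p[j]: row matched to column j (1-based)
    for i in range(1, n + 1):
        p[0], j0 = i, 0
        minv, used = [INF] * (m + 1), [False] * (m + 1)
        while True:
            used[j0] = True
            i0, delta, j1 = p[j0], INF, 0
            for j in range(1, m + 1):
                if not used[j]:
                    cur = cost[i0 - 1][j - 1] - u[i0] - v[j]
                    if cur < minv[j]:
                        minv[j], way[j] = cur, j0
                    if minv[j] < delta:
                        delta, j1 = minv[j], j
            for j in range(m + 1):
                if used[j]:
                    u[p[j]] += delta
                    v[j] -= delta
                else:
                    minv[j] -= delta
            j0 = j1
            if p[j0] == 0:
                break
        while j0:                              # flip the augmenting path
            p[j0] = p[way[j0]]
            j0 = way[j0]
    return [(p[j] - 1, j - 1) for j in range(1, m + 1) if p[j]]

def assign(cost, max_cost):
    """Optimal assignment, then drop pairs costlier than max_cost (hypothetical gate)."""
    if not cost or not cost[0]:
        return []
    if len(cost) <= len(cost[0]):
        pairs = hungarian(cost)
    else:  # more rows than columns: solve the transposed problem
        pairs = [(r, c) for c, r in hungarian([list(col) for col in zip(*cost)])]
    return [(r, c) for r, c in pairs if cost[r][c] <= max_cost]

def associate(tracks, dets, s_h=0.6, s_l=0.1, max_cost=2.0):
    """Two-stage association; returns (matches, new track ids, unmatched track ids)."""
    cost_fn = lambda t, d: math.dist(t["center"], d["center"])
    high = [d for d in dets if d["score"] >= s_h]
    low = [d for d in dets if s_l <= d["score"] < s_h]   # scores below s_l are discarded
    m1 = assign([[cost_fn(t, d) for d in high] for t in tracks], max_cost)   # Stage 1
    rest = [t for i, t in enumerate(tracks) if i not in {r for r, _ in m1}]
    m2 = assign([[cost_fn(t, d) for d in low] for t in rest], max_cost)      # Stage 2
    matches = [(tracks[r]["id"], high[c]["id"]) for r, c in m1]
    matches += [(rest[r]["id"], low[c]["id"]) for r, c in m2]
    new = [d["id"] for j, d in enumerate(high) if j not in {c for _, c in m1}]
    unmatched = [t["id"] for i, t in enumerate(rest) if i not in {r for r, _ in m2}]
    return matches, new, unmatched

tracks = [{"id": "t0", "center": (0.0, 0.0)},
          {"id": "t1", "center": (10.0, 0.0)},
          {"id": "t2", "center": (50.0, 50.0)}]
dets = [{"id": "a", "center": (0.5, 0.0), "score": 0.9},     # high score, near t0
        {"id": "b", "center": (10.2, 0.0), "score": 0.3},    # low score, near t1
        {"id": "c", "center": (100.0, 100.0), "score": 0.8}, # high score, far from all tracks
        {"id": "d", "center": (20.0, 20.0), "score": 0.05}]  # below s_l
matches, new_tracks, unmatched = associate(tracks, dets)
print(matches, new_tracks, unmatched)
```

On this toy input, `t0` is matched to the high-score detection `a` in Stage 1, `t1` is recovered by the low-score detection `b` in Stage 2, `c` starts a new track, and `t2` is left unmatched. Applying the gate after solving the assignment is a simplification; a production tracker might instead exclude over-threshold pairs before solving.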

Stage 2 association is empirically shown to be vital: eliminating it drops nuScenes AMOTA by approximately 5 points.

3. Complementary Motion Prediction in 3D MOT

The 3D tracking component augments motion modeling with detector-inferred velocities. The state vector is $\mathbf{x}_t = [x_t, y_t, z_t, v_{x,t}, v_{y,t}, v_{z,t}]^{\top}$, expressing position and velocity in world coordinates. Prediction uses a standard linear Kalman filter:

$$\hat{\mathbf{x}}_t = F\mathbf{x}_{t-1},\quad \hat{P}_t = F P_{t-1} F^\top + Q$$

with state transition $F$, process noise $Q$, and time-step $\Delta t$. Measurement updates incorporate both positions and velocities provided by the detector:

$$\mathbf{z}_t = [x^{\text{det}}_t, y^{\text{det}}_t, z^{\text{det}}_t, v^{\text{det}}_{x,t}, v^{\text{det}}_{y,t}, v^{\text{det}}_{z,t}]^{\top}$$
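The text names the state transition and the time-step without writing the matrices out. Assuming a constant-velocity motion model and a detector that observes the full state (so the observation matrix $H$ is the identity), the matrices and the textbook Kalman update would take the following form. Both assumptions, as well as the measurement-noise covariance $R$, are introduced here for illustration and are not stated in the source.

```latex
\begin{aligned}
F &= \begin{bmatrix} I_3 & \Delta t\, I_3 \\ 0_{3\times 3} & I_3 \end{bmatrix},
\qquad H = I_6, \\
K_t &= \hat{P}_t H^\top \left( H \hat{P}_t H^\top + R \right)^{-1}, \\
\mathbf{x}_t &= \hat{\mathbf{x}}_t + K_t \left( \mathbf{z}_t - H \hat{\mathbf{x}}_t \right), \\
P_t &= \left( I_6 - K_t H \right) \hat{P}_t .
\end{aligned}
```

Under this reading, the velocity rows of $\mathbf{z}_t$ correct the velocity estimate directly instead of leaving it to be inferred from successive positions.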

Fusion of detector-provided velocity enables robust short-term trajectory continuity, especially during abrupt motion or occlusion periods. Removing this velocity fusion step results in a decline of approximately 3 AMOTA points in nuScenes evaluations.

4. Empirical Performance

Extensive experiments on nuScenes show that ByteTrackV2 leads the 3D MOT leaderboard:

| Modality | AMOTA (%) |
|---|---|
| Camera-based | 56.4 |
| LiDAR-based | 70.1 |

Ablation studies confirm that both hierarchical association stages and velocity-aware motion prediction are essential for optimal performance. ByteTrackV2 consistently exceeds alternatives such as AB3DMOT and GNN3DMOT under shared detection inputs. The detector-agnostic approach enables immediate deployment across diverse 2D or 3D detection models.

5. Implementation Characteristics and Deployment

ByteTrackV2 features a nonparametric architecture: no learnable parameters exist in association or motion modules. The framework requires only the score thresholds $s_h$ and $s_l$ (e.g., $s_h=0.6$, $s_l=0.1$). It treats detectors as interchangeable modules, requiring no retraining or fine-tuning during detector changes. Hungarian matching, employed for assignment, has worst-case $O(n^3)$ complexity, but practical track and detection numbers (typically $n \approx 100$ per frame) allow efficient, real-time operation.
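As a rough, order-of-magnitude reading of the complexity claim, the cubic bound at the quoted problem size works out as follows. Constant factors, hardware, and the actual per-frame counts are not specified in the text, so this is only a scale estimate.

```latex
n^3 \;=\; 100^3 \;=\; 10^{6} \quad \text{elementary steps per assignment round (worst case)}
```

Since the second stage only sees the tracks left over from the first, its cost matrix is no larger than the first one, and the two rounds together stay within roughly $2 \times 10^{6}$ steps per frame under this bound.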

6. Significance and Context

ByteTrackV2 unifies the byte-level association principle across 2D and 3D tracking regimes, providing a generic and efficient solution for MOT. Its approach—systematic mining of low-score detections in hierarchical stages—addresses common failure modes including missed tracks and fragmented trajectories. On modern benchmarks, its detector-agnostic and hyperparameter-light design simplifies deployment in research and industrial settings. A plausible implication is expanded applicability to multi-modal or non-standard detection sources without modifications to the tracking framework (Zhang et al., 2023). ByteTrackV2 exemplifies a trend toward robust, nonparametric tracking systems that leverage advances in detector design while decoupling the tracking logic from detection learning.

7. References and Further Reading

Further developments could explore the extension of hierarchical matching and complementary fusion paradigms to novel detection modalities and spatio-temporal settings.

