RTP-DETR: Fractional Matching via Optimal Transport

Updated 12 January 2026
  • The paper introduces a fractional matching framework that replaces hard assignments with a differentiable, entropy-regularized optimal transport plan computed via Sinkhorn iterations.
  • It leverages entropy regularization and the Sinkhorn–Knopp algorithm to improve convergence and detection accuracy, especially in ambiguous and dense object scenarios.
  • Empirical results on MS-COCO demonstrate significant AP gains (up to +3.5 points) while keeping computational overhead low compared to traditional Hungarian matching.

Fractional Matching via Optimal Transport (RTP-DETR) denotes a methodology wherein the assignment of predictions to ground-truth objects in detection transformers is reformulated as a soft, fractional correspondence problem. This paradigm replaces strict one-to-one matchings, such as those obtained by the Hungarian algorithm, with a differentiable and entropy-regularized optimal transport (OT) plan, optimizing a relaxed cost that allows probability mass to be spread among multiple assignment candidates. The transport plan is computed efficiently using the Sinkhorn–Knopp algorithm and yields consistent improvements in detection accuracy, convergence rate, and handling of dense or ambiguous detection scenarios (Zareapoor et al., 6 Mar 2025).

1. Mathematical Foundations of Fractional Matching in RTP-DETR

Fractional matching via RTP in detection transformers formalizes the set assignment process as an entropy-regularized OT problem. Let $N$ denote the number of ground-truth objects and $M$ the number of predictions. The cost matrix $C \in \mathbb{R}^{N \times M}$ is formed from class and geometric terms:

$$C_{ij} = -\log p_j(c_i^*) + \lambda_{\text{bbox}}\cdot L_{\text{bbox}}(b_i^*, b_j) + \lambda_{\text{GIoU}}\cdot (1-\mathrm{GIoU}(b_i^*, b_j))$$

where $p_j(c_i^*)$ is the probability that prediction $j$ assigns to the ground-truth class $c_i^*$, $L_{\text{bbox}}$ is a box regression loss, and $\mathrm{GIoU}$ is the Generalized Intersection-over-Union metric.
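As an illustration of how such a cost matrix could be assembled, here is a minimal plain-Python sketch. It assumes boxes in `(x1, y1, x2, y2)` format, an L1 distance for the box regression term, and placeholder weights `lam_bbox=5.0` and `lam_giou=2.0`; none of these choices are specified in the text above, and the function names are hypothetical.

```python
import math

def area(b):
    """Area of an (x1, y1, x2, y2) box; empty boxes have area 0."""
    return max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])

def giou(b1, b2):
    """Generalized IoU of two non-degenerate boxes, in [-1, 1]."""
    inter = area((max(b1[0], b2[0]), max(b1[1], b2[1]),
                  min(b1[2], b2[2]), min(b1[3], b2[3])))
    union = area(b1) + area(b2) - inter
    # Smallest axis-aligned box enclosing both inputs.
    hull = area((min(b1[0], b2[0]), min(b1[1], b2[1]),
                 max(b1[2], b2[2]), max(b1[3], b2[3])))
    return inter / union - (hull - union) / hull

def box_l1(b1, b2):
    """L1 distance between box coordinates (assumed regression term)."""
    return sum(abs(p - q) for p, q in zip(b1, b2))

def cost_matrix(gt_classes, gt_boxes, pred_probs, pred_boxes,
                lam_bbox=5.0, lam_giou=2.0, tiny=1e-9):
    """C[i][j] = -log p_j(c_i*) + lam_bbox * L1 + lam_giou * (1 - GIoU).

    The weights are illustrative placeholders, not values from the paper.
    """
    return [[-math.log(pred_probs[j][gt_classes[i]] + tiny)
             + lam_bbox * box_l1(gt_boxes[i], pred_boxes[j])
             + lam_giou * (1.0 - giou(gt_boxes[i], pred_boxes[j]))
             for j in range(len(pred_boxes))]
            for i in range(len(gt_boxes))]

# One ground-truth object of class 0, two predictions.
C = cost_matrix([0], [(0, 0, 1, 1)],
                [[0.9, 0.1], [0.2, 0.8]],
                [(0, 0, 1, 1), (2, 2, 3, 3)])
print(C)  # the first prediction (right class, exact box) is far cheaper
```

A practical implementation would compute the same quantities with batched tensor operations; the nested lists here only keep the sketch dependency-free.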

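The assignment is structured as a “transport plan” $T \in \mathbb{R}_+^{N\times M}$, subject to the marginal constraints

$$T\,\mathbf{1}_M = a, \qquad T^\top \mathbf{1}_N = b,$$

with $a \in \Delta^N$, $b \in \Delta^M$ probability simplex vectors. The objective is

$$\min_{T \ge 0,\; T\mathbf{1}_M = a,\; T^\top \mathbf{1}_N = b} \;\langle T, C\rangle - \varepsilon H(T),$$

where $H(T) = -\sum_{i,j} T_{ij}\log T_{ij}$ is the entrywise entropy and $\varepsilon > 0$ is the regularization parameter. The solution, obtained from the first-order optimality conditions of the Lagrangian, is a diagonal scaling of a Gibbs kernel and is computed via Sinkhorn iterations.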
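The scaling form of the solution can be made explicit with a short derivation. The sketch below assumes the notation $a$, $b$ for the row and column marginals, $\varepsilon$ for the regularization weight, $H(T) = -\sum_{i,j} T_{ij}\log T_{ij}$ for the entropy, and introduces dual variables $f$, $g$ purely for illustration.

```latex
\begin{aligned}
\mathcal{L}(T, f, g) &= \sum_{i,j} T_{ij} C_{ij}
  + \varepsilon \sum_{i,j} T_{ij}\log T_{ij}
  - \sum_i f_i \Big(\sum_j T_{ij} - a_i\Big)
  - \sum_j g_j \Big(\sum_i T_{ij} - b_j\Big), \\
\frac{\partial \mathcal{L}}{\partial T_{ij}} &=
  C_{ij} + \varepsilon\,(\log T_{ij} + 1) - f_i - g_j = 0, \\
T_{ij} &= \exp\!\Big(\tfrac{f_i}{\varepsilon} - \tfrac{1}{2}\Big)\,
          \exp\!\Big(-\tfrac{C_{ij}}{\varepsilon}\Big)\,
          \exp\!\Big(\tfrac{g_j}{\varepsilon} - \tfrac{1}{2}\Big)
        = u_i\, K_{ij}\, v_j .
\end{aligned}
```

Here $K_{ij} = \exp(-C_{ij}/\varepsilon)$ is the Gibbs kernel, and the positive vectors $u$, $v$ are fixed by the two marginal constraints, which is exactly what the Sinkhorn iterations enforce in alternation.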

2. Sinkhorn–Knopp Algorithm for Transport Plan Computation

The entropy-regularized OT problem admits an efficient and differentiable solution through the Sinkhorn–Knopp algorithm. Define the Gibbs kernel

$$K = \exp(-C/\varepsilon)$$

(elementwise) and introduce scaling vectors $u \in \mathbb{R}_+^N$, $v \in \mathbb{R}_+^M$. The iterative steps are

$$u \leftarrow a \oslash (K v), \qquad v \leftarrow b \oslash (K^\top u),$$

where $\oslash$ denotes elementwise division. Upon convergence,

$$T = \mathrm{diag}(u)\, K\, \mathrm{diag}(v).$$

Each iteration is $O(NM)$; empirical convergence is reached in […]. All computations are highly parallelizable and can use log-domain arithmetic (the log-Sinkhorn trick) for numerical stability. The resulting plan $T$ distributes assignment mass fractionally, providing a gradient-friendly and robust alternative to hard matching.
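The iteration described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the paper's implementation: the function name, the default `eps`, and the fixed iteration count are assumptions, and `a`, `b` denote the row and column marginals.

```python
import math

def sinkhorn(C, a, b, eps=0.1, n_iters=200):
    """Entropy-regularized OT plan via Sinkhorn-Knopp scaling.

    C: N x M cost matrix (list of lists); a, b: marginals summing to 1.
    eps and n_iters are illustrative defaults, not values from the paper.
    """
    N, M = len(C), len(C[0])
    # Gibbs kernel K = exp(-C / eps), elementwise.
    K = [[math.exp(-C[i][j] / eps) for j in range(M)] for i in range(N)]
    u, v = [1.0] * N, [1.0] * M
    for _ in range(n_iters):
        # u <- a / (K v): rescale rows to match the row marginals.
        u = [a[i] / sum(K[i][j] * v[j] for j in range(M)) for i in range(N)]
        # v <- b / (K^T u): rescale columns to match the column marginals.
        v = [b[j] / sum(K[i][j] * u[i] for i in range(N)) for j in range(M)]
    # T = diag(u) K diag(v)
    return [[u[i] * K[i][j] * v[j] for j in range(M)] for i in range(N)]

C = [[0.1, 1.0, 2.0],
     [1.5, 0.2, 1.0]]
a = [0.5, 0.5]            # mass per ground-truth object
b = [1 / 3, 1 / 3, 1 / 3]  # mass per prediction
T = sinkhorn(C, a, b, eps=0.5)
for row in T:
    print([round(t, 4) for t in row])
```

Every entry of `T` stays strictly positive, so each ground-truth object keeps a small share of mass on every prediction; the rows and columns sum to `a` and `b` up to numerical tolerance.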

3. Integration and Training Workflow of RTP in DETR

RTP replaces the discrete alignment imposed by Hungarian matching with continuous fractional assignments. In classical DETR, prediction–ground-truth pairs are selected by solving for a permutation $\hat\sigma = \arg\min_{\sigma} \sum_{i=1}^{N} C_{i,\sigma(i)}$ and summing the losses of the matched pairs:

$$\mathcal{L}_{\text{Hungarian}} = \sum_{i=1}^{N} \ell_{i,\hat\sigma(i)}$$

In RTP-DETR, the fractional plan $T$ is computed for all pairs, and the transport-based loss is

$$\mathcal{L}_{\text{RTP}} = \sum_{i=1}^{N}\sum_{j=1}^{M} T_{ij}\,\ell_{ij}$$

where $\ell_{ij}$ denotes the per-pair loss between ground-truth $i$ and prediction $j$. Training proceeds by a forward pass (forming $C$), Sinkhorn computation of $T$, evaluation of $\mathcal{L}_{\text{RTP}}$, and backpropagation through both the network and the Sinkhorn solver.
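The difference between the two losses can be made concrete with a small sketch. It assumes a per-pair loss matrix `L` and a given plan `T`; the brute-force search over assignments is only a stand-in for the Hungarian algorithm on tiny inputs (it requires no more ground truths than predictions), and all names are hypothetical.

```python
from itertools import permutations

def hungarian_loss(L):
    """Hard one-to-one loss: minimum over injective maps gt i -> pred s[i].

    Brute-force stand-in for the Hungarian algorithm; tiny inputs only.
    """
    N, M = len(L), len(L[0])
    return min(sum(L[i][s[i]] for i in range(N))
               for s in permutations(range(M), N))

def transport_loss(T, L):
    """Fractional loss: every pair (i, j) contributes T[i][j] * L[i][j]."""
    return sum(T[i][j] * L[i][j]
               for i in range(len(L)) for j in range(len(L[0])))

L = [[0.1, 0.9],
     [0.8, 0.2]]
T_hard = [[0.5, 0.0], [0.0, 0.5]]  # all mass on the best permutation
T_soft = [[0.4, 0.1], [0.1, 0.4]]  # some mass spread to the other pairs

print(hungarian_loss(L))          # hard matching picks pairs (0,0) and (1,1)
print(transport_loss(T_hard, L))  # same pairs, scaled by the 1/N marginals
print(transport_loss(T_soft, L))  # off-diagonal pairs now receive gradient
```

With simplex marginals, a plan concentrated on a permutation reproduces the hard loss up to the factor $1/N$, whereas a soft plan gives every pair a nonzero weight and therefore a nonzero gradient. In an automatic-differentiation framework the plan itself would be produced by differentiable Sinkhorn steps rather than supplied by hand.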

4. Theoretical Properties and Probabilistic Interpretation

RTP and related fractional matching via OT possess important structural and probabilistic properties (Shalam et al., 2022):

  • Parameter-free assignment: Only the entropy regularization $\varepsilon$ is tunable.
  • Differentiability: The full workflow (network and transport) is differentiable, enabling end-to-end optimization.
  • Set-equivariance and symmetry: Permuting inputs permutes assignments consistently.
  • Probabilistic interpretation: The plan entry $T_{ij}$ represents fractional belief in the match between ground-truth $i$ and prediction $j$; each row $T_{i,:}$, normalized by its marginal, is a probability distribution over predictions, and each column likewise over ground truths. Soft assignment naturally models ambiguity and object overlap.
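The probabilistic reading in the last item can be illustrated by normalizing each row of a plan by its row sum. The helper below is a hypothetical sketch with hand-picked numbers, not part of the method itself.

```python
def match_distribution(T):
    """Turn each row of a transport plan into a distribution over predictions."""
    return [[t / sum(row) for t in row] for row in T]

# Two ground-truth objects, three predictions; each row carries mass 0.5.
T = [[0.30, 0.15, 0.05],
     [0.05, 0.10, 0.35]]
P = match_distribution(T)
for i, row in enumerate(P):
    print(f"ground truth {i}:", [round(p, 3) for p in row])
```

The first object places most of its belief on the first prediction but keeps a sizeable share on the second, which is how ambiguity between overlapping candidates shows up in the plan.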

5. Empirical Performance on Detection and Ablation Analyses

On MS-COCO val2017 (ResNet-50, 12 epochs), RTP-DETR reaches 50.4 AP, compared with 46.9 for Deformable-DETR and 49.7 for DINO-DETR, absolute gains of +3.5 AP and +0.7 AP, respectively. On PASCAL VOC, the system shows consistent improvements of +1–2 mAP. Ablation studies indicate:

  • Setting $\varepsilon \to 0$ collapses fractional matching to hard OT, decreasing AP by ~1.5 points.
  • The optimal $\varepsilon$ is […] with […]; AP peaks in […].
  • Sinkhorn iterations […] are sufficient.
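The first ablation, where a vanishing regularization weight recovers hard assignment, can be reproduced on a toy problem. The sketch below uses a log-domain variant of Sinkhorn, which iterates dual potentials `f`, `g` and avoids forming `exp(-C/eps)` directly; the function names and the values of `eps` are illustrative assumptions.

```python
import math

def logsumexp(xs):
    """Numerically stable log(sum(exp(x) for x in xs))."""
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

def log_sinkhorn(C, a, b, eps, n_iters=200):
    """Log-domain Sinkhorn: T[i][j] = exp((f[i] + g[j] - C[i][j]) / eps)."""
    N, M = len(C), len(C[0])
    f, g = [0.0] * N, [0.0] * M
    for _ in range(n_iters):
        # Enforce row marginals.
        f = [eps * (math.log(a[i])
                    - logsumexp([(g[j] - C[i][j]) / eps for j in range(M)]))
             for i in range(N)]
        # Enforce column marginals.
        g = [eps * (math.log(b[j])
                    - logsumexp([(f[i] - C[i][j]) / eps for i in range(N)]))
             for j in range(M)]
    return [[math.exp((f[i] + g[j] - C[i][j]) / eps) for j in range(M)]
            for i in range(N)]

C = [[0.0, 1.0],
     [1.0, 0.0]]
a = b = [0.5, 0.5]
for eps in (1.0, 0.1, 0.01):
    T = log_sinkhorn(C, a, b, eps)
    print(eps, [[round(t, 4) for t in row] for row in T])
```

As `eps` shrinks, the off-diagonal mass vanishes and the plan approaches the hard one-to-one matching; larger `eps` spreads mass across both candidates. For much smaller `eps` or larger costs, the plain kernel `exp(-C/eps)` would underflow, which is where the log-domain form becomes necessary.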

Performance gains stem from robust handling of varying object densities and overlapping objects, and from improved small-object detection (AP_Small rises by 1.5–2 points). The computational overhead amounts to […] per image, which remains low and compares favorably with the cubic cost of Hungarian matching.

6. Connections to Self-Optimal-Transport and Broader Applications

The fractional matching methodology in RTP-DETR generalizes the Self-Optimal-Transport (SOT) transform (Shalam et al., 2022), which also solves an entropy-regularized OT problem but in the context of feature set refinement for clustering and few-shot classification. In SOT, feature similarity matrices define the costs, and the resulting transport plan upgrades feature sets probabilistically:

  • Unsupervised clustering: SOT-transformed features improve k-means clustering accuracy, NMI, ARI across varying dimensions and noise.
  • Few-shot classification: SOT as post-processing boosts 5-way 1-shot accuracy by up to ~10% and 5-shot accuracy by ~5% on Mini-ImageNet, CIFAR-FS, and CUB.
  • Person re-identification: SOT on union of gallery/query features improves mAP and Rank-1 by 2–7%.

Both SOT and RTP utilize entropy-regularized fractional plans for enhanced intra-class cohesion and inter-class separation.
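To give a feel for the self-transport idea, the following is a loose sketch rather than the SOT authors' implementation. It assumes squared Euclidean costs between features, excludes self-matches by zeroing the kernel diagonal, and uses unit marginals so that each row of the result sums to one; the function name and `eps` are hypothetical.

```python
import math

def sot_like_transform(X, eps=0.5, n_iters=100):
    """Transport a feature set onto itself (needs at least 2 points).

    Returns an n x n matrix W whose row i describes how point i
    distributes unit mass over the other points.
    """
    n = len(X)
    # Gibbs kernel on squared Euclidean distances, with no self-transport.
    K = [[0.0 if i == j else
          math.exp(-sum((p - q) ** 2 for p, q in zip(X[i], X[j])) / eps)
          for j in range(n)] for i in range(n)]
    u, v = [1.0] * n, [1.0] * n
    for _ in range(n_iters):
        u = [1.0 / sum(K[i][j] * v[j] for j in range(n)) for i in range(n)]
        v = [1.0 / sum(K[i][j] * u[i] for i in range(n)) for j in range(n)]
    return [[u[i] * K[i][j] * v[j] for j in range(n)] for i in range(n)]

# Two tight pairs of unit vectors pointing in opposite directions.
X = [(1.0, 0.0), (0.8, 0.6), (-1.0, 0.0), (-0.8, -0.6)]
W = sot_like_transform(X)
for row in W:
    print([round(w, 4) for w in row])
```

Each point sends almost all of its mass to its neighbor in the same group, so the rows of `W` act as new features that are nearly identical within a group and nearly disjoint across groups.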

7. Limitations, Implementation, and Practical Considerations

The main constraints of RTP-DETR are the additional […] complexity per batch and sensitivity to hyperparameter choices ($\varepsilon$ and the number of Sinkhorn iterations). Poor regularization may cause over-splitting or degeneration to greedy assignments. Empirically, tuning $\varepsilon$ and the iteration count yields the best results and stable training. Fractional correspondence provides faster convergence due to smoother gradients and improves model performance in scenarios with ambiguous or dense object distributions.

Forward and Sinkhorn computations add […] ms per image (ResNet-50, 12 epochs); inference runs at 25 FPS vs. 23 FPS for Hungarian-matched DINO-DETR, with an absolute gain of +0.7 AP.

In summary, fractional matching via optimal transport in RTP-DETR and related SOT transforms embodies a differentiable, parameter-free framework that elegantly unifies assignment, grouping, and ranking mechanisms across matching, clustering, and detection domains (Zareapoor et al., 6 Mar 2025, Shalam et al., 2022).
