DIPM: Difficulty-Aware Instance Pair Matcher

Updated 15 November 2025

The paper introduces DIPM, a framework that categorizes instance pairs by difficulty to enhance multi-modal 3D object detection.
It employs IoU-based matching for easy pairs and cosine similarity for hard instance pairing to exploit cross-modal complementarity.
Empirical results on nuScenes show improved mAP, NDS, and recall, particularly for distant, small, or occluded objects.

The Difficulty-aware Instance Pair Matcher (DIPM) is a principled framework for addressing modality-dependent difficulty in multi-modal 3D object detection, where the goal is to accurately detect and localize objects using both LiDAR and camera sensors. DIPM is designed for settings—exemplified by autonomous driving—where challenging targets (“hard instances”), such as distant, small, or occluded objects, may be poorly represented in one sensing modality yet well captured in another. DIPM departs from previous single-guided paradigms by decomposing the instance-level fusion process according to instance difficulty with respect to each modality, generating pairings that are exploited in a dual-guided fusion architecture to better leverage complementary information. DIPM is a core component within DGFusion and has been empirically shown to improve 3D detection, particularly under adverse conditions, on benchmarks such as nuScenes (Jia et al., 13 Nov 2025).

1. Design Rationale and Conceptual Foundations

DIPM was motivated by the observation that information density for “hard instances”—those that are distant, small in spatial extent, or occluded—varies markedly between LiDAR and camera modalities. Traditional multi-modal fusion approaches, including both Point-guide-Image (LiDAR dominates) and Image-guide-Point (camera dominates) frameworks, operate under a one-directional matching constraint and do not distinguish whether an instance is “hard” for a particular modality. DIPM aims to integrate this difficulty awareness by classifying proposals as:

Easy Instance Pairs (EIP), where both modalities provide reliable representations;
Camera-hard/LiDAR-easy Instance Pairs (C-HIP), where only LiDAR is reliable;
LiDAR-hard/Camera-easy Instance Pairs (L-HIP), where only the camera is reliable.

This scheme enables differentiated fusion strategies that exploit cross-modal complementarity according to instance-specific difficulty.

2. Mathematical Formulation and Instance Pair Definition

Let $I_L = \{I_L^i\}_{i=1}^{N_L}$ and $I_C = \{I_C^j\}_{j=1}^{N_C}$ denote the instance-level features for LiDAR and camera proposals, respectively, after filtering by an objectness score threshold $\gamma$.

The DIPM algorithm proceeds in two stages:

Stage 1: Easy Instance Pair (EIP) Assignment

Each LiDAR instance $I_L^i$ is matched to the camera instance $I_C^{j_i}$ with the maximum 2D bounding box intersection-over-union (IoU):

$j_i = \arg\max_{1\leq j \leq N_C} \mathrm{IoU}(I_L^i, I_C^j)$

An instance pair $\{I_L^i, I_C^{j_i}\}$ is designated as an EIP if its IoU exceeds a threshold $\eta$:

$\mathrm{IoU}(I_L^i, I_C^{j_i}) > \eta,\quad\eta \in (0,1)$

The EIP set is defined as: $\mathrm{EIP} = \left\{ \{I_L^i, I_C^{j_i}\}~|~\mathrm{IoU}(I_L^i, I_C^{j_i}) > \eta \right\}$
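Stage 1 can be sketched in a few lines of plain Python. This is a minimal illustration rather than the authors' implementation: it assumes each LiDAR proposal has already been projected to a 2D image-plane box in `(x1, y1, x2, y2)` form, and the names `box_iou` and `match_easy_pairs` are hypothetical.

```python
from typing import List, Tuple

# Assumed box format: (x1, y1, x2, y2) in image-plane coordinates.
Box = Tuple[float, float, float, float]


def box_iou(a: Box, b: Box) -> float:
    """2D intersection-over-union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def match_easy_pairs(
    lidar_boxes: List[Box], cam_boxes: List[Box], eta: float = 0.7
) -> List[Tuple[int, int]]:
    """Return (i, j_i) index pairs whose best IoU strictly exceeds eta."""
    pairs: List[Tuple[int, int]] = []
    if not cam_boxes:
        return pairs
    for i, lidar_box in enumerate(lidar_boxes):
        ious = [box_iou(lidar_box, cam_box) for cam_box in cam_boxes]
        # j_i = argmax_j IoU(I_L^i, I_C^j); ties resolve to the lowest index.
        j_i = max(range(len(cam_boxes)), key=ious.__getitem__)
        if ious[j_i] > eta:
            pairs.append((i, j_i))
    return pairs


# Hypothetical toy boxes: the first LiDAR box overlaps the first camera box.
lidar = [(0, 0, 10, 10), (20, 20, 30, 30)]
camera = [(0, 0, 10, 9), (50, 50, 60, 60)]
print(match_easy_pairs(lidar, camera))  # [(0, 0)]
```

Because the argmax is taken independently for each LiDAR instance, this sketch does not enforce one-to-one matching: two LiDAR proposals may select the same camera proposal. Whether the original method adds such a constraint is not stated here.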

Stage 2: Hard Instance Pair (HIP) Assignment

Unmatched proposals for each modality are designated “hard” for that modality.

C-HIP (“Camera-hard”): For each unmatched camera instance $I_C^{um_k}$, find the matched camera feature $I_C^{m_t}$ with the highest cosine similarity, where $t$ indexes the camera features matched in EIP:

$t(k) = \arg\max_{t} ~ \frac{\langle I_C^{um_k}, I_C^{m_t} \rangle}{\|I_C^{um_k}\| \, \|I_C^{m_t}\|}$

Pair $I_C^{um_k}$ with the corresponding LiDAR feature $I_L^{m_{t(k)}}$ from EIP; the complete set forms C-HIP: $\mathrm{C\text{-}HIP} = \left\{ \{I_C^{um_k}, I_L^{m_{t(k)}}\} \right\}_k$

L-HIP (“LiDAR-hard”): Defined analogously. Each unmatched LiDAR instance is matched by cosine similarity to the most similar matched LiDAR feature in EIP and paired with that feature's camera partner.
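The Stage 2 pairing rule can be illustrated as follows. This is a sketch under stated assumptions, not the reference implementation: features are plain lists of floats, EIP is given as `(lidar_idx, cam_idx)` index pairs, and the helper names `cosine` and `pair_hard_instances` (with its `own_slot` parameter) are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u > 0 and norm_v > 0 else 0.0


def pair_hard_instances(
    feats: List[Sequence[float]],
    eip: List[Tuple[int, int]],
    own_slot: int,
) -> List[Tuple[int, int]]:
    """Pair each unmatched instance with the cross-modal partner of its
    most similar matched instance in the same modality.

    feats:    instance features of one modality
    eip:      easy pairs as (lidar_idx, cam_idx)
    own_slot: 0 if feats are LiDAR features, 1 if camera features
    Returns (unmatched_idx, partner_idx_in_other_modality) pairs.
    """
    if not eip:
        return []
    matched = {pair[own_slot] for pair in eip}
    hard_pairs = []
    for k in range(len(feats)):
        if k in matched:
            continue
        # Most similar matched feature of the same modality.
        best = max(eip, key=lambda p: cosine(feats[k], feats[p[own_slot]]))
        hard_pairs.append((k, best[1 - own_slot]))
    return hard_pairs


# Hypothetical toy features: camera instance 2 is unmatched and resembles camera 0.
cam_feats = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
eip = [(0, 0), (1, 1)]
print(pair_hard_instances(cam_feats, eip, own_slot=1))  # [(2, 0)]
```

Calling the same function with LiDAR features and `own_slot=0` pairs each unmatched LiDAR instance with a camera index instead, mirroring the symmetric definition in the text.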

3. DIPM Algorithmic Workflow

The following table summarizes the main steps in DIPM's algorithmic procedure.

Step	Operation	Output
1	IoU-based matching ($\eta$, $\gamma$)	EIP
2	Identify unmatched instances	Hard cases
3	Cosine similarity pairing for unmatched features	C-HIP, L-HIP
4	Return instance pairs	EIP, C-HIP, L-HIP

The algorithm, as stated in the source, iterates over all LiDAR instances and matches them as above, then pairs each unmatched camera or LiDAR proposal with the EIP member of the same modality that has the maximum cosine similarity. Hyperparameters include the proposal score threshold ($\gamma = 0.7$) and the IoU threshold ($\eta = 0.7$).

4. Matching Criteria and Loss Function Integration

DIPM employs distinct matching criteria at each stage. Stage 1 relies on 2D bounding-box IoU to establish “easy” instance pairs, while Stage 2 uses cosine similarity to associate each unmatched (hard) proposal with its most similar counterpart among the matched (easy) group.

A dedicated loss term regularizes the cross-modal similarity among EIPs:

$\mathcal{L}_{\mathrm{Cos}} = \frac{1}{|\mathrm{EIP}|} \sum_{\{I_L^i, I_C^{j_i}\} \in \mathrm{EIP}} \Bigl( 1 - \frac{ \langle I_L^i, I_C^{j_i} \rangle }{ \|I_L^i\| \, \|I_C^{j_i}\| } \Bigr)$
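As a numerical illustration of this term, the following sketch evaluates the mean cosine distance over a set of easy pairs. It assumes features are plain lists of floats and EIP is given as `(lidar_idx, cam_idx)` index pairs; the function name `cosine_alignment_loss` and the `eps` guard are hypothetical. A training implementation would use differentiable tensor operations instead.

```python
import math
from typing import List, Sequence, Tuple


def cosine_alignment_loss(
    lidar_feats: List[Sequence[float]],
    cam_feats: List[Sequence[float]],
    eip: List[Tuple[int, int]],
    eps: float = 1e-12,
) -> float:
    """Mean of (1 - cosine similarity) over all easy instance pairs."""
    if not eip:
        return 0.0  # Assumption: no easy pairs contributes no penalty.
    total = 0.0
    for i, j in eip:
        u, v = lidar_feats[i], cam_feats[j]
        dot = sum(a * b for a, b in zip(u, v))
        norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
        total += 1.0 - dot / max(norm, eps)  # eps guards against zero vectors
    return total / len(eip)


# Hypothetical toy features: one perfectly aligned pair, one orthogonal pair.
lidar_feats = [[1.0, 0.0], [1.0, 0.0]]
cam_feats = [[1.0, 0.0], [0.0, 1.0]]
print(cosine_alignment_loss(lidar_feats, cam_feats, [(0, 0), (1, 1)]))  # 0.5
```

The loss is 0 when every paired feature points in the same direction and approaches 2 when paired features point in opposite directions.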

The full network loss aggregates the detection head, auxiliary instance-prediction, and similarity penalty components:

$\mathcal{L} = \lambda_1\,\mathcal{L}_{H} + \lambda_2\,\mathcal{L}_{L} + \lambda_3\,\mathcal{L}_{C} + \lambda_4\,\mathcal{L}_{\mathrm{Cos}}$

with typical values $\lambda_1 = 0.99$, $\lambda_2 = \lambda_3 = 10^{-4}$, and $\lambda_4 = 10^{-2}$.
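To make the relative scale of the four terms concrete, the weighted sum can be evaluated on made-up component values. The weights below are the typical values quoted above; the component losses and the function name `total_loss` are purely hypothetical.

```python
from typing import Tuple

# Typical weights from the text: (lambda_1, lambda_2, lambda_3, lambda_4).
DEFAULT_WEIGHTS: Tuple[float, float, float, float] = (0.99, 1e-4, 1e-4, 1e-2)


def total_loss(
    l_head: float,
    l_lidar: float,
    l_cam: float,
    l_cos: float,
    weights: Tuple[float, float, float, float] = DEFAULT_WEIGHTS,
) -> float:
    """Weighted sum of detection-head, LiDAR/camera auxiliary, and cosine terms."""
    terms = (l_head, l_lidar, l_cam, l_cos)
    return sum(w * t for w, t in zip(weights, terms))


# Hypothetical component values, chosen only to show the arithmetic:
# 0.99*2.0 + 1e-4*1.0 + 1e-4*1.0 + 1e-2*0.5 = 1.9852
print(round(total_loss(2.0, 1.0, 1.0, 0.5), 4))  # 1.9852
```

With these weights the detection-head term dominates, and the auxiliary and similarity terms act as small regularizers unless their raw magnitudes are much larger than the detection loss.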

The cosine similarity penalty is intended to stabilize early training and to align the embeddings of paired EIP features across modalities.

5. Dual-guided Fusion Modules and Pair Utilization

DIPM's outputs are directly exploited by two specialized modules within the dual-guided fusion paradigm:

Point-guide-Image Enhancement (PGIE): Uses easy LiDAR features from EIP and C-HIP to reweight and inject feature augmentation into the camera BEV representation at individual instance centers.
Image-guide-Point Enhancement (IGPE): Utilizes easy camera features from L-HIP to strengthen the LiDAR BEV representation through neighborhood interpolation and distance-weighted fusion.

In both modules, the cross-modal “easy” partner provides feature augmentation, guiding correction of “hard” modality representations at the instance level.
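The text does not specify the exact fusion operators, so the following is only a loose sketch of what instance-level, distance-weighted injection into a BEV grid could look like. Everything in it is an assumption made for illustration: the additive update, the linear falloff weight, the `radius` parameter, the nested-list grid layout, and the function name `inject_instance_feature`. It should not be read as the PGIE or IGPE design.

```python
import math
from typing import List, Sequence, Tuple

# Assumed layout: bev[row][col] is a mutable list of C channel values.
BEVGrid = List[List[List[float]]]


def inject_instance_feature(
    bev: BEVGrid,
    center: Tuple[float, float],
    feat: Sequence[float],
    radius: float = 1.5,
) -> BEVGrid:
    """Add `feat` to BEV cells within `radius` of `center` (in cell units),
    weighted by a linear falloff: 1 at the center, 0 at the radius."""
    height, width = len(bev), len(bev[0])
    cr, cc = center
    row_lo, row_hi = max(0, math.floor(cr - radius)), min(height, math.ceil(cr + radius) + 1)
    col_lo, col_hi = max(0, math.floor(cc - radius)), min(width, math.ceil(cc + radius) + 1)
    for r in range(row_lo, row_hi):
        for c in range(col_lo, col_hi):
            dist = math.hypot(r - cr, c - cc)
            if dist > radius:
                continue
            weight = 1.0 - dist / radius
            for ch in range(len(feat)):
                bev[r][c][ch] += weight * feat[ch]
    return bev


# Hypothetical 3x3 single-channel grid with one instance at its center.
grid = [[[0.0] for _ in range(3)] for _ in range(3)]
inject_instance_feature(grid, center=(1.0, 1.0), feat=[2.0], radius=2.0)
print(grid[1][1], grid[0][1])  # [2.0] [1.0]
```

In this sketch the "easy" partner feature from an instance pair would play the role of `feat`, and the "hard" modality's BEV map would play the role of `bev`.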

6. Hyperparameters and Implementation Considerations

DIPM introduces several key hyperparameters:

$\gamma$: Proposal score filter (typical value 0.7); adjusts the trade-off between noise suppression and the number of available matches.
$\eta$: IoU threshold for defining EIP (typical value 0.7).
Loss weights ($\lambda_1$–$\lambda_4$): See Section 4 for the typical settings.

These hyperparameters modulate the selectivity and stability of instance matching and fusion. The value of $\gamma$ in particular regulates sensitivity to spurious proposals during early learning.
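These settings can be gathered into a small configuration object. The sketch below is illustrative only: the class name `DIPMConfig`, the helper `filter_proposals`, and the choice of an inclusive comparison (`>=`) for the score filter are assumptions, since the text does not say whether the threshold is strict.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class DIPMConfig:
    """Hyperparameters quoted in the text, with their typical values."""
    gamma: float = 0.7  # proposal (objectness) score threshold
    eta: float = 0.7    # IoU threshold for easy instance pairs
    loss_weights: Tuple[float, float, float, float] = (0.99, 1e-4, 1e-4, 1e-2)


def filter_proposals(scores: List[float], cfg: DIPMConfig) -> List[int]:
    """Return indices of proposals kept by the score filter.
    Assumption: a proposal is kept when score >= gamma."""
    return [i for i, score in enumerate(scores) if score >= cfg.gamma]


# Hypothetical objectness scores for four proposals.
scores = [0.9, 0.5, 0.7, 0.2]
print(filter_proposals(scores, DIPMConfig()))           # [0, 2]
print(filter_proposals(scores, DIPMConfig(gamma=0.4)))  # [0, 1, 2]
```

Lowering `gamma` admits more proposals into matching, which increases the pool of potential pairs at the cost of more spurious candidates.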

7. Impact on 3D Object Detection Performance

Empirical evaluation on the nuScenes benchmark demonstrates the efficacy of DIPM in enhancing detection, especially for hard instances. Relative to the BEVFusion baseline, the application of DIPM with dual-guided modules yields:

For objects at distance >40 m: mAP $+1.2$, NDS $+0.6$
Under low visibility (tokens 1/2/3): mAP $+0.8$, NDS $+0.3$
For small objects (volume 0–10 m³): mAP $+0.5$, NDS $+0.4$

Recall at IoU=0.5 increases by +1.4, and at IoU=0.3 by +1.9. Overall, DGFusion models with DIPM achieve +1.0 mAP, +0.8 NDS, and +1.3 average recall over baseline on the nuScenes test set.

This indicates that DIPM, by enabling difficulty-aware fusion at the instance level, improves the robustness of multi-modal 3D object detectors on distant, low-visibility, and small objects, with gains of roughly 0.5 to 1.2 mAP in those categories.

References (1)

Jia et al. (2025). DGFusion: Dual-guided Fusion for Robust Multi-Modal 3D Object Detection.
