Contrastive Proposal Encoding Loss (CPE Loss)

Updated 3 March 2026

Contrastive Proposal Encoding Loss (CPE Loss) is a contrastive objective that refines object proposal representations by enforcing compact intra-class clusters and distinct inter-class features.
It selects positive and negative proposal pairs based on spatial overlap (IoU) and contextual extension, effectively addressing challenges in few-shot and weakly supervised detection.
Reported results include gains of up to +8.8% AP on PASCAL VOC and +2.7% AP on COCO in few-shot detection, and an mAP of 55.9% on VOC 2007 in weakly supervised detection.

Contrastive Proposal Encoding Loss (CPE Loss) denotes a set of supervised or weakly-supervised contrastive objectives designed to improve object proposal representations for detection, specifically targeting the separation of intra-class and inter-class proposal features as well as improving proposal “integrity” in both fully and weakly supervised detection. CPE Loss achieves this by contrasting proposals within a batch based on spatial overlap (IoU) and/or contextual extension, driving proposal embeddings to be compact within a class and distinct across classes. Two major instantiations of CPE Loss have been introduced: one for few-shot detection through proposal-level contrastive supervision (Sun et al., 2021) and another in weakly-supervised detection via proposal extension and directional LSTM encoding (Lv et al., 2021).

1. Mathematical Formulation in Few-Shot Object Detection

In the context of few-shot object detection (FSOD) (Sun et al., 2021), CPE Loss is formulated over a batch of $N$ sampled region-of-interest (RoI) proposals. Each proposal $i$ is characterized by:

Feature vector $x_i \in \mathbb{R}^{D_R}$ from the detection box head
Ground-truth class label $y_i \in \{1, \ldots, C\}$
IoU with the matched ground-truth box $u_i \in [0,1]$

A lightweight multi-layer perceptron (MLP) projection head transforms $x_i$ to a contrastive embedding $z_i = h(x_i) \in \mathbb{R}^{D_C}$ , which is $\ell_2$ -normalized as $\tilde{z}_i = z_i/\|z_i\|_2$ .
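The normalization step can be illustrated with a minimal sketch in plain Python. The helper names `l2_normalize` and `dot` are illustrative assumptions, and the projection head $h$ is omitted (the vectors below stand in for its outputs). After normalization, the dot product of two embeddings equals their cosine similarity, which is the quantity the loss operates on.

```python
import math

def l2_normalize(z):
    """Scale z to unit Euclidean length (assumes z is not the zero vector)."""
    norm = math.sqrt(sum(v * v for v in z))
    return [v / norm for v in z]

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

# Hypothetical projection-head outputs z_i and z_j.
z_i = [3.0, 4.0]
z_j = [4.0, 3.0]

zi_n = l2_normalize(z_i)  # approximately [0.6, 0.8]
zj_n = l2_normalize(z_j)  # approximately [0.8, 0.6]

# For unit vectors the dot product is the cosine similarity, bounded in [-1, 1].
cos_ij = dot(zi_n, zj_n)  # approximately 0.96
```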

Defining anchor set $A = \{i | u_i \geq \phi\}$ (where $\phi$ is an IoU threshold), the loss for proposal $i$ is:

$L_{z_i} = -\frac{1}{|P(i)|} \sum_{j\in P(i)} \log \left( \frac{\exp((\tilde{z}_i \cdot \tilde{z}_j)/\tau)}{\sum_{k \neq i}\exp((\tilde{z}_i \cdot \tilde{z}_k)/\tau)} \right)$

where $P(i)$ is the set of proposals in $A$ that share the class of $i$ (excluding $i$ itself) and $\tau$ is the temperature hyper-parameter. Each proposal's loss is weighted by a proposal-quality term $f(u_i) = \mathbb{I}\{u_i\geq\phi\}\,g(u_i)$ (typically $g(u)=1$ ).

The batch-level CPE Loss is:

$L_{CPE} = \frac{1}{N} \sum_{i=1}^N f(u_i) L_{z_i}$
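As a sanity check on the two formulas above, the following is a minimal pure-Python sketch, not the reference implementation. The function name `cpe_loss` and its argument layout are assumptions made for illustration. The formula is undefined when $P(i)$ is empty, so the sketch assumes that such anchors contribute zero.

```python
import math

def cpe_loss(z_norm, labels, ious, tau=0.2, phi=0.7, g=lambda u: 1.0):
    """Batch-level CPE loss over l2-normalized embeddings (lists of floats)."""
    n = len(z_norm)
    # Pairwise cosine similarities between all proposals in the batch.
    sim = [[sum(a * b for a, b in zip(z_norm[i], z_norm[k])) for k in range(n)]
           for i in range(n)]
    # Anchor set A: proposals whose IoU reaches the threshold phi.
    anchors = [i for i in range(n) if ious[i] >= phi]
    total = 0.0
    for i in anchors:
        positives = [j for j in anchors if j != i and labels[j] == labels[i]]
        if not positives:
            continue  # assumption: an anchor without positives contributes 0
        # The denominator runs over every other proposal k != i.
        log_denom = math.log(sum(math.exp(sim[i][k] / tau)
                                 for k in range(n) if k != i))
        l_i = -sum(sim[i][j] / tau - log_denom for j in positives) / len(positives)
        total += g(ious[i]) * l_i
    return total / n  # average over all N proposals, as in the batch formula

# Four toy embeddings: two along each axis.
z = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
high_iou = [0.9, 0.9, 0.9, 0.9]

# Labels that agree with the geometry give a much lower loss than labels that do not.
loss_clustered = cpe_loss(z, [1, 1, 2, 2], high_iou)
loss_mixed = cpe_loss(z, [1, 2, 1, 2], high_iou)
```

In this toy batch the clustered labelling yields a loss close to zero, while the mixed labelling is penalized heavily because each anchor's only positive is orthogonal to it.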

CPE Loss is added to the multi-task detection objective as $L_{total} = L_{rpn\_cls} + L_{rpn\_loc} + L_{roi\_cls} + L_{roi\_reg} + \lambda L_{CPE}$ , with $\lambda = 0.5$ weighting the contrastive term.

2. Positive and Negative Pair Selection via IoU

Positive and negative pairs for contrastive learning are determined based on IoU and matched classes:

Anchor set $A$ comprises proposals with $u_i \geq \phi$ (default $\phi=0.7$ ).
For anchor $i \in A$ , the positive set $P(i) = \{j\neq i | j\in A, y_j = y_i\}$ .
All other proposals $k \neq i$ appear in the denominator, acting as negatives.
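The selection rules above can be sketched as follows. The function name `select_pairs` and the toy labels are illustrative assumptions. The sketch follows the stated rules literally: a low-IoU proposal is never an anchor or a positive, but it still appears in the denominator of every anchor.

```python
def select_pairs(labels, ious, phi=0.7):
    """Return the anchor set, per-anchor positives, and per-anchor denominator indices."""
    n = len(labels)
    anchors = [i for i in range(n) if ious[i] >= phi]
    positives = {
        i: [j for j in anchors if j != i and labels[j] == labels[i]]
        for i in anchors
    }
    # Every other proposal k != i enters the denominator, regardless of IoU.
    denominators = {i: [k for k in range(n) if k != i] for i in anchors}
    return anchors, positives, denominators

labels = ["cat", "cat", "dog", "cat"]
ious = [0.9, 0.8, 0.75, 0.4]  # the last proposal is poorly localized

anchors, positives, denominators = select_pairs(labels, ious)
# anchors      -> [0, 1, 2]
# positives    -> {0: [1], 1: [0], 2: []}
# denominators -> {0: [1, 2, 3], 1: [0, 2, 3], 2: [0, 1, 3]}
```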

This selection ensures that only well-localized proposals contribute, mitigating noise from poorly localized regions. The pair selection mechanism is critical in few-shot regimes, where confusion between visually similar classes is prevalent.

3. Intuition and Mechanistic Rationale

The core objective of CPE Loss is to enhance instance-level intra-class compactness and inter-class variance:

The numerator in $L_{z_i}$ ( $\exp((\tilde{z}_i \cdot \tilde{z}_j)/\tau)$ ) “pulls” embeddings of the same class together.
The denominator sums over all other proposals in the batch, so minimizing the loss “pushes” the anchor away from embeddings of other classes, increasing the decision margin.
The proposal-quality weighting $f(u_i)$ ensures that only spatially accurate proposals contribute.
Temperature $\tau$ controls distribution sharpness: lower $\tau$ leads to “harder” separation.
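The effect of the temperature can be seen in a small numerical sketch. The similarity values are made up for illustration, and `softmax_over_similarities` is a hypothetical helper that computes the softmax inside the logarithm of $L_{z_i}$.

```python
import math

def softmax_over_similarities(sims, tau):
    """Softmax of cosine similarities scaled by 1/tau."""
    exps = [math.exp(s / tau) for s in sims]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical similarities between one anchor and three other proposals.
sims = [0.9, 0.5, 0.1]

sharp = softmax_over_similarities(sims, tau=0.07)  # low temperature
soft = softmax_over_similarities(sims, tau=0.5)    # high temperature

# With the lower temperature, almost all probability mass goes to the most
# similar proposal; with the higher temperature the distribution is flatter.
```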

This mechanism addresses misclassification of novel instances by encouraging features for the same class to cluster tightly, while dispersing those of different classes, particularly improving the discrimination of rare classes in the FSOD setting (Sun et al., 2021).

4. Integration in Weakly Supervised Detection via Proposal Extension

In weakly supervised object detection (WSOD), CPE Loss is realized as a module within the Multiple-Instance Learning (MIL) paradigm (Lv et al., 2021). The module, termed Contrastive Proposal Extension, compares an initial proposal $B$ with its extended counterpart $B_d$ along each of four directions $d$ (left, right, top, bottom). Each comparison employs:

RoI-pooling to extract features $X^B, X^{B_d}$
Directional LSTM encoders over the spatial dimension
Dual-stream decoders producing per-proposal, per-class scores

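Writing $S^B_{i,c}$ and $S^{B_d}_{i,c}$ for the scores of proposal $i$ and class $c$ computed from $B$ and $B_d$ respectively, the central “contrastive encoding” score for direction $d$ is the min-max normalized absolute score difference: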

$\mathcal{N}^d_{i,c} = \frac{|\;S^B_{i,c} - S^{B_d}_{i,c}\;| - \min_{i',c'} |\;S^B_{i',c'} - S^{B_d}_{i',c'}\;|}{\max_{i',c'} |\;S^B_{i',c'} - S^{B_d}_{i',c'}\;| - \min_{i',c'} |\;S^B_{i',c'} - S^{B_d}_{i',c'}\;|}$
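The normalization above can be sketched for a single direction as follows. The function name `contrast_scores` and the toy score matrices are assumptions. The formula is undefined when all differences are equal, and the sketch assumes an all-zero output in that case.

```python
def contrast_scores(s_orig, s_ext):
    """Min-max normalized |S^B - S^{B_d}| over all proposals i and classes c."""
    diffs = [[abs(a - b) for a, b in zip(row_o, row_e)]
             for row_o, row_e in zip(s_orig, s_ext)]
    flat = [d for row in diffs for d in row]
    lo, hi = min(flat), max(flat)
    if hi == lo:
        # Assumption: with no variation, return zeros instead of dividing by zero.
        return [[0.0 for _ in row] for row in diffs]
    return [[(d - lo) / (hi - lo) for d in row] for row in diffs]

# Rows are proposals, columns are classes (hypothetical scores).
s_orig = [[0.8, 0.1], [0.3, 0.3]]
s_ext = [[0.2, 0.1], [0.3, 0.5]]

n_d = contrast_scores(s_orig, s_ext)
# The largest score change maps to 1, the smallest to 0.
```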

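The final CPE Loss for WSOD sums the MIL losses of the two decoders, $L_W^1(d)$ and $L_W^2(d)$, and averages the result over the four directions (in the index set below, $B$ denotes the bottom direction, not the proposal):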

$L_{CPE} = \frac{1}{4}\sum_{d\in \{L,R,T,B\}} \left( L_W^1(d) + L_W^2(d) \right)$
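A direct transcription of this aggregation, with made-up per-direction loss values, makes explicit that the two decoder losses are summed and only the directions are averaged. The function name `cpe_wsod_loss` and the dictionary layout are illustrative assumptions.

```python
def cpe_wsod_loss(mil_losses):
    """Average over directions of the summed MIL losses of the two decoders.

    mil_losses maps each direction to a pair (L_W^1(d), L_W^2(d)).
    """
    directions = ("L", "R", "T", "B")  # left, right, top, bottom
    return sum(mil_losses[d][0] + mil_losses[d][1] for d in directions) / len(directions)

# Hypothetical MIL losses for the two decoders in each direction.
example_losses = {
    "L": (0.5, 0.25),
    "R": (0.25, 0.25),
    "T": (0.5, 0.5),
    "B": (0.75, 0.0),
}

l_cpe = cpe_wsod_loss(example_losses)  # (0.75 + 0.5 + 1.0 + 0.75) / 4 = 0.75
```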

CPE Loss is added to the total detection objective as an unweighted sum.

5. Implementation Details and Pseudocode for Few-Shot Detection

A training iteration using CPE Loss in FSOD proceeds as follows:

for each training iteration:
    optimizer.zero_grad()                    # clear gradients from the previous step
    images → backbone + FPN
    anchors → RPN → objectness scores + deltas
    proposals = topK_preNMS(anchors) → NMS → topM proposals
    rois = sample_rois(proposals, batch_size=256, fg/bg=1:3)
    x = RoIAlign(FPN_feats, rois)            # RoI features
    cls_logits, reg_deltas = box_head(x)     # detection outputs
    z = h(x)                                 # contrastive projection head (MLP)
    z_norm = l2_normalize(z, dim=1)          # unit-length embeddings
    L_rpn = compute_RPN_loss(...)
    L_cls, L_reg = compute_RoI_loss(cls_logits, reg_deltas, ground_truth)
    # proposal-quality weights f(u_i)
    for i in range(batch_size):
        u_i = IoU(rois[i], matched_gt[i])
        weight[i] = g(u_i) if u_i >= φ else 0
    S = z_norm @ z_norm.T                    # pairwise cosine similarities
    L_CPE = 0
    valid = {i for i in range(batch_size) if weight[i] > 0}
    for i in valid:
        P_i = {j for j in valid if j != i and y[j] == y[i]}
        if len(P_i) == 0:
            continue                         # no positives: skip to avoid division by zero
        denom = sum(exp(S[i,k]/τ) for k in range(batch_size) if k != i)
        L_i = - (1/len(P_i)) * sum(log(exp(S[i,j]/τ)/denom) for j in P_i)
        L_CPE += weight[i]*L_i
    L_CPE /= batch_size
    L_total = L_rpn + L_cls + L_reg + λ*L_CPE
    L_total.backward()
    optimizer.step()

Key implementation notes include doubling the number of RPN proposals kept after NMS to 2000 and halving the RoI batch size to 256, so that foreground proposals are not overwhelmed by background proposals.

6. Hyper-Parameters and Ablation Results

The following table summarizes the primary hyper-parameters and ablation findings for CPE Loss as reported by Sun et al. (2021):

Parameter	Default / Best Value	Observed Effect
$\mathbf{D_C}$	128	Little effect vs. 256 ( $\sim$ 0.1 AP)
$\mathbf{\tau}$	0.2	Outperforms 0.07/0.5 by 0.5–1.0 AP
$\mathbf{\phi}$	0.7	Best for 5/10-shot; $\phi=0, g(u)=u$ for 3-shot
$\mathbf{g(u)}$	1	$g(u)=u$ (linear) helps only very low-shot cases
$\mathbf{\lambda}$	0.5	Balances detection and contrastive loss
Classification scale	$\alpha=20$	Scales logits for RoI class

Ablation studies indicate that hard-clipping proposals ( $\phi=0.7, g=1$ ) achieves the highest AP for standard few-shot settings, while linear re-weighting slightly benefits extremely low-shot regimes.
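The two weighting schemes compared in the ablation can be written as one function with different arguments. This is a minimal sketch in which `quality_weight` is a hypothetical name and the IoU values are made up.

```python
def quality_weight(u, phi=0.7, g=lambda u: 1.0):
    """Proposal-quality weight f(u) = 1{u >= phi} * g(u)."""
    return g(u) if u >= phi else 0.0

ious = [0.9, 0.75, 0.5, 0.2]

# Hard clipping (phi = 0.7, g = 1): only well-localized proposals count, all equally.
hard = [quality_weight(u) for u in ious]
# hard -> [1.0, 1.0, 0.0, 0.0]

# Linear re-weighting (phi = 0, g(u) = u): every proposal counts, scaled by its IoU.
linear = [quality_weight(u, phi=0.0, g=lambda u: u) for u in ious]
# linear -> [0.9, 0.75, 0.5, 0.2]
```

Linear re-weighting keeps more proposals in play, which is consistent with its reported benefit when very few annotated instances are available.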

7. Empirical Impact and Application Scope

In FSOD, CPE Loss significantly improves detection performance by mitigating confusion between novel and base classes, yielding up to +8.8% AP on PASCAL VOC and +2.7% AP on COCO (Sun et al., 2021). In WSOD, the module drives mAP improvements from $\sim$ 41% to 55.9% on VOC 2007 (pure MIL), moving closer to the fully supervised regime (Lv et al., 2021). Using all four directions in the proposal extension variant outperforms any single direction by 1–2% mAP, and ablations confirm the necessity of the dual-stream decoder. This suggests that the contrastive integrity mechanism is most effective when jointly optimized with proposal-level cross-entropy constraints and explicit context-dependent extension.

CPE Loss variants provide a simple, modular, and effective drop-in objective for object detection tasks characterized by limited supervision or scarce data, with empirical gains robust across standard benchmarks.

References

Sun et al. (2021). FSCE: Few-Shot Object Detection via Contrastive Proposal Encoding.

Lv et al. (2021). Contrastive Proposal Extension with LSTM Network for Weakly Supervised Object Detection.
