
Contrastive Crop–Caption Loss

Updated 17 February 2026
  • The paper introduces SECLA/SECLA-B, a method that significantly improves cross-modal face–name alignment, achieving up to 81.86% F1 on LFW.
  • It employs a symmetry-enhanced contrastive mechanism, using bidirectional similarity between face and name embeddings to robustly align image crops with caption names.
  • The framework incorporates a two-stage bootstrapped approach with prototype regularization to mitigate catastrophic forgetting and enhance weak supervision.

The contrastive crop–caption loss is a class of weakly supervised objectives designed for cross-modal face–name alignment tasks, where the goal is to assign detected faces (crops) within an image to the named entities appearing in the associated caption. Employing a symmetry-enhanced contrastive mechanism, this loss operates by maximizing bidirectional similarity between sets of faces and caption names, rather than relying solely on set-level uncertainty reasoning or classic multiple instance learning. The method, formalized as the Symmetry-Enhanced Contrastive Learning-based Alignment (SECLA) loss and extended in a two-stage bootstrapped version (SECLA-B), has established state-of-the-art results on benchmarks such as Labeled Faces in the Wild (LFW) and CelebrityTogether (CelebTo), and serves as a principled framework for cross-modal instance alignment tasks (Qu et al., 2022).

1. Model Architecture and Embeddings

The framework for contrastive crop–caption loss comprises three principal components: separate encoders for faces and names, and projection modules that map these embeddings into a common latent space. Detected face crops $f_i$ are first embedded using a pre-trained FaceNet model to yield $X_f^i \in \mathbb{R}^{d_f}$, where $d_f = 512$. Named entities $n_j$ (including a special "NONAME" token for null links) are embedded with pre-trained BERT, resulting in $X_n^j \in \mathbb{R}^{d_n}$, with $d_n = 768$. To harmonize dimensionality for contrastive comparison, name embeddings are projected via a 1-layer MLP $g_n : \mathbb{R}^{d_n} \to \mathbb{R}^{d_f}$ to $X_{n'}^j$; subsequently, both face and name vectors are further projected by a 3-layer MLP $g_c : \mathbb{R}^{d_f} \to \mathbb{R}^{d_p}$ ($d_p = 128$; all projections use ReLU activations). The final similarity score is computed as a simple inner product:

$$\text{sim}(f_i, n_j) = (X_{f_p}^i)^\top X_{n_p}^j$$

where $X_{f_p}^i = g_c(X_f^i)$ and $X_{n_p}^j = g_c(X_{n'}^j)$.
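This embedding-and-projection pipeline can be illustrated with random-weight stand-ins for the pre-trained encoders and learned projectors (a minimal NumPy sketch; the weight initialization and exact activation placement are assumptions, not the published implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

d_f, d_n, d_p = 512, 768, 128  # FaceNet, BERT, and common-space dimensions

def mlp(dims):
    """Random-weight MLP as a stand-in for a learned projector."""
    return [(rng.standard_normal((din, dout)) / np.sqrt(din), np.zeros(dout))
            for din, dout in zip(dims[:-1], dims[1:])]

def forward(x, layers):
    for i, (W, b) in enumerate(layers):
        x = x @ W + b
        if i < len(layers) - 1:  # ReLU on hidden layers (placement assumed)
            x = np.maximum(x, 0.0)
    return x

g_n = mlp([d_n, d_f])            # 1-layer name projector (768 -> 512)
g_c = mlp([d_f, d_f, d_f, d_p])  # 3-layer common projector (512 -> 128)

X_f = rng.standard_normal((3, d_f))  # 3 face crops (FaceNet stand-in)
X_n = rng.standard_normal((2, d_n))  # 2 caption names (BERT stand-in)

F_p = forward(X_f, g_c)                # faces into the common space
N_p = forward(forward(X_n, g_n), g_c)  # names: g_n, then g_c

A = F_p @ N_p.T  # A[i, j] = sim(f_i, n_j) as an inner product
print(A.shape)   # (3, 2) face-by-name similarity matrix
```

The resulting matrix $A$ is exactly the pairwise similarity grid that the set-level losses in the following sections operate on.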

2. Pair Definitions and Contrastive Setup

Within each training batch of $B$ image–caption pairs, the $k$-th item provides a local set of faces $F^k = \{f_{k,1}, \ldots, f_{k,n_k}\}$ and names $N^k = \{n_{k,1}, \ldots, n_{k,m_k}\}$. All intra-pair face–name pairs are taken as positive matches. Negative samples for the face-to-name direction consist of all names from other captions in the batch (and symmetrically for name-to-face). This construction provides dense contrastive supervision, a critical departure from unidirectional or binary loss formulations, which perform markedly worse (F1 ≈ 40–44% on LFW versus 80.83% for SECLA).

In the SECLA-B variant, following a first stage of easier one-to-one matches, positive sets are explicitly grounded in face–name pairs established during early-stage training, while negative samples leverage prototype embeddings from other identities.

3. Symmetry-Enhanced Contrastive Loss Formulation (SECLA)

The SECLA loss is constructed to enforce symmetric, dense alignment between faces and names at the set level. For any set of faces $F$ and names $N$:

  • The dense similarity from faces to names is:

$$\text{sim}_d(F, N) = \frac{1}{|F|} \sum_{f_i \in F} \max_{n_j \in N} A_{i,j} \quad \text{where} \quad A_{i,j} = \text{sim}(f_i, n_j)$$

  • Symmetrically, the dense similarity from names to faces:

$$\text{sim}_d(N, F) = \frac{1}{|N|} \sum_{n_j \in N} \max_{f_i \in F} A'_{j,i} \quad \text{where} \quad A'_{j,i} = \text{sim}(n_j, f_i)$$
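In code, both dense set-to-set similarities reduce to a row-wise max followed by a mean over the similarity matrix $A$; the name-to-face direction is the same operation on $A^\top$ (a minimal NumPy sketch with made-up similarity scores):

```python
import numpy as np

def sim_d(A):
    """Dense set-to-set similarity: for each row (query item), take its
    best match in the other set, then average over the query set."""
    return A.max(axis=1).mean()

# Toy 3-face x 2-name similarity matrix (values are illustrative only)
A = np.array([[0.9, 0.1],
              [0.2, 0.8],
              [0.4, 0.3]])

print(sim_d(A))    # faces -> names: mean of (0.9, 0.8, 0.4) = 0.7
print(sim_d(A.T))  # names -> faces: mean of (0.9, 0.8) = 0.85
```

Note the asymmetry: the third face's weak best match drags down the face-to-name score, while every name still finds a strong face, which is precisely the imbalance the agreement loss below penalizes.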

The batchwise contrastive losses become:

$$\mathcal{L}_{f,n} = -\frac{1}{B} \sum_{k=1}^{B} \log \frac{\exp(\text{sim}_d(F^k, N^k))}{\sum_{\ell=1}^{B} \exp(\text{sim}_d(F^\ell, N^k))}$$

$$\mathcal{L}_{n,f} = -\frac{1}{B} \sum_{k=1}^{B} \log \frac{\exp(\text{sim}_d(N^k, F^k))}{\sum_{\ell=1}^{B} \exp(\text{sim}_d(N^\ell, F^k))}$$

To further enforce bidirectional consistency, an agreement loss penalizes asymmetry between face-to-name and name-to-face similarities for each image–caption pair:

$$\mathcal{L}_{agree} = \frac{1}{B} \sum_{k=1}^{B} \left( \text{sim}_d(F^k, N^k) - \text{sim}_d(N^k, F^k) \right)^2$$

The composite SECLA objective is:

$$\boxed{\ \mathcal{L}_{SECLA} = \mathcal{L}_{f,n} + \mathcal{L}_{n,f} + \alpha\,\mathcal{L}_{agree}\ }$$

with agreement weight $\alpha = 0.15$.
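Putting the three terms together, the composite objective can be sketched as follows (NumPy; `faces` and `names` are assumed to be lists of already-projected embedding arrays, one entry per image–caption pair in the batch):

```python
import numpy as np

def sim_d(A):
    """Dense set-to-set similarity: row-wise best match, then mean."""
    return A.max(axis=1).mean()

def secla_loss(faces, names, alpha=0.15):
    """SECLA over a batch. faces[k]: (n_k, d_p) array of projected faces,
    names[k]: (m_k, d_p) array of projected names, for the k-th pair."""
    # S[k, l] = sim_d(F^k, N^l); S_T[k, l] = sim_d(N^k, F^l)
    S = np.array([[sim_d(F @ N.T) for N in names] for F in faces])
    S_T = np.array([[sim_d(N @ F.T) for F in faces] for N in names])

    def info_nce(M):
        # For each column k: numerator M[k, k], denominator sum_l exp(M[l, k])
        logZ = np.log(np.exp(M).sum(axis=0))
        return -(np.diag(M) - logZ).mean()

    L_fn = info_nce(S)    # contrast F^l against each fixed N^k
    L_nf = info_nce(S_T)  # contrast N^l against each fixed F^k
    L_agree = ((np.diag(S) - np.diag(S_T)) ** 2).mean()
    return L_fn + L_nf + alpha * L_agree

# Toy batch of B = 2 pairs in an 8-dimensional common space
rng = np.random.default_rng(1)
faces = [rng.standard_normal((n, 8)) for n in (2, 3)]
names = [rng.standard_normal((m, 8)) for m in (1, 2)]
loss = secla_loss(faces, names)
print(loss)
```

Each InfoNCE-style term is non-negative (the matching pair appears in its own denominator), and the agreement term is a squared difference, so the composite loss is bounded below by zero.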

4. Bootstrapping and Regularization in SECLA-B

SECLA-B introduces a two-stage curriculum. Stage 1 restricts training to easy cases, pairs with a small number of faces/names and no null links, using $\mathcal{L}_{SECLA}$, yielding robust initial face–name links. Stage 2 generalizes to the full dataset, partitioning each batch into previously matched pairs ($B_{match}$) and unmatched pairs ($B_{unmatch}$). For $B_{unmatch}$, $\mathcal{L}_{SECLA}$ continues as before. For $B_{match}$, a bootstrapping and regularization loss ($\mathcal{L}_{stage2}$) leverages prototype faces for each known identity, combining:

  • A bidirectional face–name–prototype contrastive loss ($\mathcal{L}_{f,n,p}^B$)
  • A face–prototype clustering loss ($\mathcal{L}_{f,p}^B$)

Explicitly:

$$\mathcal{L}_{stage2} = \mathcal{L}_{f,n,p}^B + \mathcal{L}_{f,p}^B$$

This dual objective enforces continued learning of new correspondences while preventing forgetting of already learned face–name links.
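The exact stage-2 formulas are not reproduced above; the sketch below shows one plausible reading, in which each matched face is contrasted against its linked name with identity prototypes as extra negatives (one direction only, for brevity), plus a squared-distance clustering pull toward its own prototype. All shapes, the prototype-as-negative construction, and the clustering term are assumptions for illustration:

```python
import numpy as np

def stage2_loss(face_p, name_p, proto_p, identity):
    """Hypothetical reading of L_stage2 for the matched part of a batch.

    face_p:   (n, d) projected faces with established links
    name_p:   (n, d) projected names, name_p[i] linked to face_p[i]
    proto_p:  (K, d) one prototype face per known identity
    identity: (n,) index into proto_p giving each face's identity
    """
    # L_{f,n,p}: each face's own name is the positive; all other names
    # plus all identity prototypes serve as negatives.
    logits = face_p @ np.vstack([name_p, proto_p]).T  # (n, n + K)
    logZ = np.log(np.exp(logits).sum(axis=1))
    pos = np.einsum('ij,ij->i', face_p, name_p)       # face_i . name_i
    L_fnp = -(pos - logZ).mean()
    # L_{f,p}: pull each face toward its identity's prototype embedding.
    L_fp = ((face_p - proto_p[identity]) ** 2).sum(axis=1).mean()
    return L_fnp + L_fp

# Toy matched batch: 4 faces, 3 known identities, 8-dim common space
rng = np.random.default_rng(2)
face_p = rng.standard_normal((4, 8))
name_p = rng.standard_normal((4, 8))
proto_p = rng.standard_normal((3, 8))
identity = np.array([0, 1, 2, 0])
loss = stage2_loss(face_p, name_p, proto_p, identity)
print(loss)
```

Whatever the precise formulation, the contrastive term drives new matches while the clustering term anchors faces to identities learned in stage 1, which is the anti-forgetting mechanism the text describes.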

5. Implementation and Hyperparameters

The following key details are used in published implementations:

| Component | Structure/Value | Notes |
|---|---|---|
| Face encoder | FaceNet, $d_f = 512$ | Pre-trained; fixed at inference |
| Name encoder | BERT, $d_n = 768$ | Includes "NONAME" ([UNK]) |
| Name projector | 1-layer ReLU MLP (768 → 512) | $g_n$ |
| Common projector | 3-layer ReLU MLP (512 → 128) | $g_c$ |
| Optimizer | Adam, lr $= 3 \times 10^{-4}$ | Batch size 20 |
| Epochs | SECLA: 30 / 3; SECLA-B: stage 1 15 / 5, stage 2 20 / 2 | LFW / CelebTo |
| Agreement weight $\alpha$ | 0.15 | No explicit temperature $\tau$ |

Prototype face selection options include random, medoid, and model-matched faces; an average-face prototype is suboptimal for diverse identities. "NONAME" is fixed as BERT's [UNK] embedding, and an optional "NOFACE" prototype sampled from noise accommodates absent links.
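The medoid option, for instance, picks the member embedding with the smallest total distance to the other faces of the same identity; a minimal NumPy sketch (the 2-D toy embeddings are made up for illustration):

```python
import numpy as np

def medoid(embeddings):
    """Return the member embedding minimizing total squared Euclidean
    distance to all other members (the set's medoid)."""
    diff = embeddings[:, None, :] - embeddings[None, :, :]
    dists = (diff ** 2).sum(axis=-1)          # pairwise distance matrix
    return embeddings[dists.sum(axis=1).argmin()]

# Two typical faces of one identity plus one outlier embedding
faces = np.array([[0.0, 0.0],
                  [0.1, 0.0],
                  [5.0, 5.0]])
print(medoid(faces))  # picks a central face, not the outlier
```

Unlike the average face, the medoid is always an actual member embedding, so it stays on the face-embedding manifold even when an identity's crops are visually diverse.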

6. Experimental Results and Evaluation

On the LFW and CelebrityTogether benchmarks, SECLA and SECLA-B demonstrate superior performance compared to previous methods:

| Dataset | Method (Author/Year) | Main Metric | Result |
|---|---|---|---|
| LFW | Pham et al. (2010) | F1 | 72.66% |
| LFW | Hinge+top-k (Hessel et al. '19) | F1 | 66.7% |
| LFW | Unidirectional losses | F1 | 40–44% |
| LFW | SECLA | F1 | 80.83% (Precision 76.96%, Recall 85.11%) |
| LFW | SECLA w/o $\alpha\,\mathcal{L}_{agree}$ | F1 | 78.14% |
| LFW | SECLA-B | F1 | 81.86% |
| CelebTo | Unidirectional losses | Acc | 41–45% |
| CelebTo | SECLA | Acc | 87.46% |
| CelebTo | SECLA-B | Acc | 88.36% |

The SECLA and SECLA-B objectives thus achieve new state-of-the-art performance on these cross-modal alignment tasks, outperforming prior SOTA by a margin of up to +9 F1 points on LFW and nearly doubling accuracy over unidirectional formulations (Qu et al., 2022).

7. Significance and Adaptability

The contrastive crop–caption loss, and in particular the SECLA/SECLA-B family, provides a robust weakly supervised mechanism for set-to-set alignment in cross-modal domains without requiring explicit instance-level supervision. The symmetric approach and bootstrapped regularization explicitly address both learning and catastrophic forgetting in iterative tasks. The methodological design is adaptable to other domains involving multimodal data and cross-modal entity linking, such as news understanding and multi-object entity resolution scenarios.

This class of losses foregrounds set-level, bidirectional, and symmetry-enforcing objectives in multimodal learning. The use of pre-trained modules (FaceNet, BERT), explicit positive/negative mining at the set level, and curriculum-bootstrapped extensions position the framework as a flexible template for broader weak supervision problems in vision–language research (Qu et al., 2022).
