Contrastive Crop–Caption Loss
- The paper introduces SECLA/SECLA-B, a method that significantly improves cross-modal face–name alignment, achieving up to 81.86% F1 on LFW.
- It employs a symmetry-enhanced contrastive mechanism, using bidirectional similarity between face and name embeddings to robustly align image crops with caption names.
- The framework incorporates a two-stage bootstrapped approach with prototype regularization to mitigate catastrophic forgetting and enhance weak supervision.
The contrastive crop–caption loss is a class of weakly supervised objectives designed for cross-modal face–name alignment tasks, where the goal is to assign detected faces (crops) within an image to the named entities appearing in the associated caption. Employing a symmetry-enhanced contrastive mechanism, this loss operates by maximizing bidirectional similarity between sets of faces and caption names, rather than relying solely on set-level uncertainty reasoning or classic multiple instance learning. The method, formalized as the Symmetry-Enhanced Contrastive Learning-based Alignment (SECLA) loss and extended in a two-stage bootstrapped version (SECLA-B), has established state-of-the-art results on benchmarks such as Labeled Faces in the Wild (LFW) and CelebrityTogether (CelebTo), and serves as a principled framework for cross-modal instance alignment tasks (Qu et al., 2022).
1. Model Architecture and Embeddings
The framework for the contrastive crop–caption loss comprises three principal components: separate encoders for faces and names, and projection modules that map their embeddings into a common latent space. Detected face crops are first embedded using a pre-trained FaceNet model to yield face vectors $f \in \mathbb{R}^{512}$. Named entities (including a special “NONAME” token for null-links) are embedded with pre-trained BERT, giving name vectors $n \in \mathbb{R}^{768}$. To harmonize dimensionality for contrastive comparison, name embeddings are first projected via a 1-layer MLP to $\mathbb{R}^{512}$; subsequently, both face and name vectors are further projected by a 3-layer MLP into a shared $\mathbb{R}^{128}$ space (all projections using ReLU activations). The final similarity score is computed as a simple inner product:
$$s(f, n) = \langle \tilde{f}, \tilde{n} \rangle,$$
where $\tilde{f}$ and $\tilde{n}$ denote the face and name embeddings after projection into the common space.
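The embedding-and-projection pipeline can be sketched as follows. Only the 768→512→128 interface comes from the published setup; the hidden widths of the 3-layer projector and the random (untrained) weights are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp(dims):
    """A stack of dense layers with He-initialized random weights (untrained stand-in)."""
    return [(rng.standard_normal((i, o)) * np.sqrt(2.0 / i), np.zeros(o))
            for i, o in zip(dims[:-1], dims[1:])]

def forward(layers, x, relu_last=False):
    """Apply the MLP; ReLU on hidden layers (and optionally the last one)."""
    for k, (W, b) in enumerate(layers):
        x = x @ W + b
        if k < len(layers) - 1 or relu_last:
            x = np.maximum(x, 0.0)
    return x

name_projector = mlp([768, 512])               # 1-layer name projector (BERT dim -> FaceNet dim)
common_projector = mlp([512, 256, 256, 128])   # 3-layer shared projector; hidden widths assumed

def similarity(face_512, name_768):
    """Inner-product score s(f, n) = <f~, n~> in the shared 128-d space."""
    f_tilde = forward(common_projector, face_512)
    n_tilde = forward(common_projector, forward(name_projector, name_768, relu_last=True))
    return float(f_tilde @ n_tilde)

score = similarity(rng.standard_normal(512), rng.standard_normal(768))
```

Because the weights here are random, `score` is meaningful only in shape and type; in training, both projectors are optimized by the contrastive objective below while the FaceNet and BERT encoders stay frozen.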
2. Pair Definitions and Contrastive Setup
Within each training batch of $B$ image–caption pairs, the $i$-th pair provides a local set of faces $F_i$ and names $N_i$. All intra-pair face–name combinations are taken as positive matches. Negative samples for the face-to-name direction consist of all names from the other captions in the batch (and symmetrically for the name-to-face direction). This construction provides dense contrastive supervision—a critical departure from “unidirectional” or binary loss formulations, which demonstrate markedly inferior performance (F1 ≈ 40–44% on LFW versus 80.83% for SECLA).
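As an illustration of this pair construction, the positive/negative structure for a toy batch with ragged face and name sets (the set sizes below are invented for the example) can be materialized as a boolean mask:

```python
import numpy as np

# Hypothetical batch of 3 image-caption pairs: image i has faces_per[i] detected
# face crops and names_per[i] caption names.
faces_per = [2, 1, 3]
names_per = [2, 2, 1]

face_img = np.repeat(np.arange(len(faces_per)), faces_per)  # image id of each face
name_img = np.repeat(np.arange(len(names_per)), names_per)  # image id of each name

# positive_mask[a, b] is True iff face a and name b come from the same
# image-caption pair; every cross-pair combination serves as a negative
# in both the face-to-name and name-to-face directions.
positive_mask = face_img[:, None] == name_img[None, :]
```

The mask view makes the density of the supervision explicit: every face participates in one positive set and in negatives against all other captions in the batch.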
In the SECLA-B variant, following a first stage of easier one-to-one matches, positive sets are explicitly grounded in the face–name pairs established during that stage, while negative samples leverage prototype embeddings from other identities.
3. Symmetry-Enhanced Contrastive Loss Formulation (SECLA)
The SECLA loss is constructed to enforce symmetric, dense alignment between faces and names at the set level. For the face set $F_i$ and name set $N_j$ of any two batch items $i$ and $j$:
- The dense similarity from faces to names is
$$S_{f \to n}(F_i, N_j) = \frac{1}{|F_i|} \sum_{f \in F_i} \max_{n \in N_j} s(f, n).$$
- Symmetrically, the dense similarity from names to faces is
$$S_{n \to f}(N_i, F_j) = \frac{1}{|N_i|} \sum_{n \in N_i} \max_{f \in F_j} s(f, n).$$
The batchwise contrastive losses become
$$\mathcal{L}_{f \to n} = -\frac{1}{B} \sum_{i=1}^{B} \log \frac{\exp\!\big(S_{f \to n}(F_i, N_i)\big)}{\sum_{j=1}^{B} \exp\!\big(S_{f \to n}(F_i, N_j)\big)}, \qquad
\mathcal{L}_{n \to f} = -\frac{1}{B} \sum_{i=1}^{B} \log \frac{\exp\!\big(S_{n \to f}(N_i, F_i)\big)}{\sum_{j=1}^{B} \exp\!\big(S_{n \to f}(N_i, F_j)\big)}.$$
To further enforce bidirectional consistency, an agreement loss penalizes asymmetry between the face-to-name and name-to-face similarities of each image–caption pair:
$$\mathcal{L}_{agree} = \frac{1}{B} \sum_{i=1}^{B} \big| S_{f \to n}(F_i, N_i) - S_{n \to f}(N_i, F_i) \big|.$$
The composite SECLA objective is
$$\mathcal{L}_{SECLA} = \mathcal{L}_{f \to n} + \mathcal{L}_{n \to f} + \alpha \, \mathcal{L}_{agree},$$
with agreement weight $\alpha = 0.15$.
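A minimal NumPy sketch of this objective, operating on already-projected embeddings and assuming one plausible reading of the dense set similarity (each face's best-matching name score is averaged over the set, and vice versa):

```python
import numpy as np

def secla_loss(F_sets, N_sets, alpha=0.15):
    """SECLA-style loss on a batch of projected face/name embedding sets.

    F_sets[i]: (num_faces_i, d) array of face embeddings for pair i
    N_sets[i]: (num_names_i, d) array of name embeddings for pair i
    """
    B = len(F_sets)

    def s_fn(F, N):  # faces -> names: average of each face's best name score
        return np.mean(np.max(F @ N.T, axis=1))

    def s_nf(N, F):  # names -> faces: average of each name's best face score
        return np.mean(np.max(N @ F.T, axis=1))

    S_fn = np.array([[s_fn(F_sets[i], N_sets[j]) for j in range(B)] for i in range(B)])
    S_nf = np.array([[s_nf(N_sets[i], F_sets[j]) for j in range(B)] for i in range(B)])

    def info_nce(S):  # diagonal entries are the matched (positive) pairs
        logZ = np.log(np.exp(S).sum(axis=1))
        return np.mean(logZ - np.diag(S))

    L_fn, L_nf = info_nce(S_fn), info_nce(S_nf)
    L_agree = np.mean(np.abs(np.diag(S_fn) - np.diag(S_nf)))
    return L_fn + L_nf + alpha * L_agree
```

A sanity check of the intended behavior: when each caption's names are embedded close to that image's faces, the loss is near zero; shuffling the caption sets across the batch drives it up.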
4. Bootstrapping and Regularization in SECLA-B
SECLA-B introduces a two-stage curriculum. Stage 1 restricts training to easy cases—pairs with a small number of faces/names and no null-links—using $\mathcal{L}_{SECLA}$, yielding robust initial face–name links. Stage 2 generalizes to the full dataset, partitioning each batch into previously matched pairs ($B_{old}$) and unmatched pairs ($B_{new}$). For $B_{new}$, $\mathcal{L}_{SECLA}$ continues as before. For $B_{old}$, a bootstrapping and regularization loss ($\mathcal{L}_{boot}$) leverages prototype faces for each known identity, combining:
- a bidirectional face–name–prototype contrastive loss ($\mathcal{L}_{fnp}$), and
- a face–prototype clustering loss ($\mathcal{L}_{clust}$).
Explicitly:
$$\mathcal{L}_{boot} = \mathcal{L}_{fnp} + \mathcal{L}_{clust}.$$
This dual objective enforces continued learning of new correspondences while preventing forgetting of already-learned face–name links.
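The stage-2 regularizer can be sketched as follows; the exact functional forms and term weighting are not restated in this summary, so a plain contrastive-plus-squared-distance combination is assumed here:

```python
import numpy as np

def bootstrap_loss(faces, ids, prototypes):
    """Sketch of a SECLA-B stage-2 regularizer (assumed form).

    faces:      (n, d) projected embeddings of previously matched faces
    ids:        (n,) identity index assigned to each face in stage 1
    prototypes: (k, d) one prototype embedding per known identity
    """
    S = faces @ prototypes.T                      # (n, k) face-prototype scores
    # Contrastive term: the prototypes of all other identities act as negatives,
    # keeping each face closest to its own identity's prototype.
    logZ = np.log(np.exp(S).sum(axis=1))
    l_contrast = np.mean(logZ - S[np.arange(len(ids)), ids])
    # Clustering term: squared distance to the assigned prototype.
    l_cluster = np.mean(np.sum((faces - prototypes[ids]) ** 2, axis=1))
    return l_contrast + l_cluster
```

Keeping old faces anchored to fixed prototypes is what counteracts forgetting: stage-2 gradients from new, harder pairs cannot drift previously matched identities away from their established embeddings.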
5. Implementation and Hyperparameters
The following key details are used in published implementations:
| Component | Structure/Value | Notes |
|---|---|---|
| Face encoder | FaceNet, $d = 512$ | Pre-trained; fixed at inference |
| Name encoder | BERT, $d = 768$ | Includes “NONAME” ([UNK]) |
| Name projector | 1-layer ReLU MLP (768→512) | |
| Common proj. | 3-layer ReLU MLP (512→128) | |
| Optimizer | Adam | Batch size 20 |
| Epochs | SECLA: 30 (LFW), 3 (CelebTo); SECLA-B: stage-1 15/5, stage-2 20/2 | LFW/CelebTo |
| Agreement weight $\alpha$ | 0.15 | No explicit temperature |
Prototype face selection options include random, medoid, and model-matched faces. An average-face prototype is suboptimal for diverse identities. “NONAME” is fixed as BERT’s [UNK] embedding, with an optional “NOFACE” prototype from noise to accommodate absent links.
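For the medoid option, a small helper (illustrative; the function name and layout are not from the paper) selects the face whose total distance to the identity's other faces is smallest:

```python
import numpy as np

def medoid_prototype(face_embs):
    """Pick the medoid face embedding for one identity.

    face_embs: (n, d) embeddings of all faces linked to the identity.
    Returns the row minimizing the summed Euclidean distance to all rows,
    i.e. a real exemplar rather than a synthetic average face.
    """
    # Pairwise distance matrix via broadcasting: (n, n)
    d = np.linalg.norm(face_embs[:, None, :] - face_embs[None, :, :], axis=-1)
    return face_embs[np.argmin(d.sum(axis=1))]
```

Unlike an average-face prototype, the medoid is always an actual observed face, which is why it remains usable when an identity's appearance varies widely.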
6. Experimental Results and Evaluation
On the LFW and CelebrityTogether benchmarks, SECLA and SECLA-B demonstrate superior performance compared to previous methods:
| Dataset | Method (Author/Year) | Main Metric | Result |
|---|---|---|---|
| LFW | Pham et al. (2010) | F1 | 72.66% |
| LFW | Hinge+top-k (Hessel et al. ’19) | F1 | 66.7% |
| LFW | Unidirectional losses | F1 | 40–44% |
| LFW | SECLA | F1 | 80.83% (Precision=76.96%, Recall=85.11%) |
| LFW | SECLA without $\alpha\mathcal{L}_{agree}$ (ablation) | F1 | 78.14% |
| LFW | SECLA-B | F1 | 81.86% |
| CelebTo | Unidirectional losses | Acc | 41–45% |
| CelebTo | SECLA | Acc | 87.46% |
| CelebTo | SECLA-B | Acc | 88.36% |
The SECLA and SECLA-B objectives thus achieve new state-of-the-art performance on these cross-modal alignment tasks, outperforming prior SOTA by a margin of up to +9 F1 points on LFW and nearly doubling accuracy over unidirectional formulations (Qu et al., 2022).
7. Significance and Adaptability
The contrastive crop–caption loss, and in particular the SECLA/SECLA-B family, provides a robust weakly supervised mechanism for set-to-set alignment in cross-modal domains without requiring explicit instance-level supervision. The symmetric approach and bootstrapped regularization explicitly address both learning and catastrophic forgetting in iterative tasks. The methodological design is adaptable to other domains involving multimodal data and cross-modal entity linking, such as news understanding and multi-object entity resolution scenarios.
This class of losses foregrounds set-level, bidirectional, and symmetry-enforcing objectives in multimodal learning. The use of pre-trained modules (FaceNet, BERT), explicit positive/negative mining at the set level, and curriculum-bootstrapped extensions position the framework as a flexible template for broader weak supervision problems in vision–language research (Qu et al., 2022).