Contrastive Crop–Caption Loss
- The paper introduces SECLA/SECLA-B, a method that significantly improves cross-modal face–name alignment, achieving up to 81.86% F1 on LFW.
- It employs a symmetry-enhanced contrastive mechanism, using bidirectional similarity between face and name embeddings to robustly align image crops with caption names.
- The framework incorporates a two-stage bootstrapped approach with prototype regularization to mitigate catastrophic forgetting and enhance weak supervision.
The contrastive crop–caption loss is a class of weakly supervised objectives designed for cross-modal face–name alignment tasks, where the goal is to assign detected faces (crops) within an image to the named entities appearing in the associated caption. Employing a symmetry-enhanced contrastive mechanism, this loss operates by maximizing bidirectional similarity between sets of faces and caption names, rather than relying solely on set-level uncertainty reasoning or classic multiple instance learning. The method, formalized as the Symmetry-Enhanced Contrastive Learning-based Alignment (SECLA) loss and extended in a two-stage bootstrapped version (SECLA-B), has established state-of-the-art results on benchmarks such as Labeled Faces in the Wild (LFW) and CelebrityTogether (CelebTo), and serves as a principled framework for cross-modal instance alignment tasks (Qu et al., 2022).
1. Model Architecture and Embeddings
The framework for the contrastive crop–caption loss comprises three principal components: separate encoders for faces and names, and projection modules that map their embeddings into a common latent space. Detected face crops are first embedded using a pre-trained FaceNet model to yield face vectors $f \in \mathbb{R}^{512}$. Named entities (including a special “NONAME” token for null-links) are embedded with pre-trained BERT, giving name vectors $n \in \mathbb{R}^{768}$. To harmonize dimensionality for contrastive comparison, name embeddings are first projected via a 1-layer MLP to $\mathbb{R}^{512}$; subsequently, both face and name vectors are further projected by a 3-layer MLP into a shared $\mathbb{R}^{128}$ space (all projections using ReLU activations). The final similarity score is computed as a simple inner product:
$$s(f, n) = \langle \tilde{f}, \tilde{n} \rangle,$$
where $\tilde{f}$ and $\tilde{n}$ denote the face and name embeddings after projection into the common space.
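The embedding-and-projection pipeline can be sketched as follows. Only the 768→512→128 interface comes from the published setup; the hidden widths of the 3-layer projector and the random (untrained) weights are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp(dims):
    """A stack of dense layers with He-initialized random weights (untrained stand-in)."""
    return [(rng.standard_normal((i, o)) * np.sqrt(2.0 / i), np.zeros(o))
            for i, o in zip(dims[:-1], dims[1:])]

def forward(layers, x, relu_last=False):
    """Apply the MLP; ReLU on hidden layers (and optionally the last one)."""
    for k, (W, b) in enumerate(layers):
        x = x @ W + b
        if k < len(layers) - 1 or relu_last:
            x = np.maximum(x, 0.0)
    return x

name_projector = mlp([768, 512])               # 1-layer name projector (BERT dim -> FaceNet dim)
common_projector = mlp([512, 256, 256, 128])   # 3-layer shared projector; hidden widths assumed

def similarity(face_512, name_768):
    """Inner-product score s(f, n) = <f~, n~> in the shared 128-d space."""
    f_tilde = forward(common_projector, face_512)
    n_tilde = forward(common_projector, forward(name_projector, name_768, relu_last=True))
    return float(f_tilde @ n_tilde)

score = similarity(rng.standard_normal(512), rng.standard_normal(768))
```

Because the weights here are random, `score` is meaningful only in shape and type; in training, both projectors are optimized by the contrastive objective below while the FaceNet and BERT encoders stay frozen.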
2. Pair Definitions and Contrastive Setup
Within each training batch of $B$ image–caption pairs, the $i$-th pair provides a local set of faces $F_i$ and names $N_i$. All intra-pair face–name combinations are taken as positive matches. Negative samples for the face-to-name direction consist of all names from the other captions in the batch (and symmetrically for the name-to-face direction). This construction provides dense contrastive supervision—a critical departure from “unidirectional” or binary loss formulations, which demonstrate markedly inferior performance (F1 ≈ 40–44% on LFW versus 80.83% for SECLA).
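As an illustration of this pair construction, the positive/negative structure for a toy batch with ragged face and name sets (the set sizes below are invented for the example) can be materialized as a boolean mask:

```python
import numpy as np

# Hypothetical batch of 3 image-caption pairs: image i has faces_per[i] detected
# face crops and names_per[i] caption names.
faces_per = [2, 1, 3]
names_per = [2, 2, 1]

face_img = np.repeat(np.arange(len(faces_per)), faces_per)  # image id of each face
name_img = np.repeat(np.arange(len(names_per)), names_per)  # image id of each name

# positive_mask[a, b] is True iff face a and name b come from the same
# image-caption pair; every cross-pair combination serves as a negative
# in both the face-to-name and name-to-face directions.
positive_mask = face_img[:, None] == name_img[None, :]
```

The mask view makes the density of the supervision explicit: every face participates in one positive set and in negatives against all other captions in the batch.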
In the SECLA-B variant, following a first stage of easier one-to-one matches, positive sets are explicitly grounded in the face–name pairs established during that stage, while negative samples leverage prototype embeddings from other identities.
3. Symmetry-Enhanced Contrastive Loss Formulation (SECLA)
The SECLA loss is constructed to enforce symmetric, dense alignment between faces and names at the set level. For the face set $F_i$ and name set $N_j$ of any two batch items $i$ and $j$:
- The dense similarity from faces to names is
$$S_{f \to n}(F_i, N_j) = \frac{1}{|F_i|} \sum_{f \in F_i} \max_{n \in N_j} s(f, n).$$
- Symmetrically, the dense similarity from names to faces is
$$S_{n \to f}(N_i, F_j) = \frac{1}{|N_i|} \sum_{n \in N_i} \max_{f \in F_j} s(f, n).$$
The batchwise contrastive losses become
$$\mathcal{L}_{f \to n} = -\frac{1}{B} \sum_{i=1}^{B} \log \frac{\exp\!\big(S_{f \to n}(F_i, N_i)\big)}{\sum_{j=1}^{B} \exp\!\big(S_{f \to n}(F_i, N_j)\big)}, \qquad
\mathcal{L}_{n \to f} = -\frac{1}{B} \sum_{i=1}^{B} \log \frac{\exp\!\big(S_{n \to f}(N_i, F_i)\big)}{\sum_{j=1}^{B} \exp\!\big(S_{n \to f}(N_i, F_j)\big)}.$$
To further enforce bidirectional consistency, an agreement loss penalizes asymmetry between the face-to-name and name-to-face similarities of each image–caption pair:
$$\mathcal{L}_{agree} = \frac{1}{B} \sum_{i=1}^{B} \big| S_{f \to n}(F_i, N_i) - S_{n \to f}(N_i, F_i) \big|.$$
The composite SECLA objective is
$$\mathcal{L}_{SECLA} = \mathcal{L}_{f \to n} + \mathcal{L}_{n \to f} + \alpha \, \mathcal{L}_{agree},$$
with agreement weight $\alpha = 0.15$.
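A minimal NumPy sketch of this objective, operating on already-projected embeddings and assuming one plausible reading of the dense set similarity (each face's best-matching name score is averaged over the set, and vice versa):

```python
import numpy as np

def secla_loss(F_sets, N_sets, alpha=0.15):
    """SECLA-style loss on a batch of projected face/name embedding sets.

    F_sets[i]: (num_faces_i, d) array of face embeddings for pair i
    N_sets[i]: (num_names_i, d) array of name embeddings for pair i
    """
    B = len(F_sets)

    def s_fn(F, N):  # faces -> names: average of each face's best name score
        return np.mean(np.max(F @ N.T, axis=1))

    def s_nf(N, F):  # names -> faces: average of each name's best face score
        return np.mean(np.max(N @ F.T, axis=1))

    S_fn = np.array([[s_fn(F_sets[i], N_sets[j]) for j in range(B)] for i in range(B)])
    S_nf = np.array([[s_nf(N_sets[i], F_sets[j]) for j in range(B)] for i in range(B)])

    def info_nce(S):  # diagonal entries are the matched (positive) pairs
        logZ = np.log(np.exp(S).sum(axis=1))
        return np.mean(logZ - np.diag(S))

    L_fn, L_nf = info_nce(S_fn), info_nce(S_nf)
    L_agree = np.mean(np.abs(np.diag(S_fn) - np.diag(S_nf)))
    return L_fn + L_nf + alpha * L_agree
```

A sanity check of the intended behavior: when each caption's names are embedded close to that image's faces, the loss is near zero; shuffling the caption sets across the batch drives it up.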
4. Bootstrapping and Regularization in SECLA-B
SECLA-B introduces a two-stage curriculum. Stage 1 restricts training to easy cases—pairs with a small number of faces/names and no null-links—using $\mathcal{L}_{SECLA}$, yielding robust initial face–name links. Stage 2 generalizes to the full dataset, partitioning each batch into previously matched pairs ($B_{old}$) and unmatched pairs ($B_{new}$). For $B_{new}$, $\mathcal{L}_{SECLA}$ continues as before. For $B_{old}$, a bootstrapping and regularization loss ($\mathcal{L}_{boot}$) leverages prototype faces for each known identity, combining:
- a bidirectional face–name–prototype contrastive loss ($\mathcal{L}_{fnp}$), and
- a face–prototype clustering loss ($\mathcal{L}_{clust}$).
Explicitly:
$$\mathcal{L}_{boot} = \mathcal{L}_{fnp} + \mathcal{L}_{clust}.$$
This dual objective enforces continued learning of new correspondences while preventing forgetting of already-learned face–name links.
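The stage-2 regularizer can be sketched as follows; the exact functional forms and term weighting are not restated in this summary, so a plain contrastive-plus-squared-distance combination is assumed here:

```python
import numpy as np

def bootstrap_loss(faces, ids, prototypes):
    """Sketch of a SECLA-B stage-2 regularizer (assumed form).

    faces:      (n, d) projected embeddings of previously matched faces
    ids:        (n,) identity index assigned to each face in stage 1
    prototypes: (k, d) one prototype embedding per known identity
    """
    S = faces @ prototypes.T                      # (n, k) face-prototype scores
    # Contrastive term: the prototypes of all other identities act as negatives,
    # keeping each face closest to its own identity's prototype.
    logZ = np.log(np.exp(S).sum(axis=1))
    l_contrast = np.mean(logZ - S[np.arange(len(ids)), ids])
    # Clustering term: squared distance to the assigned prototype.
    l_cluster = np.mean(np.sum((faces - prototypes[ids]) ** 2, axis=1))
    return l_contrast + l_cluster
```

Keeping old faces anchored to fixed prototypes is what counteracts forgetting: stage-2 gradients from new, harder pairs cannot drift previously matched identities away from their established embeddings.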
5. Implementation and Hyperparameters
The following key details are used in published implementations:
| Component | Structure/Value | Notes |
|---|---|---|
| Face encoder | FaceNet, $d = 512$ | Pre-trained; fixed at inference |
| Name encoder | BERT, $d = 768$ | Includes “NONAME” ([UNK]) |
| Name projector | 1-layer ReLU MLP (768→512) | |
| Common proj. | 3-layer ReLU MLP (512→128) | |
| Optimizer | Adam | Batch size 20 |
| Epochs | SECLA: 30 (LFW), 3 (CelebTo); SECLA-B: stage-1 15/5, stage-2 20/2 | LFW/CelebTo |
| Agreement weight $\alpha$ | 0.15 | No explicit temperature |
Prototype face selection options include random, medoid, and model-matched faces. An average-face prototype is suboptimal for diverse identities. “NONAME” is fixed as BERT’s [UNK] embedding, with an optional “NOFACE” prototype from noise to accommodate absent links.
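For the medoid option, a small helper (illustrative; the function name and layout are not from the paper) selects the face whose total distance to the identity's other faces is smallest:

```python
import numpy as np

def medoid_prototype(face_embs):
    """Pick the medoid face embedding for one identity.

    face_embs: (n, d) embeddings of all faces linked to the identity.
    Returns the row minimizing the summed Euclidean distance to all rows,
    i.e. a real exemplar rather than a synthetic average face.
    """
    # Pairwise distance matrix via broadcasting: (n, n)
    d = np.linalg.norm(face_embs[:, None, :] - face_embs[None, :, :], axis=-1)
    return face_embs[np.argmin(d.sum(axis=1))]
```

Unlike an average-face prototype, the medoid is always an actual observed face, which is why it remains usable when an identity's appearance varies widely.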
6. Experimental Results and Evaluation
On the LFW and CelebrityTogether benchmarks, SECLA and SECLA-B demonstrate superior performance compared to previous methods:
| Dataset | Method (Author/Year) | Main Metric | Result |
|---|---|---|---|
| LFW | Pham et al. (2010) | F1 | 72.66% |
| LFW | Hinge+top-k (Hessel et al. ’19) | F1 | 66.7% |
| LFW | Unidirectional losses | F1 | 40–44% |
| LFW | SECLA | F1 | 80.83% (Precision=76.96%, Recall=85.11%) |
| LFW | SECLA without $\alpha\mathcal{L}_{agree}$ (ablation) | F1 | 78.14% |
| LFW | SECLA-B | F1 | 81.86% |
| CelebTo | Unidirectional losses | Acc | 41–45% |
| CelebTo | SECLA | Acc | 87.46% |
| CelebTo | SECLA-B | Acc | 88.36% |
The SECLA and SECLA-B objectives thus achieve new state-of-the-art performance on these cross-modal alignment tasks, outperforming prior SOTA by a margin of up to +9 F1 points on LFW and nearly doubling accuracy over unidirectional formulations (Qu et al., 2022).
7. Significance and Adaptability
The contrastive crop–caption loss, and in particular the SECLA/SECLA-B family, provides a robust weakly supervised mechanism for set-to-set alignment in cross-modal domains without requiring explicit instance-level supervision. The symmetric approach and bootstrapped regularization explicitly address both learning and catastrophic forgetting in iterative tasks. The methodological design is adaptable to other domains involving multimodal data and cross-modal entity linking, such as news understanding and multi-object entity resolution scenarios.
This class of losses foregrounds set-level, bidirectional, and symmetry-enforcing objectives in multimodal learning. The use of pre-trained modules (FaceNet, BERT), explicit positive/negative mining at the set level, and curriculum-bootstrapped extensions position the framework as a flexible template for broader weak supervision problems in vision–language research (Qu et al., 2022).