Face Similarity Metric: A Perceptual Approach
- A face similarity metric is a quantitative measure that captures continuous facial resemblance based on human perceptual judgments rather than binary identity verification.
- It employs a ResNet50-based ArcFace framework with triplet loss optimization on a human-annotated SimCelebA dataset to refine facial embeddings.
- The approach enhances privacy-preserving face anonymization by balancing attribute coherence with perceptual dissimilarity, outperforming traditional identity-based models.
A face similarity metric is a quantitative measure designed to assess the degree of resemblance between two face images, going beyond binary identity determination. In contrast to traditional identity verification systems that output “same” versus “different” classifications, a face similarity metric seeks to model continuous similarity in perceptual appearance as judged by humans. Recent advances address the need for nuanced, identity-independent similarity assessment, particularly in applications such as privacy-preserving face anonymization, where a delicate balance between naturalism and anonymity is required. Contemporary approaches leverage metric learning frameworks with supervised annotations from human judgments, triplet-based loss formulations, and deep neural architectures to construct embeddings that reflect perceived facial similarity rather than strict identity matching (Kumagai et al., 24 Sep 2025).
1. Motivation: From Identity Classification to Perceptual Similarity
Conventional face recognition systems and their underlying metrics—including cosine similarity applied to embeddings from models like ArcFace—are optimized for verifying or identifying identity. These systems are typically trained with losses that push apart any two different identities, with no gradient for “how different” they are perceptually. This design is limiting in face anonymization: source faces that are “highly similar but different” remain close in embedding space to the target, and the resulting swap can potentially compromise privacy (Kumagai et al., 24 Sep 2025). Moreover, using highly dissimilar sources can yield unnatural or artifact-laden results because attributes such as gender, age, and facial structure are not preserved.
To address this, a metric that quantifies “degree of facial similarity” (not just same/not-same) according to human perception is required. Such a metric supports more informed selection of swap candidates, thereby achieving a better trade-off between anonymization (maximizing difference to the original) and naturalness (sharing major attributes).
2. Human-Annotated Similarity Dataset and Problem Formulation
The construction of a perceptual similarity metric begins with the definition of a supervised learning task grounded in human judgments. The PerFace framework introduces a new dataset, SimCelebA, specifically sampled and annotated for perceptual similarity in face-swapped images. The dataset is generated as follows:
- Source: CelebAMask-HQ, a high-quality set featuring diverse facial images.
- Face swapping: SimSwap is employed to generate images where each target face is replaced by one of 240 manually selected sources.
- Annotation protocol: Each data point is a triplet $(x, x_1, x_2)$, where $x$ is the reference (target) and $x_1$, $x_2$ are two candidate swapped images. Annotators (minimum three per sample, with attention checks via dummy triplets) indicate which of $x_1$ or $x_2$ is more similar to $x$. This design yields 6,400 reliably annotated triplets.
This dataset is constructed to capture fine-grained perceptual differences—such as the distinction between “completely different” and “highly similar but distinct” faces—which are inaccessible to traditional identity-centric labels.
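As a concrete illustration of how such annotations might be turned into training labels, the sketch below takes a majority vote over the (at least three) annotator responses per triplet and excludes workers who fail the dummy-triplet attention checks. The data format, field names, and thresholds here are assumptions for illustration; the paper does not specify an aggregation scheme.

```python
from collections import Counter
from dataclasses import dataclass

@dataclass
class Annotation:
    triplet_id: int    # index of the (reference, candidate_1, candidate_2) triplet
    worker_id: str
    choice: int        # 1 or 2: which candidate the worker judged more similar
    is_dummy: bool     # attention-check triplet with a known correct answer
    dummy_answer: int = 0  # ground-truth choice for dummy triplets

def aggregate_labels(annotations, min_votes=3):
    """Majority-vote label per triplet, dropping workers who fail attention checks.

    Hypothetical aggregation scheme; the paper only states that each sample
    has at least three annotators and that dummy triplets screen for attention.
    """
    # Workers who answer any dummy triplet incorrectly are excluded entirely.
    failed = {a.worker_id for a in annotations
              if a.is_dummy and a.choice != a.dummy_answer}
    votes = {}
    for a in annotations:
        if a.is_dummy or a.worker_id in failed:
            continue
        votes.setdefault(a.triplet_id, []).append(a.choice)

    labels = {}
    for tid, v in votes.items():
        if len(v) < min_votes:
            continue                      # not enough reliable votes
        winner, count = Counter(v).most_common(1)[0]
        if count > len(v) / 2:            # require a strict majority
            labels[tid] = winner
    return labels
```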
3. Learning a Perceptual Similarity Metric via Triplet Loss
The PerFace metric is instantiated by finetuning a ResNet50-based ArcFace model on the triplet-annotated SimCelebA dataset. The optimization goal is to adjust the embedding space such that pairs perceived as more similar by humans are closer under the metric than less similar pairs, irrespective of class identity.
The loss function is as follows:

$$
\mathcal{L} = \max\left(0,\ \cos\bigl(f(x), f(x_n)\bigr) - \cos\bigl(f(x), f(x_p)\bigr) + m\right)
$$

where:
- $f(x)$ is the embedding of reference image $x$,
- $x_p$ is the more similar candidate (per annotators),
- $x_n$ is the less similar candidate,
- $\cos(\cdot, \cdot)$ denotes cosine similarity,
- $m$ is a positive margin.
By minimizing $\mathcal{L}$, the model is encouraged to pull the reference and the more similar candidate together while separating the less similar candidate by at least the margin. The embedding dimensionality is maintained at 512 for compatibility with common face recognition practice. This approach enables the metric to encode subtleties in human-perceived facial resemblance, which is critical for downstream anonymization decisions.
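A minimal PyTorch sketch of this objective follows. The margin value `margin=0.1` is an assumption (the paper specifies a cosine-similarity triplet loss and 512-dimensional embeddings, but not the exact hyperparameters), and `perceptual_triplet_loss` is an illustrative name rather than the authors' code.

```python
import torch
import torch.nn.functional as F

def perceptual_triplet_loss(f_ref, f_pos, f_neg, margin=0.1):
    """Cosine-similarity triplet loss over face embeddings.

    f_ref: embeddings of the reference (target) images, shape (B, 512)
    f_pos: embeddings of the candidates annotators judged MORE similar
    f_neg: embeddings of the candidates annotators judged LESS similar
    margin: positive margin m (the value here is an assumption)
    """
    sim_pos = F.cosine_similarity(f_ref, f_pos, dim=-1)  # cos(f(x), f(x_p))
    sim_neg = F.cosine_similarity(f_ref, f_neg, dim=-1)  # cos(f(x), f(x_n))
    # Hinge: push the more-similar pair ahead of the less-similar pair by m.
    return F.relu(sim_neg - sim_pos + margin).mean()

# Usage with a ResNet50 ArcFace backbone being finetuned (pseudonymous):
# backbone = ...  # pretrained model returning 512-d features
# loss = perceptual_triplet_loss(backbone(x), backbone(x_p), backbone(x_n))
```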
4. Two-Stage Candidate Selection for Face Anonymization
The PerFace framework deploys the learned similarity metric in a two-step anonymization process:
- Attribute-Based Grouping: Candidate source faces are prestratified by coarse semantic attributes (e.g., gender, age) to form attribute groups. Manual annotation is used for high reliability.
- Perceptual Similarity Ranking within Attribute Groups: To select a face-swap source, the system first identifies the attribute group most closely matching the target, then ranks the group's candidates by their PerFace similarity score to the target. A candidate with low perceived similarity to the original (but with matched attributes) is selected for swapping.
This staged design ensures that attribute coherence (and thus naturalness) is preserved, while the similarity metric enforces the anonymity constraint by penalizing close perceptual matches.
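A minimal sketch of this two-stage selection is shown below, assuming precomputed attribute labels and embeddings under the learned metric. The exact attribute-matching rule and the choice of the single least-similar candidate are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def select_swap_source(target_emb, target_attrs, candidates):
    """Two-stage source selection: attribute grouping, then similarity ranking.

    target_emb:   (512,) embedding of the target face under the learned metric
    target_attrs: dict of coarse attributes, e.g. {"gender": "F", "age": "adult"}
    candidates:   list of (embedding, attrs) pairs for candidate source faces
    Returns the index of the attribute-matched candidate least similar to the
    target, i.e. most anonymizing while remaining attribute-coherent.
    """
    # Stage 1: keep only candidates whose coarse attributes match the target.
    group = [(i, emb) for i, (emb, attrs) in enumerate(candidates)
             if attrs == target_attrs]
    if not group:
        raise ValueError("no candidate shares the target's attribute group")

    # Stage 2: rank by perceptual similarity and pick the least similar.
    sims = [(i, F.cosine_similarity(target_emb, emb, dim=0).item())
            for i, emb in group]
    best_idx, _ = min(sims, key=lambda t: t[1])
    return best_idx
```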
5. Empirical Evaluation and Comparative Performance
In quantitative tests, the PerFace metric is found to outperform traditional recognition-based models (ArcFace, VGG Face, FaceNet, OpenFace, etc.) in predicting human-perceived facial similarity. Accuracy—here defined as the rate at which the model prefers the same candidate as the human annotators—reaches approximately 0.917 for PerFace, compared to scores of 0.576–0.750 for baselines. The model also demonstrates improvements when used for downstream attribute-group classification tasks: for example, discriminative power in gender and age grouping is enhanced, as seen by higher AUC and group accuracy post-finetuning.
Summary of reported triplet prediction accuracy:
| Model | Triplet Prediction Accuracy |
|---|---|
| PerFace | ~0.917 |
| Baselines* | 0.576–0.750 |
*ArcFace, FaceNet, VGG Face, etc.
This demonstrates that learning from human perception not only boosts performance in face similarity prediction but also sharpens the semantic structure of the embedding space for attribute-driven tasks.
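For reference, triplet prediction accuracy as defined above could be computed as in the following sketch, where the model "prefers" whichever candidate has the higher cosine similarity to the reference; the function and variable names are illustrative, not from the paper.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def triplet_prediction_accuracy(model, triplets):
    """Fraction of triplets on which the model agrees with human annotators.

    triplets: iterable of (x_ref, x_more, x_less) image batches, where
              annotators judged x_more as more similar to x_ref than x_less.
    """
    correct, total = 0, 0
    for x_ref, x_more, x_less in triplets:
        f_ref, f_more, f_less = model(x_ref), model(x_more), model(x_less)
        sim_more = F.cosine_similarity(f_ref, f_more, dim=-1)
        sim_less = F.cosine_similarity(f_ref, f_less, dim=-1)
        correct += (sim_more > sim_less).sum().item()  # agrees with humans
        total += sim_more.numel()
    return correct / total
```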
6. Applications and Implications
The human-perception-based face similarity metric developed in PerFace enables fine-grained, privacy-aware face anonymization by allowing selection of swap candidates that are maximally dissimilar (thus protecting identity) yet attribute-coherent (thus maintaining realism). Its utility extends to media forensics, de-biasing recognition pipelines, and other scenarios where nuanced assessment of facial resemblance is desirable.
A plausible implication is that integrating perceptual similarity supervision into face analysis models can help bridge the gap between algorithmic metrics and subjective human criteria, improving the utility of face technologies in privacy-sensitive, legal, and social contexts.
7. Limitations and Prospects
The primary limitation observed is the subjectivity inherent in human annotation of facial similarity, which may vary across demographics and cultures. The authors suggest that more diverse annotation and expanded datasets could further generalize the metric. Future directions include exploring alternative supervision schemes and integrating perceptual similarity objectives with artifact robustness and attribute disentanglement, to refine the stability and transferability of the metric.
In summary, face similarity metrics grounded in human perceptual judgments, as instantiated by the PerFace framework, represent a significant methodological advance over identity-based approaches, providing a scalable solution to nuanced facial similarity assessment in practical anonymization pipelines and related domains (Kumagai et al., 24 Sep 2025).