WebFace42M: Scalable Face Recognition Dataset
- The dataset provides a million-scale, cleaned face recognition training set with 42M images and 2M identities using an automated CAST pipeline.
- It features rigorous evaluation protocols including FRUITS for real-world latency constraints and diverse test sets for standard, masked, and unbiased recognition.
- WebFace42M supports training across lightweight to heavyweight architectures, achieving superior benchmarks in accuracy and fairness compared to prior datasets.
WebFace42M is a publicly accessible, million-scale face recognition training dataset comprising over 2 million identities and 42 million face images. As the cleaned backbone of the broader WebFace260M benchmark, WebFace42M is designed to catalyze research in deep face recognition at scale, supporting robust model training and rigorous, deployment-relevant evaluation. It was built with a fully automated, iterative self-training pipeline to achieve high data purity, and it is accompanied by comprehensive protocols and test sets that reflect industrial deployment scenarios (Zhu et al., 2022, Zhu et al., 2021).
1. Dataset Composition and Coverage
WebFace42M was distilled from the uncurated WebFace260M, which collects 4,008,130 identity "folders" containing 260,890,076 detected face images. Candidate identities were drawn from 1 million MS1M names and approximately 3 million IMDB names. Images were downloaded via Google Image Search, with faces detected and five-point aligned using RetinaFace (score ≥ 0.7). The source data spans more than 200 countries and regions, 500 professions, and a broad birth-date range from 1846 to the present.
The cleaned WebFace42M dataset contains 2,059,906 identities and 42,474,558 face images. Each identity is represented by 3 to over 300 images (mean ≈ 21), with yaw pose distribution covering ±90°, and substantial age, race, and gender diversity. An estimated label noise rate of less than 10% was achieved. The per-identity image count distribution is long-tailed: 60% of identities have 3–10 images, 30% between 10–50, and 10% more than 50 images (Zhu et al., 2022, Zhu et al., 2021).
2. Automated Cleaning Pipeline: CAST
Manual cleaning at this scale is infeasible, so the dataset was processed via the Cleaning Automatically utilizing Self-Training (CAST) pipeline—an iterative, self-training, fully automatic approach. The process consists of the following:
- Teacher Initialization: A ResNet-100 model with ArcFace loss is trained on the cleaner MS1MV2 dataset (85,000 identities, 5.8 million faces).
- Intra-class Cleaning: For each identity, 512-dimensional feature embeddings are extracted using the teacher model. DBSCAN is then run per folder on cosine distance (two faces are neighbors if their distance is ≤ ε), and only the largest cluster is retained, provided it contains at least 3 faces. The threshold tightens across successive CAST passes, with the required cosine similarity 1 − ε set to 0.50, 0.55, and 0.60; the minimum cluster size n is fixed at 3.
- Inter-class Cleaning: Feature centers (means) of each folder’s embeddings are computed. For each pair of folders:
  - if the center cosine similarity is ≥ 0.7, the folders are merged;
  - if 0.5 ≤ similarity < 0.7, the smaller folder is removed.
- Self-training Iteration: A new “student” model is trained on cleaned data, then promoted to teacher, repeating the intra- and inter-class cleaning on the original WebFace260M set.
- Final Pruning: Duplicate face images within identities (cosine similarity > 0.95) are removed. Identities overlapping with test-set images (center similarity > 0.7) are also discarded.
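The intra-class step can be sketched in NumPy as follows. This is a minimal illustration, not the authors' implementation: `dbscan_cosine` and `keep_largest_cluster` are hypothetical helper names, the DBSCAN is a small textbook version written for readability, the neighborhood count is assumed to include the point itself, and the default `eps=0.5` corresponds to the first-pass requirement 1 − ε = 0.50.

```python
import numpy as np

def dbscan_cosine(emb, eps, min_samples=3):
    """Textbook DBSCAN on cosine distance; returns one label per row (-1 = noise)."""
    emb = np.asarray(emb, dtype=float)
    unit = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    dist = 1.0 - unit @ unit.T  # pairwise cosine distance
    neighbors = [np.flatnonzero(row <= eps) for row in dist]
    # Assumption: a point counts as its own neighbor.
    is_core = [len(nb) >= min_samples for nb in neighbors]
    labels = np.full(len(emb), -1)
    cluster = 0
    for seed in range(len(emb)):
        if labels[seed] != -1 or not is_core[seed]:
            continue
        labels[seed] = cluster
        stack = [seed]
        while stack:
            p = stack.pop()
            if not is_core[p]:
                continue  # border points join a cluster but do not expand it
            for q in neighbors[p]:
                if labels[q] == -1:
                    labels[q] = cluster
                    stack.append(q)
        cluster += 1
    return labels

def keep_largest_cluster(emb, eps=0.5, min_faces=3):
    """Indices of the faces kept for one identity folder (empty if none qualify)."""
    labels = dbscan_cosine(emb, eps, min_samples=min_faces)
    clustered = labels[labels >= 0]
    if clustered.size == 0:
        return np.array([], dtype=int)
    largest = np.bincount(clustered).argmax()
    kept = np.flatnonzero(labels == largest)
    return kept if kept.size >= min_faces else np.array([], dtype=int)

# Toy folder: four similar 3-d "embeddings" and one outlier (illustrative data only).
folder = np.array([
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.95, 0.0, 0.05],
    [1.0, 0.05, 0.05],
    [0.0, 1.0, 0.0],
])
kept_indices = keep_largest_cluster(folder)
```

In a real pipeline the rows would be 512-dimensional teacher embeddings, and a library DBSCAN would normally replace the hand-written loop.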
After three CAST iterations plus pruning, the pipeline converges to 2,059,906 identities and 42,474,558 faces, with estimated noise ≤ 10% (Zhu et al., 2022, Zhu et al., 2021).
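The inter-class rule for a pair of folders can be sketched as a small decision function. The names `cosine` and `inter_class_action` are hypothetical, and the tie-breaking choice when both folders have the same size is an assumption; the source only states the two similarity thresholds.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two vectors."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def inter_class_action(folder_a, folder_b, merge_thr=0.7, remove_thr=0.5):
    """Decide what to do with two identity folders given as arrays of embeddings."""
    sim = cosine(np.mean(folder_a, axis=0), np.mean(folder_b, axis=0))
    if sim >= merge_thr:
        return "merge"  # likely the same person under two names
    if sim >= remove_thr:
        # Ambiguous pair: drop the smaller folder (ties drop folder_b, an assumption).
        return "remove_a" if len(folder_a) < len(folder_b) else "remove_b"
    return "keep_both"

# Illustrative 2-d "embeddings", not real data.
big = np.array([[1.0, 0.0], [1.0, 0.1]])
same_person = np.array([[1.0, 0.05]])
ambiguous = np.array([[1.0, 1.5]])
different = np.array([[0.0, 1.0]])
```

Comparing every pair of roughly four million folders is quadratic, so a production version would presumably rely on approximate nearest-neighbor search over the centers; the sketch only shows the per-pair decision.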
3. Evaluation Protocol: FRUITS and Test Sets
To reflect real-world deployment, the Face Recognition Under Inference Time conStraint (FRUITS) protocol defines three tracks based on end-to-end CPU inference latency, executed on a single Intel Xeon E5-2630 v4 @ 2.2 GHz:
| Protocol | Latency Constraint | Typical Use Case |
|---|---|---|
| FRUITS-100 ms | ≤ 100 ms | Mobile/IoT |
| FRUITS-500 ms | ≤ 500 ms | Edge/on-premises |
| FRUITS-1000 ms | ≤ 1000 ms | Cloud/large model |
Each track covers the complete pipeline: detection, alignment, cropping, embedding, and matching. Time is measured at batch size 1 (2 if flip test is applied), so the protocol replaces FLOP-based comparison with measured latency (Zhu et al., 2022, Zhu et al., 2021).
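A latency check in the spirit of FRUITS might look like the sketch below. It is only an illustration: `measure_latency_ms` and `fruits_track` are hypothetical helpers, `pipeline` stands for any callable that runs the full detection-to-matching chain on one input, and the warm-up and averaging choices are assumptions rather than part of the protocol.

```python
import time

# Track budgets in milliseconds, as listed in the table above.
FRUITS_TRACKS = [("FRUITS-100", 100.0), ("FRUITS-500", 500.0), ("FRUITS-1000", 1000.0)]

def fruits_track(latency_ms):
    """Return the tightest track whose budget the latency satisfies, or None."""
    for name, budget_ms in FRUITS_TRACKS:
        if latency_ms <= budget_ms:
            return name
    return None

def measure_latency_ms(pipeline, sample, repeats=20, warmup=3):
    """Average wall-clock time of pipeline(sample) in milliseconds (batch size 1)."""
    for _ in range(warmup):
        pipeline(sample)  # warm caches before timing
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        pipeline(sample)
        timings.append((time.perf_counter() - start) * 1000.0)
    return sum(timings) / len(timings)
```

For example, `fruits_track(measure_latency_ms(my_pipeline, image))` would report which track a hypothetical `my_pipeline` qualifies for on the current machine; results are only comparable to the protocol when run on the reference CPU.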
A new, richly annotated test set was assembled, encompassing:
- Standard Face Recognition (SFR):
2,478 identities, 57,715 faces; metrics include FNMR at fixed FMR values, with sub-protocols for large age gaps and varying scenarios (controlled, wild, cross-scene).
- Masked Face Recognition (MFR):
862 identities with 3,211 real masked faces, enabling evaluation of pre- and post-pandemic recognition performance.
- Unbiased Face Recognition (UFR):
The SFR identities are partitioned by race (Caucasian, East-Asian, African, Others) and gender (male/female), with metrics including Skewed Error Ratio (SER) and groupwise FNMR standard deviation for fairness assessment.
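The two fairness summaries can be computed from per-group error rates as sketched below. The definition of SER as the ratio of the worst to the best group error follows common usage in the fairness literature and is assumed here, as is the use of the population standard deviation; the group error values are made up for illustration and are not results from the paper.

```python
from statistics import pstdev

def skewed_error_ratio(group_errors):
    """SER: highest group error divided by lowest group error (assumed definition)."""
    errors = list(group_errors.values())
    if min(errors) <= 0:
        raise ValueError("SER is undefined when a group has zero error")
    return max(errors) / min(errors)

def groupwise_std(group_errors):
    """Population standard deviation of the per-group error rates (assumption)."""
    return pstdev(list(group_errors.values()))

# Hypothetical FNMR values per race group, for illustration only.
example_fnmr = {"Caucasian": 0.02, "East-Asian": 0.04, "African": 0.03, "Others": 0.03}
example_ser = skewed_error_ratio(example_fnmr)
example_std = groupwise_std(example_fnmr)
```

A perfectly balanced model would have an SER of 1 and a standard deviation of 0, so lower is better for both.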
4. Baseline Architectures, Distributed Training, and Performance Results
WebFace42M supports evaluation of a spectrum of model sizes and architectures:
- Lightweight (FRUITS-100):
ResNet-14, MobileFaceNet, EfficientNet-B0, RegNet-800MF, with RetinaFace-M0.25 for detection/alignment.
- Mid-weight (FRUITS-500):
ResNet-50/100, SENet-50, ResNeXt-100, RegNet-8GF, with RetinaFace-R50 detection/alignment.
- Heavyweight (FRUITS-1000):
ResNet-100/200 (with flip-test), SENet-152, AttentionNet-152, RegNet-16GF.
Training full-scale networks is made feasible via synchronous data-parallel SGD, mixed-precision (FP16), and parallel communication on clusters (up to 256 GPUs with ≥80% scaling). For instance, using 32 nodes (256 GPUs), ResNet-100+ArcFace can be trained on WebFace42M in ~9 hours at 25,300 samples/s (Zhu et al., 2022, Zhu et al., 2021).
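As a rough sanity check on these figures, the quoted throughput and wall-clock time can be turned into time per pass over the data. The epoch count below is not stated in the source; it is only what the numbers imply if the throughput is sustained and the ~9 hours is pure training time.

```python
IMAGES = 42_474_558          # WebFace42M faces
THROUGHPUT = 25_300          # samples per second on 256 GPUs (quoted)
GPUS = 256
WALL_CLOCK_HOURS = 9         # quoted training time

seconds_per_epoch = IMAGES / THROUGHPUT
implied_epochs = WALL_CLOCK_HOURS * 3600 / seconds_per_epoch
per_gpu_throughput = THROUGHPUT / GPUS

print(f"one pass over the data: {seconds_per_epoch / 60:.1f} min")
print(f"passes that fit in {WALL_CLOCK_HOURS} h: {implied_epochs:.1f}")
print(f"per-GPU throughput: {per_gpu_throughput:.1f} samples/s")
```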
Critical performance benchmarks are as follows:
- IJB-C (ResNet-100+ArcFace):
Training with WebFace42M yields TAR@FAR=10⁻⁴ ≈ 97.70%, a relative error reduction of roughly 40% over the MS1MV2-trained baseline (~96.03%).
- NIST-FRVT (FRUITS-1000 track):
RetinaFace-R50 + ResNet-200 (flip) trained on WebFace42M ranked 3rd of 430 entries; e.g., mugshot FNMR@FMR=10⁻⁵ = 0.27%, cross-age ≥12 yrs = 0.28%.
- Masked Face Recognition (All-masked, ResNet-100+ArcFace):
FNMR@FMR=10⁻⁵ = 47.25%; simulated mask augmentation lowers this to 33.87%.
- UFR fairness (SER₍race₎, STD₍race₎ for ResNet-100):
Balanced subsampling further improves fairness, with SER₍race₎ dropping from 1.40 (MS1MV2) to 1.28 (WebFace4M), and STD₍race₎ from 0.0199 to 0.0121.
Notably, subsets representing only 10% (≈4 million faces) or 30% of WebFace42M outperformed all prior public training datasets on LFW, MegaFace, IJB-C, and the new WebFace test set.
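The relative error reduction quoted for IJB-C can be reproduced from the two TAR values. The helper name `relative_error_reduction` is illustrative; the inputs are the rounded figures given above, so the result is approximate.

```python
def relative_error_reduction(baseline_acc, new_acc):
    """Fraction of the baseline error removed by the new model."""
    baseline_err = 1.0 - baseline_acc
    new_err = 1.0 - new_acc
    return (baseline_err - new_err) / baseline_err

# IJB-C TAR@FAR=1e-4: MS1MV2 baseline vs. WebFace42M (rounded values from the text).
ijbc_reduction = relative_error_reduction(0.9603, 0.9770)
print(f"relative error reduction: {ijbc_reduction:.1%}")
```

With these rounded inputs the error falls from 3.97% to 2.30%, a reduction of about 42%, in line with the roughly 40% figure quoted.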
5. Face Recognition Objective Functions and Evaluation Metrics
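WebFace42M’s training standard employs ArcFace loss, which encourages discriminative angular margins between identity clusters:
$$L_{\text{ArcFace}} = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{e^{s\cos(\theta_{y_i}+m)}}{e^{s\cos(\theta_{y_i}+m)}+\sum_{j\neq y_i}e^{s\cos\theta_j}}$$
with $s$ as the feature-norm scale (e.g., 64) and $m$ as the angular margin (0.5 radians). Here $\theta_j$ is the angle between the normalized feature of sample $i$ and the normalized weight vector of class $j$, and $y_i$ is the ground-truth class of sample $i$. Clustering via DBSCAN operates on cosine distances: two faces $f_i$ and $f_j$ are clustered if $1-\cos(f_i, f_j)\le\varepsilon$, with a minimum neighborhood size $n = 3$.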
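A forward-pass sketch of the ArcFace objective in NumPy is given below, with `s` denoting the feature scale and `m` the additive angular margin. It is a simplified illustration, not a training-ready implementation: the function name and toy inputs are assumptions, there is no gradient computation, and the special handling that practical implementations apply when the target angle plus margin exceeds π is omitted.

```python
import numpy as np

def arcface_loss(embeddings, class_weights, labels, s=64.0, m=0.5):
    """Mean ArcFace loss for a batch (forward pass only)."""
    x = np.asarray(embeddings, dtype=float)
    w = np.asarray(class_weights, dtype=float)   # one row per identity
    labels = np.asarray(labels)
    x = x / np.linalg.norm(x, axis=1, keepdims=True)
    w = w / np.linalg.norm(w, axis=1, keepdims=True)
    cos = np.clip(x @ w.T, -1.0, 1.0)            # cos(theta_j) for every class
    rows = np.arange(len(labels))
    logits = s * cos
    # Add the angular margin to the ground-truth class only.
    # Simplification: assumes theta + m stays below pi.
    target_theta = np.arccos(cos[rows, labels])
    logits[rows, labels] = s * np.cos(target_theta + m)
    # Numerically stable log-softmax.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-log_prob[rows, labels].mean())

# Toy batch: two samples, two identities (illustrative values only).
demo_x = np.array([[1.0, 0.0], [0.0, 1.0]])
demo_w = np.array([[1.0, 0.1], [0.1, 1.0]])
demo_y = np.array([0, 1])
loss_no_margin = arcface_loss(demo_x, demo_w, demo_y, s=4.0, m=0.0)
loss_with_margin = arcface_loss(demo_x, demo_w, demo_y, s=4.0, m=0.5)
```

With `m=0` the function reduces to softmax cross-entropy on scaled cosines; a positive margin makes the same batch harder, which is what pushes identities apart during training. The small scale `s=4.0` in the toy call is chosen only so the two losses are visibly different in floating point.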
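Verification metrics include False Match Rate (FMR) and False Non-Match Rate (FNMR). For a similarity threshold $t$ they are defined as
$$\mathrm{FMR}(t)=\frac{\#\{\text{impostor pairs with score} > t\}}{\#\{\text{impostor pairs}\}},\qquad \mathrm{FNMR}(t)=\frac{\#\{\text{genuine pairs with score} \le t\}}{\#\{\text{genuine pairs}\}}$$
and are typically reported at a fixed operating point, for example FNMR@FMR = $10^{-5}$.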
Performance and fairness are thus evaluated with these metrics at industry-standard operating points (Zhu et al., 2022, Zhu et al., 2021).
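Reporting FNMR at a fixed FMR amounts to choosing the threshold from the impostor scores and then counting the genuine pairs that fall at or below it. The sketch below assumes a pair is declared a match when its score is strictly greater than the threshold; the function name and the toy scores are illustrative.

```python
def fnmr_at_fmr(genuine_scores, impostor_scores, target_fmr):
    """Return (fnmr, achieved_fmr, threshold) at the given FMR operating point."""
    impostors = sorted(impostor_scores, reverse=True)
    # Number of false matches we may accept without exceeding the target FMR.
    allowed = int(target_fmr * len(impostors) + 1e-9)
    if allowed < len(impostors):
        threshold = impostors[allowed]
    else:
        threshold = float("-inf")  # every impostor pair may match
    fmr = sum(score > threshold for score in impostors) / len(impostors)
    fnmr = sum(score <= threshold for score in genuine_scores) / len(genuine_scores)
    return fnmr, fmr, threshold

# Toy similarity scores (illustrative only).
toy_genuine = [0.35, 0.5, 0.6, 0.9]
toy_impostor = [0.1, 0.2, 0.3, 0.4]
strict_point = fnmr_at_fmr(toy_genuine, toy_impostor, target_fmr=0.0)
loose_point = fnmr_at_fmr(toy_genuine, toy_impostor, target_fmr=0.25)
```

Operating points such as FMR = 10⁻⁵ need at least hundreds of thousands of impostor pairs to be measured meaningfully, which is one reason large test sets matter.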
6. Access, Usage, and Research Impact
WebFace42M and WebFace260M are publicly accessible for academic research via a research-only license. Access requires agreement not to redistribute raw images and to cite the original papers. Best-practice recommendations include:
- Pre-processing with RetinaFace for detection/alignment.
- Removal of any test protocol overlap via center similarity (threshold > 0.7).
- Training with ArcFace (standard: scale s = 64, margin m = 0.5); alternative losses (CosFace, CurricularFace) are also compatible.
- Distributed training with mixed-precision and parallel communication up to 256 GPUs.
- Evaluation using FNMR at fixed FMR thresholds on the FRUITS test set for all three time-constrained settings.
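The overlap-removal recommendation can be sketched as a filter on training identity centers. This is an illustration under assumptions: the source does not specify whether training centers are compared against test identity centers or individual test embeddings (the sketch accepts either as rows of `test_embeddings`), and the function names are hypothetical.

```python
import numpy as np

def unit_rows(x):
    """L2-normalize each row of a 2-d array."""
    x = np.asarray(x, dtype=float)
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def non_overlapping_identities(train_centers, test_embeddings, threshold=0.7):
    """Indices of training identities whose center matches no test embedding."""
    sims = unit_rows(train_centers) @ unit_rows(test_embeddings).T
    overlaps = (sims > threshold).any(axis=1)  # similarity > 0.7 counts as overlap
    return np.flatnonzero(~overlaps)

# Toy 2-d centers (illustrative only): identities 0 and 2 resemble the test face.
toy_train_centers = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
toy_test = np.array([[1.0, 0.1]])
kept_identities = non_overlapping_identities(toy_train_centers, toy_test)
```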
WebFace42M is the largest cleaned public face recognition training set to date and is explicitly constructed to close the data gap between academic and industrial research. Its automated, scalable CAST pipeline, together with high-fidelity test protocols and a broad attribute test set, provides a unified, rigorous benchmark for face recognition systems operating at scale (Zhu et al., 2022, Zhu et al., 2021).