WIDER FACE: Large-Scale Face Detection Dataset

Updated 17 May 2026

WIDER FACE is a large-scale face detection benchmark featuring 32,203 images and 393,703 labeled faces, addressing challenges like scale, occlusion, and pose variations.
The dataset offers detailed annotations including tight bounding boxes, occlusion levels, pose information, and event categories, enabling fine-grained performance evaluation.
It establishes rigorous evaluation protocols and baseline comparisons, driving research in multi-scale detection, occlusion handling, and pose-aware methods.

WIDER FACE is a large-scale face detection benchmark designed to address deficiencies in previous datasets and capture the challenges of real-world conditions. It comprises 32,203 images with 393,703 labeled face bounding boxes spanning 60 "event" categories, exhibiting substantial variations in scale, pose, and occlusion. Its scale, annotation richness, and systematic benchmarks have established it as a standard for evaluating and developing methods robust to diverse facial appearances and contexts (Yang et al., 2015).

1. Dataset Structure and Composition

WIDER FACE includes images collected from 60 real-world event categories defined according to the Large-Scale Concept Ontology for Multimedia (LSCOM), such as Parade, Basketball, Funeral, and Riot. Image collection was performed by querying large web image search engines (Google, Bing) to obtain 1,000–3,000 images per event, followed by manual curation to remove images without faces and eliminate near duplicates to maximize diversity.

Each image contains one or more faces annotated by bounding boxes, tightly enclosing the forehead, chin, and cheeks, with coordinates $\left[x_1, y_1, x_2, y_2\right]$ . Boxes where face height is $\leq 10$ pixels are flagged as "Ignore" due to labeling ambiguity. Faces are further labeled for three categorical attributes:

Occlusion: None, Partial (1–30% occluded area), Heavy (>30% occlusion)
Pose: Typical (within $|$ roll $| \leq 30^\circ$ , $|$ pitch $| \leq 30^\circ$ , $|$ yaw $| \leq 90^\circ$ ) or Atypical (outside these ranges)
Event category: Explicit event label from the LSCOM list

All annotations were performed by a primary annotator and subsequently cross-checked by two additional annotators to enforce consistency.
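The attribute definitions above can be captured in a small record type. The following is a minimal sketch, not the dataset's official file format: the names `FaceAnnotation`, `occlusion_level`, and `is_typical_pose` are hypothetical, and treating anything below 1% occluded area as "None" is an assumption about how the 1–30% band is bounded.

```python
from dataclasses import dataclass
from enum import Enum


class Occlusion(Enum):
    NONE = "none"
    PARTIAL = "partial"
    HEAVY = "heavy"


def occlusion_level(occluded_fraction: float) -> Occlusion:
    """Map an occluded-area fraction in [0, 1] to the three levels."""
    if not 0.0 <= occluded_fraction <= 1.0:
        raise ValueError("occluded_fraction must lie in [0, 1]")
    if occluded_fraction > 0.30:
        return Occlusion.HEAVY
    # Assumption: fractions below 1% count as unoccluded.
    if occluded_fraction >= 0.01:
        return Occlusion.PARTIAL
    return Occlusion.NONE


def is_typical_pose(roll_deg: float, pitch_deg: float, yaw_deg: float) -> bool:
    """Typical pose: |roll| <= 30, |pitch| <= 30, |yaw| <= 90 (degrees)."""
    return abs(roll_deg) <= 30 and abs(pitch_deg) <= 30 and abs(yaw_deg) <= 90


@dataclass(frozen=True)
class FaceAnnotation:
    # Box corners in pixels, following the [x1, y1, x2, y2] convention.
    x1: float
    y1: float
    x2: float
    y2: float
    occlusion: Occlusion
    typical_pose: bool
    event: str

    @property
    def height(self) -> float:
        return self.y2 - self.y1

    @property
    def ignore(self) -> bool:
        # Faces of height <= 10 px are flagged "Ignore".
        return self.height <= 10
```

For example, `FaceAnnotation(0, 0, 8, 9, Occlusion.NONE, True, "Parade").ignore` evaluates to `True` because the box is only 9 pixels tall.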

2. Statistical Analysis: Faces, Scale, and Variation

The dataset reflects the complexity of unconstrained face detection. Distribution statistics include:

Attribute	Categories / Ranges	Proportion / Value
Total images	—	32,203
Total faces	—	393,703
Event categories	60 (LSCOM-based)	Full list in appendix
Scale distribution	Small: 10–50 px	≈50% of faces
	Medium: 50–300 px	≈43%
	Large: >300 px	≈7%
Occlusion	None / Partial / Heavy	All explicitly labeled
Pose	Typical / Atypical	Defined by roll/pitch/yaw
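
The scale bins in the table can be reproduced from face heights with a few comparisons. This is an illustrative sketch: `scale_bin` and `scale_histogram` are hypothetical helper names, and assigning exactly 50 px to "medium" is an assumption, since the table leaves that boundary ambiguous.

```python
from collections import Counter


def scale_bin(height_px: float) -> str:
    """Assign a face height in pixels to a scale bin."""
    if height_px <= 10:
        return "ignore"   # too small to label reliably
    if height_px < 50:
        return "small"
    if height_px <= 300:
        return "medium"
    return "large"


def scale_histogram(heights):
    """Return the fraction of faces falling in each scale bin."""
    counts = Counter(scale_bin(h) for h in heights)
    total = sum(counts.values())
    return {name: count / total for name, count in counts.items()}
```

Calling `scale_histogram([20, 30, 100, 400])` gives half the faces in "small" and a quarter each in "medium" and "large".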

The data is further split randomly per event class into 40% train, 10% validation, and 50% test, resulting in:

Training (train + val): ≈16,000 images, ≈199,000 faces
Testing: ≈16,000 images, ≈194,000 faces
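
A per-event random split of this kind can be sketched as follows. The function name `split_per_event`, the fixed seed, and the rounding rule are assumptions made for illustration; the official split lists are distributed with the dataset and should be used for any benchmark comparison.

```python
import random
from collections import defaultdict


def split_per_event(image_events, seed=0, fractions=(0.4, 0.1, 0.5)):
    """Randomly split images into train/val/test within each event class.

    image_events: dict mapping image id -> event name.
    Returns a dict with keys "train", "val", "test" mapping to id lists.
    """
    rng = random.Random(seed)
    by_event = defaultdict(list)
    for image_id, event in sorted(image_events.items()):
        by_event[event].append(image_id)

    splits = {"train": [], "val": [], "test": []}
    for event in sorted(by_event):
        ids = by_event[event]
        rng.shuffle(ids)
        n_train = round(fractions[0] * len(ids))
        n_val = round(fractions[1] * len(ids))
        splits["train"] += ids[:n_train]
        splits["val"] += ids[n_train:n_train + n_val]
        # Whatever remains goes to test, so every image lands in one split.
        splits["test"] += ids[n_train + n_val:]
    return splits
```

Splitting within each event, rather than over the pooled images, keeps the event distribution the same across the three subsets.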

This yields a dataset roughly an order of magnitude larger than preceding face detection benchmarks.

3. Annotation Protocols and Data Collection Methods

The data collection workflow comprises three main stages:

Retrieval: 1,000–3,000 images per event from Google/Bing using 60 LSCOM event queries.
Manual Curation: Eliminating images without any identifiable human face.
De-duplication: Removing near-duplicate frames to ensure appearance diversity.

All recognizable faces larger than approximately 10 pixels in height are annotated. Images are not preprocessed beyond this manual filtering and annotation procedure. Every annotated face is associated with the three major attribute labels (occlusion, pose, event). The annotation protocol establishes a "tight" bounding box target, which increases the challenge associated with occlusions and pose variability.

4. Evaluation Protocol and Benchmark Metrics

WIDER FACE defines a rigorous evaluation scheme based on standard object detection measures:

Bounding box matching via Intersection over Union (IoU): prediction $B_{pred}$ matches ground-truth $B_{gt}$ if

$\mathrm{IoU}(B_{pred}, B_{gt}) = \frac{|B_{pred} \cap B_{gt}|}{|B_{pred} \cup B_{gt}|} \geq 0.5$

Precision/Recall/Average Precision (AP): AP is the area under the precision-recall curve $p(r)$, computed as

$\mathrm{AP} = \int_0^1 p(r)\, dr$

mean Average Precision (mAP): With a single "face" class, mAP reduces to AP.
ROC curves: Reported as true positive rate versus false positives per image, in parallel to the FDDB protocol.
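
The IoU matching rule and the AP integral can be sketched in a few lines of plain Python. This is a simplified illustration, not the official evaluation toolkit: the function names are hypothetical, matching is a greedy score-ordered assignment in the style of PASCAL VOC, the precision curve is replaced by its monotone envelope before summing, and the handling of "Ignore" boxes is omitted.

```python
def iou(a, b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def average_precision(detections, ground_truth, iou_threshold=0.5):
    """Approximate AP as the area under the precision-recall curve.

    detections: list of (image_id, score, box) tuples.
    ground_truth: dict mapping image_id -> list of boxes.
    """
    n_gt = sum(len(boxes) for boxes in ground_truth.values())
    if n_gt == 0:
        return 0.0
    matched = {img: [False] * len(boxes) for img, boxes in ground_truth.items()}

    # Visit detections from most to least confident; each ground-truth box
    # may be claimed by at most one detection.
    tp_flags = []
    for image_id, _score, box in sorted(detections, key=lambda d: d[1], reverse=True):
        best_iou, best_idx = 0.0, -1
        for idx, gt_box in enumerate(ground_truth.get(image_id, [])):
            if matched[image_id][idx]:
                continue
            overlap = iou(box, gt_box)
            if overlap > best_iou:
                best_iou, best_idx = overlap, idx
        is_tp = best_idx >= 0 and best_iou >= iou_threshold
        if is_tp:
            matched[image_id][best_idx] = True
        tp_flags.append(is_tp)

    # Precision and recall after each detection in ranked order.
    precisions, recalls, tp = [], [], 0
    for rank, is_tp in enumerate(tp_flags, start=1):
        tp += is_tp
        precisions.append(tp / rank)
        recalls.append(tp / n_gt)

    # Monotone envelope: precision at recall r is the best precision at any recall >= r.
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])

    # Sum rectangles under the curve, a discrete form of the integral of p(r).
    ap, prev_recall = 0.0, 0.0
    for p, r in zip(precisions, recalls):
        ap += p * (r - prev_recall)
        prev_recall = r
    return ap
```

With a single "face" class, the value returned here is also the mAP.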

Evaluation is stratified by difficulty, based on recall with EdgeBox proposals (8,000 per image):

Easy: $\approx$ 92% avg recall
Medium: $\approx$ 76%
Hard: $\approx$ 34%

AP curves are reported for each difficulty subset, as well as broken down by scale, occlusion, and pose. This enables fine-grained diagnosis of detector shortcomings.
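
A breakdown of this kind amounts to grouping faces by an attribute and scoring each group separately. The sketch below uses per-group recall as a simpler stand-in for per-group AP; the function name `recall_by_group` and the record layout (a dict per ground-truth face with a boolean `detected` field) are assumptions for illustration.

```python
from collections import defaultdict


def recall_by_group(records, key):
    """Fraction of ground-truth faces detected, per value of one attribute.

    records: iterable of dicts, one per ground-truth face, each holding
    attribute fields (e.g. "occlusion", "pose") and a boolean "detected".
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for rec in records:
        group = rec[key]
        totals[group] += 1
        hits[group] += bool(rec["detected"])
    return {group: hits[group] / totals[group] for group in totals}
```

Running the same function with `key="occlusion"`, `key="pose"`, or a scale-bin field shows which attribute values a detector misses most often.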

5. Baseline Detection Systems and Benchmark Results

Representative off-the-shelf face detectors were assessed on WIDER FACE:

Viola-Jones (VJ)
Deformable Part Model (DPM, "HeadHunter")
Aggregate Channel Features (ACF-multiscale)
Faceness

Benchmark performance on the test subset is summarized as:

Detector	Easy / Medium / Hard AP (%)	Large Faces AP (%)	Medium Faces AP (%)	Small Faces AP (%)
VJ	∼50 / 35 / 20	≥80 (all)	—	<12 (all detectors)
DPM	∼65 / 50 / 28	≥80 (all)	—	<12 (all detectors)
ACF	∼70 / 55 / 25	≥80 (all)	—	<12 (all detectors)
Faceness	∼75 / 60 / 30	≈90	≈70 (Faceness best)	<12 (all detectors)

Other breakdowns highlight major challenges:

Partial occlusion (≥30 px faces): best AP ~26.5% (Faceness)
Heavy occlusion: best AP ~14.4% (Faceness)
Atypical pose: recall drops below 20% for all detectors
Event-level: AP ranges from ~35% (most difficult, e.g., Festival/Parade) to >85% (easiest, e.g., Surgeons/Spa) when retrained on WIDER FACE

The authors also introduced a multi-scale two-stage Cascade CNN, comprising:

Proposal stage: Four fully-convolutional networks, each tailored to a specific input scale, trained with joint face/no-face plus scale classification.
Refinement stage: Per-scale CNNs refine proposals, jointly performing face/no-face classification (at IoU $\geq$ 0.5) and bounding-box regression.

On the "hard" subset, the Cascade CNN improved AP by approximately 8.5% over the best retrained Faceness baseline, demonstrating the benefit of explicit scale-specialized architectures.

6. Failure Modes and Research Challenges

Persistent failure cases identified in benchmark analysis include:

Small faces (<50 px): AP remains below ∼12% for all detectors.
Heavy occlusion (>30%): Best AP is only ~14%.
Extreme pose ( $|$ yaw $| > 90^\circ$ , $|$ roll $|$ or $|$ pitch $| > 30^\circ$ ): Recall drops below 20%.
Blur, low resolution (<10 px): Such cases are marked "Ignore."
Complex backgrounds and rare event categories: For events such as Festival and Parade, per-class AP remains below 50%.

These observations reveal the substantial headroom for improvement in unconstrained face detection tasks.

7. Recommended Research Directions

The dataset authors enumerate multiple open problems and avenues for advancement:

Improved small-face representations, including super-resolution and scale-aware feature pyramids.
Part-based or occlusion-robust models that explicitly model visible versus occluded facial landmarks.
Pose-aware architectures capable of handling 360° yaw.
Enhanced hard-negative mining and multi-task learning (e.g., simultaneous face detection and alignment).
Exploiting available event/context metadata to inform detection, potentially enabling scene-guided priors.

The combination of large scale, granular annotations, and challenging real-world variability establishes WIDER FACE as a principal resource for advancing face detection methodologies, particularly on hard, occluded, and unconstrained cases (Yang et al., 2015).


References

Yang, S., Luo, P., Loy, C. C., Tang, X. (2015). WIDER FACE: A Face Detection Benchmark.
