
MultiID-2M: A Multi-Person Image Dataset

Updated 18 October 2025
  • MultiID-2M is a large-scale multi-person image dataset designed to mitigate copy-paste artifacts in identity-conditioned image generation through its paired and unpaired photo collections.
  • The dataset employs ArcFace-based similarity and clustering to annotate over 25,000 unique identities, facilitating robust contrastive learning and varied image synthesis.
  • It is complemented by MultiID-Bench, an evaluation suite that uses metrics like Sim(GT) and M_CP to quantitatively assess identity fidelity and artifact minimization.

MultiID-2M is a large-scale, paired and unpaired multi-person image dataset constructed to address a key limitation of identity-conditioned image generation models: the tendency to produce copy-paste artifacts in text-to-image tasks. The dataset comprises group photographs with annotated celebrity identities and diverse reference images, providing a foundation for benchmarking and for training mechanisms that promote identity consistency and variation in generated images.

1. Dataset Structure and Acquisition

MultiID-2M consists of two principal components:

  • Paired multi-ID images: 500,000 group photos, each containing between one and five recognized celebrity identities. For every face, the dataset supplies identity labels and cross-checked reference images automatically assigned using ArcFace-based similarity and clustering.
  • Unpaired multi-ID images: 1,500,000 additional group photographs for reconstruction-based training, lacking explicit reference matches but permitting supervised learning under broader conditions.

The dataset leverages a preliminary single-ID reference bank, assembled by querying for approximately 1 million single-person celebrity web images. After ArcFace clustering, around 3,000 unique identities are robustly represented, each typically with hundreds of reference images (roughly 400 per identity on average during initial clustering). For group-photo mining, complex queries combine celebrity names, event descriptors, and enumeration, with results further filtered by keyword exclusion.

Identity assignment in group photos uses ArcFace embeddings and cosine similarity (threshold ~0.4–0.5) for cluster matching, followed by post-processing pipelines including aesthetic scoring and watermark removal. In total, the dataset references ~25,000 unique identities, though the paired subset emphasizes the most frequently occurring ones.
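
As an illustration of this matching step, the following is a minimal sketch assuming pre-computed, L2-normalized ArcFace embeddings; the function name, array shapes, and threshold default are illustrative assumptions, not the dataset's released tooling.

```python
import numpy as np

def assign_identities(face_embeddings, cluster_centroids, threshold=0.45):
    """Match detected faces in a group photo to identity clusters.

    face_embeddings:   (F, 512) L2-normalized ArcFace embeddings of detected faces
    cluster_centroids: (C, 512) L2-normalized centroids of the single-ID reference bank
    threshold:         cosine-similarity cutoff (the text reports ~0.4-0.5)
    Returns the matched cluster index per face, or None if no cluster passes the threshold.
    """
    # For L2-normalized vectors, cosine similarity reduces to a dot product.
    sims = face_embeddings @ cluster_centroids.T          # (F, C)
    best = sims.argmax(axis=1)                            # closest cluster per face
    return [int(c) if sims[i, c] >= threshold else None
            for i, c in enumerate(best)]

# Example with random stand-in embeddings.
faces = np.random.randn(3, 512); faces /= np.linalg.norm(faces, axis=1, keepdims=True)
bank = np.random.randn(3000, 512); bank /= np.linalg.norm(bank, axis=1, keepdims=True)
print(assign_identities(faces, bank))
```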

2. Motivations and Design Principles

Prior state-of-the-art text-to-image models conditioned on identity have relied primarily on single-person reconstruction datasets. This typically results in "copy-paste" failure modes, where generated images overfit to the given reference face and show little variation in pose, expression, or lighting. MultiID-2M counters this by providing multiple reference images for each identity within diverse group contexts.

Such variation across pose, illumination, accessories, and expression supports the learning of robust, distributed identity representations. This suggests that training on MultiID-2M allows models to exhibit greater fidelity to identity while naturally incorporating the diversity found in real-world photographs. The dataset's paired nature additionally enables training regimes beyond simple pixel-wise reconstruction, including contrastive identity learning.

3. Benchmarking and Evaluation Protocols

Coinciding with MultiID-2M, the work introduces MultiID-Bench, an evaluation suite for quantitative and qualitative analysis of multi-identity generation. Each test case comprises a ground-truth image containing one to four persons, identity-specific reference images, and descriptive caption prompts.

Metrics include:

  • Ground-truth similarity (Sim(GT)): Cosine similarity of ArcFace embeddings between generated and ground-truth faces.

\mathrm{Sim}(a, b) = \frac{a^T b}{\|a\| \cdot \|b\|}

  • Copy-paste metric (M_{CP}): Evaluates angular bias of the generated face towards the reference over the ground truth.

M_{CP}(g \mid t, r) = \frac{\theta(t, g) - \theta(r, g)}{\max(\theta(t, r), \varepsilon)}

with \theta(a, b) = \arccos(\mathrm{Sim}(a, b)), where r is the reference, t the ground-truth, and g the generated face embedding, and \varepsilon a small constant for numerical stability.

A high M_{CP} value signals that the generated face is nearly a direct copy of the reference, an undesirable outcome. The benchmark visualizes the correlation and trade-off between identity fidelity and variation, revealing mode-collapse phenomena in other models. Notably, WithAnyone achieves both high Sim(GT) and low M_{CP}.
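
Both metrics can be computed directly from ArcFace embeddings. The sketch below follows the definitions above; the function names and the \varepsilon default are chosen for illustration.

```python
import numpy as np

def cos_sim(a, b):
    """Cosine similarity Sim(a, b) between two embedding vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def angle(a, b):
    """Angular distance theta(a, b) = arccos(Sim(a, b)); clip guards against rounding error."""
    return float(np.arccos(np.clip(cos_sim(a, b), -1.0, 1.0)))

def sim_gt(g, t):
    """Ground-truth similarity Sim(GT) between generated (g) and ground-truth (t) embeddings."""
    return cos_sim(g, t)

def m_cp(g, t, r, eps=1e-6):
    """Copy-paste metric: positive when the generated face leans toward the reference r
    rather than the ground truth t; values near 1 indicate a near-verbatim copy of r."""
    return (angle(t, g) - angle(r, g)) / max(angle(t, r), eps)
```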

4. Training Methodology Enabled by MultiID-2M

MultiID-2M supports a four-phase training paradigm:

  1. Reconstruction Pre-Training: The model learns identity embeddings by reconstructing images under fixed prompts, spanning the entire dataset.
  2. Captioned Pre-Training: Incorporates natural-language captions alongside visual data, aligning identity embedding with prompt semantics.
  3. Paired Tuning: Trains on distinct image pairs of the same identity, preventing pixel copying and promoting abstract identity learning.
  4. Quality Tuning: Refines results on curated high-quality images, optionally incorporating stylized augmentations.
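
A hypothetical way to express this schedule in code is sketched below; the phase names mirror the list above, the data labels are paraphrases, and details such as step counts or learning rates are deliberately omitted because they are not specified here.

```python
# Hypothetical phase schedule mirroring the four stages above; field values are
# descriptive labels only, not configuration taken from the paper.
TRAINING_PHASES = [
    {"phase": "reconstruction_pretrain", "data": "paired + unpaired", "prompts": "fixed"},
    {"phase": "captioned_pretrain", "data": "paired + unpaired", "prompts": "natural-language captions"},
    {"phase": "paired_tuning", "data": "distinct image pairs of the same identity", "prompts": "natural-language captions"},
    {"phase": "quality_tuning", "data": "curated high-quality subset (optionally stylized)", "prompts": "natural-language captions"},
]

for p in TRAINING_PHASES:
    print(f"{p['phase']}: data={p['data']}, prompts={p['prompts']}")
```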

Key losses:

  • ID Contrastive Loss (InfoNCE style):

L_{CL} = -\log \left[ \frac{\exp(\cos(g, r)/\tau)}{\sum_j \exp(\cos(g, n_j)/\tau)} \right]

where g is the generated face embedding, r the reference embedding, n_j negative embeddings drawn from other identities, and \tau a temperature hyperparameter.
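
A minimal PyTorch sketch of this contrastive term follows; embedding shapes and the temperature default are assumptions, and the positive pair is included in the softmax denominator per the usual InfoNCE convention.

```python
import torch
import torch.nn.functional as F

def id_contrastive_loss(g, r, negatives, tau=0.07):
    """InfoNCE-style loss: pull the generated embedding g toward its reference r
    and push it away from embeddings of other identities.

    g:         (D,)   generated face embedding
    r:         (D,)   reference embedding of the same identity (positive)
    negatives: (N, D) embeddings of other identities
    tau:       temperature (0.07 is a common default, not a value from the paper)
    """
    g = F.normalize(g, dim=-1)
    candidates = F.normalize(torch.cat([r.unsqueeze(0), negatives], dim=0), dim=-1)
    logits = (candidates @ g) / tau            # cosine similarities scaled by temperature
    # The positive sits at index 0; cross-entropy yields -log softmax at that index.
    return F.cross_entropy(logits.unsqueeze(0), torch.zeros(1, dtype=torch.long))
```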

  • GT-Aligned Identity Loss:

L_{ID} = 1 - \cos\big( f(x_{gen}), f(x_{gt}) \big)

where f(\cdot) is the ArcFace extractor and x_{gen}, x_{gt} are the generated and ground-truth faces; features are extracted from the generated image after alignment with the ground-truth facial landmarks.
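
A sketch of this loss, assuming a callable pretrained ArcFace extractor and face crops already warped to the ground-truth landmarks (the alignment step itself is abstracted away):

```python
import torch
import torch.nn.functional as F

def gt_aligned_id_loss(x_gen_aligned, x_gt_aligned, arcface):
    """1 - cosine similarity between ArcFace features of generated and ground-truth faces.

    x_gen_aligned / x_gt_aligned: face crops already aligned to the ground-truth landmarks
    arcface: a pretrained ArcFace feature extractor returning (B, 512) embeddings
    """
    f_gen = F.normalize(arcface(x_gen_aligned), dim=-1)
    f_gt = F.normalize(arcface(x_gt_aligned), dim=-1)
    return 1.0 - (f_gen * f_gt).sum(dim=-1).mean()
```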

  • Diffusion Loss:

L_{diff} = \| v_\theta(x_t, t, c) - (x_1 - x_0) \|^2

where v_\theta(x_t, t, c) predicts the velocity x_1 - x_0 between the true image x_0 and its fully-noised counterpart x_1, given the noisy sample x_t, timestep t, and conditioning c.
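
A sketch of this objective under a rectified-flow style interpolation; the interpolation form is an assumption, and only the velocity target x_1 - x_0 is taken from the formula above.

```python
import torch

def diffusion_loss(v_theta, x0, x1, t, cond):
    """Velocity-prediction loss with target x1 - x0.

    x0:   true images (B, C, H, W); x1: noise samples of the same shape (assumption)
    t:    per-sample timesteps in [0, 1], shape (B,)
    cond: conditioning inputs (identity embeddings, caption tokens, ...)
    v_theta: the velocity-prediction network (a placeholder callable)
    """
    t_ = t.view(-1, 1, 1, 1)
    x_t = (1.0 - t_) * x0 + t_ * x1      # linear interpolation between image and noise (assumed)
    target = x1 - x0
    return ((v_theta(x_t, t, cond) - target) ** 2).mean()
```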

Global training objective:

L = L_{diff} + \lambda_{ID} \cdot L_{ID} + \lambda_{CL} \cdot L_{CL}

with \lambda_{ID} and \lambda_{CL} empirically set to 0.1.
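
Combining the pieces, the global objective is a simple weighted sum; the 0.1 defaults follow the text, and the arguments are the scalar values returned by the loss sketches above.

```python
def total_loss(l_diff, l_id, l_cl, lambda_id=0.1, lambda_cl=0.1):
    # Weighted sum of the diffusion, GT-aligned identity, and contrastive terms.
    return l_diff + lambda_id * l_id + lambda_cl * l_cl
```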

The diverse paired data in MultiID-2M is critical here, allowing the construction of a large, informative negative pool for contrastive learning and alleviating direct copying artifacts.

5. Quantitative and Qualitative Impact

Empirical findings illustrate notable advancements:

  • Models trained with MultiID-2M and the novel losses attain substantially higher Sim(GT) values compared to prior baselines, with marked reduction in copy-paste scores.
  • On single-person MultiID-Bench subsets, WithAnyone attains or surpasses state-of-the-art identity similarity while sharply decreasing undesirable reference copying.
  • Qualitative samples show adapted faces, with variation in pose, expression, and lighting, and without overfitting to reference features.

User studies reinforce quantitative results: participants rank WithAnyone superior in identity fidelity, controllability, and prompt adherence. The copy-paste metric demonstrates moderate correlation with human perceptual judgments, indicating utility for practical benchmarking.

6. Mathematical Formalisms and Evaluation Metrics

Essential mathematical expressions include:

| Expression | Purpose | Symbol Definitions |
| --- | --- | --- |
| \mathrm{Sim}(a, b) | Cosine feature similarity | a, b: ArcFace embeddings |
| M_{CP}(g \mid t, r) | Copy-paste artifact measure | \theta(a,b): angular distance; r: reference; t: ground truth; g: generated |
| L_{CL} | Identity contrastive loss | g: generated; r: reference; n_j: negatives; \tau: temperature |
| L_{ID} | GT-aligned identity loss | f(\cdot): ArcFace extractor, alignment to ground-truth landmarks |
| L_{diff} | Diffusion process loss | v_\theta: velocity predictor; x_t: noisy image; x_0: true image |
| L | Total training objective | \lambda_{ID}, \lambda_{CL}: balancing hyperparameters |

A plausible implication is that the diversity and quantity of paired data fundamentally expand the achievable space of identity-consistent, controllable generation.

7. Context, Limitations, and Future Directions

MultiID-2M uniquely addresses the copy-paste problem in identity-conditioned synthesis, offering a scale and pairing depth previously unavailable. This enables training paradigms, particularly contrastive identity learning, that are infeasible with single-person or unpaired collections. The co-release of a rigorously designed benchmark and metrics establishes fair, transparent comparative studies.

Potential limitations include bias towards frequently-occurring celebrities, inherent in web-crawled datasets, and the specificity to facial identities (using ArcFace), which may limit generalizability to non-celebrity or cross-domain identity tasks. Extension to broader population identity diversity and improved negative mining are plausible next steps.

MultiID-2M, together with the loss functions and benchmarks it enables, represents a key resource for advancing expressivity, controllability, and fidelity in multi-person identity-conditioned image generation for both research and practical deployment scenarios.
