
Identity-Labeled Gallery Framework

Updated 29 December 2025
  • Identity-labeled galleries are curated sets of media samples annotated with explicit identity labels, enabling fine-grained instance discrimination.
  • They support rigorous evaluation in tasks such as retrieval, verification, and generative assessment by employing specialized construction and labeling methodologies.
  • Robust design practices, including manual curation, sampling, and dynamic updating, are essential for ensuring accuracy and resilience against domain shift.

An identity-labeled gallery is a curated set of media samples (typically images) in which each sample is annotated with an explicit identity label, such as an individual, category, or fine-grained instance. These galleries are foundational to evaluation and deployment tasks in recognition, retrieval, verification, and generative model assessment, enabling rigorous measurement of identity-preserving properties. Gallery construction, labeling strategies, and the interplay with downstream algorithms are active areas of methodological development, with substantial implications for both accuracy and robustness across application domains.

1. Formal Structure and Core Concepts

An identity-labeled gallery is formally represented as $\mathcal{G} = \{(x_i, y_i)\}_{i=1}^{M}$, where each $x_i$ is a data sample (image, video clip, etc.) and each $y_i$ is its associated identity label (e.g., species, individual, class, or object instance) (Kilrain et al., 22 Dec 2025). The central characteristic is that the gallery supports discriminating not just semantic or categorical similarity, but instance-level or fine-grained identity information. These galleries may encode identities at multiple granularities—category, subclass, or unique instance—depending on the evaluation protocol and the underlying dataset (Kilrain et al., 22 Dec 2025, Yao et al., 2023).
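A minimal sketch of this structure, with illustrative names and toy labels (none of these identifiers come from the cited papers):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class GallerySample:
    """One gallery entry: a media sample plus its identity label y_i."""
    sample_id: str   # pointer to the underlying image or video clip (x_i)
    identity: str    # y_i: individual, category, or fine-grained instance

# G = {(x_i, y_i)}_{i=1}^{M}, here with M = 3 and two distinct identities
gallery = [
    GallerySample("img_001.jpg", "zebra_07"),
    GallerySample("img_002.jpg", "zebra_07"),
    GallerySample("img_003.jpg", "zebra_12"),
]

# The set of distinct labels defines what the gallery can discriminate.
identities = {s.identity for s in gallery}
```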

Identity-labeled galleries are essential for protocols including fine-grained retrieval, open-set identification and verification, person re-identification, and generative identity assessment.

2. Gallery Construction Methodologies

The construction of identity-labeled galleries is critical to downstream performance and robustness. Several protocols exist, tailored to different circumstances:

  • Manual curation and annotation: Direct labeling of samples, often with multiple annotators, is standard for domain-adaptation and exhibit-identification datasets such as Open MIC, which provides per-image labels for 866 classes (exhibits) and saliency-ordered multi-label annotations for query samples (Koniusz et al., 2018).
  • Sampling and cleansing: In large-scale face ID settings, raw collections are systematically cleaned through outlier detection (distance to identity centroid in the feature space), redundancy pruning (removing near-duplicates based on cosine similarity), and generative augmentation to cover feature-space holes (Roh et al., 2023).
  • Enriching with cross-modal features: Gallery enrichment incorporates external cues, such as adding images from a query set to the gallery after matching face features using an unsupervised face identification algorithm, increasing gallery coverage and identity robustness in clothes-changing person re-ID (Arkushin et al., 2022).
  • Dynamic/incremental construction: In camera-incremental object ReID, the gallery (identity memory) evolves by merging per-camera means and incorporating new identities on-the-fly via cosine-based cycle-consistent association between new and historical galleries (Yao et al., 2023).
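The sampling-and-cleansing step above can be sketched as follows. This is a toy illustration of the two pruning criteria (distance to the identity centroid, then cosine-similarity deduplication); the thresholds and function names are assumptions, and the actual pipeline in Roh et al. (2023) is more involved:

```python
import math

def cosine(u, v):
    """Cosine similarity for unit-norm vectors (plain dot product)."""
    return sum(a * b for a, b in zip(u, v))

def clean_identity_gallery(feats, outlier_thresh=1.2, dup_thresh=0.95):
    """Prune one identity's gallery of unit-norm feature vectors.

    Mirrors the cleansing recipe: (1) drop outliers by distance to the
    identity centroid, (2) drop near-duplicates by cosine similarity.
    Returns the indices of the samples that are kept.
    """
    dim = len(feats[0])
    centroid = [sum(f[d] for f in feats) / len(feats) for d in range(dim)]
    norm = math.sqrt(sum(c * c for c in centroid)) or 1.0
    centroid = [c / norm for c in centroid]

    # 1) Outlier removal: distance to the (normalized) identity centroid.
    keep = [
        i for i, f in enumerate(feats)
        if math.dist(f, centroid) <= outlier_thresh
    ]

    # 2) Redundancy pruning: greedily keep a sample only if it is not a
    #    near-duplicate of anything already kept.
    pruned = []
    for i in keep:
        if all(cosine(feats[i], feats[j]) < dup_thresh for j in pruned):
            pruned.append(i)
    return pruned
```

For example, with two near-duplicate embeddings and one distinct one, the second duplicate is pruned while both distinct samples survive.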

A summary of construction methods:

| Task Domain | Gallery Construction Approach | Reference |
|---|---|---|
| Face ID | Outlier removal, redundancy pruning, GAN/VAE feature augmentation | (Roh et al., 2023) |
| Clothes-changing ReID | Gallery enrichment via face-identity matching | (Arkushin et al., 2022) |
| Multi-camera ReID | Cycle-consistent identity memory merge | (Yao et al., 2023) |
| Few-shot face recognition | Mean computation, semi-supervised rectification | (Gao et al., 2016) |
| Museum exhibit ID | Manual per-image labeling with multi-label queries | (Koniusz et al., 2018) |

3. Algorithmic Use in Evaluation Protocols

Identity-labeled galleries are the backbone of modern evaluation and benchmarking protocols:

Fine-Grained Retrieval and mAP Measurement

In the Finer-Personalization Rank protocol, each generated or probe image $x_{\mathrm{gen}}$ is projected into an embedding space via an encoder $E$, and its cosine similarity to each gallery member is computed. The gallery is ranked by similarity, and identity preservation is measured by mean average precision (mAP), which sensitively detects whether instance-specific details (such as unique markings) have been maintained. This process is deterministic given the gallery and encoder (Kilrain et al., 22 Dec 2025).
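The ranking-plus-mAP computation can be sketched as follows. The toy embeddings and labels are placeholders (the protocol itself relies on a specialized encoder, which is omitted here):

```python
def rank_gallery(probe_emb, gallery_embs):
    """Indices of gallery members sorted by descending cosine similarity.
    Embeddings are assumed unit-norm, so the dot product is the cosine."""
    sims = [sum(p * g for p, g in zip(probe_emb, emb)) for emb in gallery_embs]
    return sorted(range(len(sims)), key=lambda i: -sims[i])

def average_precision(ranked_labels, target):
    """AP for one probe: precision is accumulated at each rank where a
    co-identity gallery sample appears."""
    hits, precisions = 0, []
    for rank, label in enumerate(ranked_labels, start=1):
        if label == target:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / hits if hits else 0.0

# Toy gallery: two samples of identity "A", one of identity "B".
gallery_embs = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]
gallery_labels = ["A", "B", "A"]

order = rank_gallery([0.6, 0.8], gallery_embs)
ap = average_precision([gallery_labels[i] for i in order], "A")
```

Averaging `ap` over all probes yields the mAP; a generator that drops instance-specific details pushes co-identity samples down the ranking and lowers this score.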

For open-set face identification, the gallery comprises multiple samples per identity. A probe is matched against the gallery, and the rank positions of co-identity samples are used as features. An MLP trained on these rank vectors discriminates in-gallery (true identity present) versus out-of-gallery (false match) probes, outperforming classic score-thresholding, and remaining stable under severe probe degradations (Bhatta et al., 8 Aug 2025).
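A minimal sketch of the rank-feature idea: the positions of co-identity samples near the top of the ranking form the input to the classifier. The function name and the top-k indicator encoding are assumptions for illustration; the MLP itself is omitted:

```python
def rank_feature_vector(ranked_gallery_ids, claimed_id, k=5):
    """Open-set feature: indicators of where samples of the claimed
    identity appear within the top-k ranks. In-gallery probes
    concentrate co-identity samples at low ranks; out-of-gallery probes
    scatter them, which is what the downstream classifier exploits."""
    return [
        1.0 if gid == claimed_id else 0.0
        for gid in ranked_gallery_ids[:k]
    ]
```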

Incremental ReID settings require galleries (identity memories) that can dynamically merge and expand as new camera streams arrive. Cycle-consistent best-match strategies are employed to associate new identities with historical ones, followed by momentum merging for existing identities and expansion for newly discovered ones. This maintains continuity of identity representation across dynamically varying acquisition conditions (Yao et al., 2023).
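The merge-or-expand step can be sketched as follows, assuming each identity is represented by a unit-norm mean vector. The momentum value, the match threshold, and all names are illustrative, not the settings of Yao et al. (2023):

```python
def cycle_consistent_merge(history, new_means, momentum=0.9, match_thresh=0.7):
    """history, new_means: dicts mapping identity key -> unit-norm mean
    vector. A new identity is momentum-merged into its best historical
    match only if the match is mutual (cycle-consistent) and similar
    enough; otherwise it is added to the memory as a new identity."""
    def cos(u, v):
        return sum(a * b for a, b in zip(u, v))

    def best(vec, pool):
        return max(pool, key=lambda k: cos(vec, pool[k])) if pool else None

    merged = dict(history)
    for nid, nvec in new_means.items():
        hid = best(nvec, history)
        # Cycle consistency: nid's best historical match must pick nid back.
        mutual = hid is not None and best(history[hid], new_means) == nid
        if mutual and cos(nvec, history[hid]) >= match_thresh:
            merged[hid] = [
                momentum * h + (1 - momentum) * n
                for h, n in zip(history[hid], nvec)
            ]
        else:
            merged[nid] = list(nvec)
    return merged
```

A matching new-camera identity nudges the historical mean; a non-matching one expands the memory.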

4. Empirical Impact of Gallery Properties

Empirical studies consistently demonstrate that properties of the gallery—coverage, diversity, purity, and granularity—directly drive the sensitivity and accuracy of downstream pipelines:

  • Finer-Personalization Rank benchmarking revealed that standard semantic similarity metrics (e.g., CLIP, DINO) are insensitive to omissions of discriminative instance details, whereas gallery-based mAP, especially with a specialized encoder, exposes identity drift, with method performance varying from 0.2 to 0.8 mAP compared to tightly clustered CLIP scores (0.8–0.9) (Kilrain et al., 22 Dec 2025).
  • Gallery curation via pruning reduced the average per-identity gallery size by 75%, decreased false negative identification rates at fixed FPIR by over 70%, and doubled the online search speed (Roh et al., 2023).
  • Enriching the gallery using face features—without retraining—yielded Top-1 clothes-changing accuracy gains of 33–55 percentage points across several ReID benchmarks, illustrating the utility of cross-modal enrichment for addressing hard confounders such as inter-session clothing variation (Arkushin et al., 2022).
  • Dynamic gallery approaches in camera-incremental ReID preserved discrimination by continually merging new identity means while distilling knowledge from both local and historical galleries (Yao et al., 2023).

5. Application-Specific Protocols

Specialized gallery construction and evaluation protocols adapt to application-specific requirements:

  • Fine-grained category/instance protocols: In CUB, Stanford Cars, and wildlife Re-ID, galleries are constructed from highly similar fine-grained categories or individual instances, with manual removal of near-duplicates and optional k-means clustering for diversity. This yields challenging discrimination tasks focused on subtle discriminative cues (Kilrain et al., 22 Dec 2025).
  • Open-set verification and attacks: Face verification systems using identity-labeled galleries are vulnerable to poisoning via multiple-identity images (MIIs), which exploit the geometry of embedding space to position attacks at the midpoint between two valid identities, so that both are accepted within a typical verification threshold (Andrews et al., 2019).
  • Semi-supervised refinement: When labeled samples per identity are scarce, semi-supervised sparse representation and Gaussian mixture models are used to estimate and refine gallery prototypes, employing both labeled and unlabeled data, as in S$^3$RC (Gao et al., 2016).
  • Domain adaptation: Galleries are partitioned into source (training conditions) and target (deployment domain), and feature/statistics alignment is performed per identity or class to compensate for domain shift, as in Open MIC (Koniusz et al., 2018).
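The geometric intuition behind the MII midpoint attack can be illustrated numerically with toy 2-D unit vectors (real face embeddings are high-dimensional, and the vectors below are arbitrary examples):

```python
import math

def unit(v):
    """L2-normalize a vector."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def cos(u, v):
    return sum(a * b for a, b in zip(u, v))

# Two enrolled identities, moderately separated on the unit sphere.
id_a = unit([1.0, 0.2])
id_b = unit([0.2, 1.0])

# Attack embedding: normalized midpoint of the two identity vectors.
attack = unit([(a + b) / 2 for a, b in zip(id_a, id_b)])

# The midpoint is equally similar to both identities, and closer to each
# than the identities are to one another -- so a single verification
# threshold that accepts genuine pairs can accept the attack against both.
sim_a, sim_b = cos(attack, id_a), cos(attack, id_b)
```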

6. Limitations, Recommendations, and Best Practices

Gallery-based methods, while powerful, require careful design and interpretation:

  • Copy-paste artifacts can trivially boost mAP (if a generative model simply replicates the reference), so mAP should be reported alongside prompt-following or generative suitability metrics (Kilrain et al., 22 Dec 2025).
  • Background removal is largely unnecessary with category-specialized embedding models; context can be maintained for realism (Kilrain et al., 22 Dec 2025).
  • Prompt-following and identity metrics should be paired: identity-labeled galleries are necessary for measuring preservation, but secondary metrics are required to ensure style or task compliance (Kilrain et al., 22 Dec 2025).
  • Automatic gallery construction (e.g., via sample pruning and generative augmentation) enhances scalability and reduces human error, but parameter selection (outlier and redundancy thresholds) must be tuned to data characteristics (Roh et al., 2023).
  • Open-world scenarios remain challenging; when a query's true identity is absent from the gallery, most current pipelines do not reliably flag this without auxiliary classifiers (Bhatta et al., 8 Aug 2025, Arkushin et al., 2022).

Identity-labeled gallery methodologies constitute a pillar of contemporary recognition, verification, and generative evaluation pipelines, providing a practical and principled mechanism for quantifying fine-grained identity preservation, enabling robust system design, and facilitating fair benchmarking under realistic deployment scenarios.
