Text-to-Image Person Retrieval (TIPR)
- Text-to-Image Person Retrieval (TIPR) is a cross-modal task that retrieves person images based on natural language queries using dual-transformer networks.
- It employs a symmetric InfoNCE contrastive loss with a low temperature, which concentrates the gradient on hard negatives and improves cross-modal alignment.
- Proximity Data Generation, including Stable Diffusion-based augmentation, approximate text perturbations, and feature mixup, significantly boosts retrieval accuracy.
Text-to-Image Person Retrieval (TIPR) is a cross-modal retrieval task where a free-form natural language query describing a person is used to retrieve the corresponding individual from a large gallery of images. The underlying challenge involves bridging the modality gap between textual and visual representations and resolving fine-grained semantic differences, often under constraints of limited annotated data. The literature has advanced TIPR through various architectural, learning, and data-centric innovations, with state-of-the-art approaches systematically addressing the core cross-modal alignment challenge and dataset sparsity using robust contrastive objectives and generative data augmentation.
1. Dual-Transformer Architectures and Cross-modal Contrastive Learning
Recent TIPR systems rely predominantly on dual-stream transformer architectures in which images and text are encoded by independent yet architecturally parallel transformer stacks, typically a Vision Transformer (ViT) for images and a BERT-style transformer for text (Wu et al., 2023). Both encoders use 12 layers with hidden dimension 768. The text encoder ingests up to 64 tokens with a prepended [SEM] token, and the final hidden state of the [SEM] token serves as the sentence embedding $t$. For the visual stream, resized RGB images are sliced into overlapping patches (stride 12), each linearly embedded and combined with positional encodings; the [SEM] token is prepended, and its output serves as the image embedding $v$.
At inference, cross-modal similarity is computed by cosine similarity, $s(v, t) = \frac{v^\top t}{\lVert v \rVert \, \lVert t \rVert}$. The retrieval task then reduces to ranking all gallery images for a given query sentence by $s(v, t)$.
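As a concrete illustration, the following PyTorch sketch ranks a gallery by cosine similarity given precomputed [SEM]-token embeddings; it abstracts away the encoder backbones and is not the authors' implementation.

```python
# Minimal retrieval sketch, assuming precomputed [SEM]-token embeddings from
# the two encoders; the encoder backbones themselves are abstracted away here.
import torch
import torch.nn.functional as F

@torch.no_grad()
def rank_gallery(text_emb: torch.Tensor, gallery_embs: torch.Tensor) -> torch.Tensor:
    """
    text_emb:     (D,)   sentence embedding t of the query caption
    gallery_embs: (G, D) image embeddings v for all gallery images
    Returns gallery indices sorted by cosine similarity s(v, t), descending.
    """
    t = F.normalize(text_emb, dim=-1)
    v = F.normalize(gallery_embs, dim=-1)
    sims = v @ t                      # cosine similarity for every gallery image
    return sims.argsort(descending=True)
```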
Supervision is provided via a symmetric InfoNCE contrastive loss with temperature $\tau$. For a batch of $N$ matched pairs $\{(v_i, t_i)\}_{i=1}^{N}$:

$$\mathcal{L} = -\frac{1}{2N}\sum_{i=1}^{N}\left[\log\frac{\exp(s(v_i,t_i)/\tau)}{\sum_{j=1}^{N}\exp(s(v_i,t_j)/\tau)} + \log\frac{\exp(s(v_i,t_i)/\tau)}{\sum_{j=1}^{N}\exp(s(v_j,t_i)/\tau)}\right]$$

A low $\tau$ sharpens the softmax, causing the gradient to focus on the hardest negatives, which mirrors hard-triplet-loss behavior.
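A minimal PyTorch sketch of a symmetric InfoNCE objective of this form is shown below; `tau` is left as a parameter rather than fixed to the paper's reported value.

```python
# Symmetric InfoNCE over a batch of matched image/text pairs.
import torch
import torch.nn.functional as F

def symmetric_infonce(img_embs: torch.Tensor, txt_embs: torch.Tensor, tau: float) -> torch.Tensor:
    """
    img_embs, txt_embs: (N, D) embeddings of N matched pairs, aligned so that
    row i of each tensor forms a positive pair.
    """
    v = F.normalize(img_embs, dim=-1)
    t = F.normalize(txt_embs, dim=-1)
    logits = (v @ t.T) / tau                       # (N, N) cosine similarities / tau
    targets = torch.arange(v.size(0), device=v.device)
    loss_i2t = F.cross_entropy(logits, targets)    # image -> text direction
    loss_t2i = F.cross_entropy(logits.T, targets)  # text -> image direction
    return 0.5 * (loss_i2t + loss_t2i)
```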
2. Proximity Data Generation (PDG): Automated Data Augmentation
A central innovation to address the data insufficiency problem is the Proximity Data Generation (PDG) module, which augments cross-modal training with semantically plausible but diverse text-image pairs (Wu et al., 2023). PDG includes three orthogonal components:
2.1 Stable Diffusion-Based Text-Image Augmentation
A fine-tuned Stable Diffusion model, adapted to the pedestrian domain, reconstructs each training image $I$ given its textual description $T$ and then edits key visual attributes (e.g., clothing color) to synthesize new image-text pairs $(I', T')$. The workflow is:
- Null-text inversion: recover the latent noise vector and optimized text context that reconstruct image $I$ from prompt $T$.
- Parse $T$ for noun phrases; randomly change, e.g., a clothing-color term.
- Generate a new image $I'$ from the edited prompt $T'$ via the text-to-image model, reusing the same latent noise.
- Add $(I', T')$ to the training set as a new aligned pair.
This process is fully automatic: no human curation or noise filtering is applied, a choice justified empirically by the downstream ranking improvements.
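The loop below sketches this augmentation step under stated assumptions: the checkpoint name `sd-pedestrian` and the helper `null_text_invert` are hypothetical stand-ins for the fine-tuned model and the null-text inversion procedure, and the attribute editor only swaps color words.

```python
# Sketch of the Stable Diffusion text-image augmentation loop (illustrative only).
import random
import torch
from diffusers import StableDiffusionPipeline

# "sd-pedestrian" is a hypothetical fine-tuned checkpoint name, not from the paper.
pipe = StableDiffusionPipeline.from_pretrained(
    "sd-pedestrian", torch_dtype=torch.float16
).to("cuda")

COLORS = ["red", "blue", "black", "white", "green", "yellow"]

def null_text_invert(pipe, image, caption):
    """Placeholder for null-text inversion: would return latents that
    reconstruct `image` from `caption`. Not implemented in this sketch."""
    raise NotImplementedError

def edit_prompt(caption: str) -> str:
    """Swap one clothing-color word in the caption (toy attribute editor)."""
    words = caption.split()
    idx = [i for i, w in enumerate(words) if w.lower() in COLORS]
    if not idx:
        return caption
    i = random.choice(idx)
    words[i] = random.choice([c for c in COLORS if c != words[i].lower()])
    return " ".join(words)

def augment(image, caption):
    # 1) Recover latents that reproduce `image` from `caption`.
    latents = null_text_invert(pipe, image, caption)
    # 2) Edit the prompt (e.g., change a clothing color).
    new_caption = edit_prompt(caption)
    # 3) Re-generate with the same latents so identity and pose are preserved.
    new_image = pipe(new_caption, latents=latents).images[0]
    return new_image, new_caption
```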
2.2 Approximate Text Generation
Stochastic, conservative text perturbations are applied to the original captions to create perturbed captions $\tilde{T}$, sampling among:
- SDEL: delete random single words;
- CDEL: delete random contiguous spans;
- REPL: replace selected words via WordNet synonyms.
Each perturbation affects a fixed fraction of the words. For samples with a perturbed caption $\tilde{T}$, a hinge-style regularization enforces that the image embedding stays more similar to its perturbed caption than to a randomly sampled negative text.
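For illustration, a possible implementation of the three perturbations is sketched below; the default `ratio` and the use of NLTK WordNet for synonyms are assumptions, not the paper's exact settings.

```python
# Illustrative sketch of the three caption perturbations (SDEL / CDEL / REPL).
import random
from nltk.corpus import wordnet  # requires: nltk.download("wordnet")

def sdel(words, ratio):
    """SDEL: delete randomly chosen single words."""
    k = max(1, int(len(words) * ratio))
    drop = set(random.sample(range(len(words)), k))
    return [w for i, w in enumerate(words) if i not in drop]

def cdel(words, ratio):
    """CDEL: delete one random contiguous span."""
    k = max(1, int(len(words) * ratio))
    start = random.randrange(0, max(1, len(words) - k))
    return words[:start] + words[start + k:]

def repl(words, ratio):
    """REPL: replace randomly chosen words with WordNet synonyms when available."""
    k = max(1, int(len(words) * ratio))
    out = list(words)
    for i in random.sample(range(len(words)), k):
        lemmas = [l.name().replace("_", " ")
                  for s in wordnet.synsets(out[i]) for l in s.lemmas()]
        lemmas = [l for l in lemmas if l.lower() != out[i].lower()]
        if lemmas:
            out[i] = random.choice(lemmas)
    return out

def approximate_text(caption: str, ratio: float = 0.1) -> str:
    """Apply one randomly sampled perturbation; `ratio` is a placeholder value."""
    words = caption.split()
    if not words:
        return caption
    return " ".join(random.choice([sdel, cdel, repl])(words, ratio))
```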
2.3 Feature-Level Mixup
With probability 0.5 per batch, paired image and text hidden states at the first transformer layer are linearly mixed with a fixed mixing coefficient $\lambda$.
These mixed representations propagate through the stack and are treated as new positive pairs.
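A sketch of one plausible realization of this mixup step follows; the cross-example pairing via a shared random permutation is an assumption made for illustration, and `lam` is left as a parameter.

```python
# Feature-level mixup on first-layer hidden states (one plausible realization).
import torch

def feature_mixup(img_hidden: torch.Tensor, txt_hidden: torch.Tensor,
                  lam: float, p: float = 0.5):
    """
    img_hidden: (B, N_img, D) image hidden states after the first transformer layer
    txt_hidden: (B, N_txt, D) text hidden states after the first transformer layer
    With probability p, mixes each example with a randomly permuted partner
    using a fixed `lam`; the mixed states are later treated as new positive pairs.
    """
    if torch.rand(1).item() > p:          # skip mixup for this batch with prob. 1 - p
        return img_hidden, txt_hidden
    perm = torch.randperm(img_hidden.size(0), device=img_hidden.device)
    img_mix = lam * img_hidden + (1 - lam) * img_hidden[perm]
    txt_mix = lam * txt_hidden + (1 - lam) * txt_hidden[perm]
    return img_mix, txt_mix
```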
3. Empirical Evaluation and Ablation
All three augmentation strategies yield significant, complementary benefits. For CUHK-PEDES, compared to the baseline dual-Transformer + contrastive (Top-1 = 65.49%):
- +Text-Image Gen: Top-1 67.85% (+2.36)
- +Approx Text Gen: Top-1 67.52% (+2.03)
- +Feature Mixup: Top-1 66.27% (+0.78)
- Any combination yields further gains; using all three (full PDG): Top-1 69.47% (+3.98).
PDG sample quantity was swept: optimal performance (Top-1 = 70.37%) is reached at 5 synthetic pairs per identity; diminishing returns with further increases suggest a balance between diversity and over-saturation.
4. Implementation Details and Hyperparameters
- Image encoder: ViT-Base (12 layers, hidden dimension 768); overlapping patch embedding with stride 12.
- Text encoder: BERT-Base (12 layers, hidden dimension 768), maximum length 64 tokens.
- Training: 4 × V100 GPUs, batch size 40 per GPU, Adam optimizer, 70 epochs, 10-epoch warmup, learning rate decayed by 0.1 every 20 epochs.
- Fixed hyperparameters: contrastive loss temperature $\tau$, hinge-regularization margin, mixup coefficient $\lambda$, and approximate-text perturbation fraction.
- PDG: one (default) to five samples per identity.
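A minimal PyTorch sketch of the learning-rate schedule (warmup plus step decay) follows; `base_lr` is a placeholder since the initial learning rate is not reproduced here, and the exact decay epochs are an interpretation of the schedule described above.

```python
# Warmup + step-decay schedule matching the description above (sketch only).
import torch

def build_optimizer_and_scheduler(model: torch.nn.Module, base_lr: float):
    optimizer = torch.optim.Adam(model.parameters(), lr=base_lr)

    def lr_lambda(epoch: int) -> float:
        if epoch < 10:                    # linear warmup over the first 10 epochs
            return (epoch + 1) / 10
        return 0.1 ** (epoch // 20)       # x0.1 at epochs 20, 40, 60

    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
    return optimizer, scheduler
```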
5. Benchmark Results and Comparative Performance
The method establishes new state-of-the-art results on standard TIPR/TBPS benchmarks:
CUHK-PEDES (test set, 1,000 IDs, 3,074 images):
| Method | Top-1 | Top-5 | Top-10 | mAP |
|---|---|---|---|---|
| Previous best (IVT) | 65.59 | 83.11 | 89.21 | — |
| Ours (full PDG) | 69.47 | 87.13 | 92.13 | 60.56 |
ICFG-PEDES (test set, 1,000 IDs):
| Method | Top-1 | Top-5 | Top-10 |
|---|---|---|---|
| IVT | 56.04 | 73.60 | 80.22 |
| Ours | 57.69 | 75.79 | 82.67 |
These results represent improvements of +3.88, +4.02, and +2.92 percentage points in Top-1, Top-5, and Top-10, respectively, over the previous state of the art on CUHK-PEDES.
6. Theoretical and Practical Significance
The dual-transformer, hardness-aware contrastive learning strategy, augmented by on-the-fly PDG, represents a departure from the prior focus on elaborate local alignment modules or auxiliary supervisory heads. Notably, all local or generative augmentations introduced by the PDG module are directly integrated into training batches without explicit noise filtering or human judgement, and ablation experiments support their robustness (Wu et al., 2023).
From a practical standpoint, the implementation is straightforward:
- No specialized cross-modal branches or detectors are required;
- All PDG augmentations are implemented as batch preprocessing pipelines, compatible with distributed GPU training;
- At inference, the system reduces to dual-encoder forward passes plus a single embedding similarity computation per gallery-image/query pair, maintaining compute scalability for large-scale datasets.
This combination of (i) dual “pure” Transformer streams, (ii) small-$\tau$ InfoNCE that focuses on hard negatives, and (iii) a threefold proximity data expansion pipeline delivers high retrieval accuracy, strong data efficiency, and reproducibility in the context of text-based person identification.