
Text-to-Image Person Retrieval (TIPR)

Updated 17 November 2025
  • Text-to-Image Person Retrieval (TIPR) is a cross-modal task that retrieves person images based on natural language queries using dual-transformer networks.
  • It employs a low-temperature InfoNCE contrastive loss that concentrates gradients on hard negatives for improved cross-modal alignment.
  • Proximity Data Generation, including Stable Diffusion-based augmentation, approximate text perturbations, and feature mixup, significantly boosts retrieval accuracy.

Text-to-Image Person Retrieval (TIPR) is a cross-modal retrieval task where a free-form natural language query describing a person is used to retrieve the corresponding individual from a large gallery of images. The underlying challenge involves bridging the modality gap between textual and visual representations and resolving fine-grained semantic differences, often under constraints of limited annotated data. The literature has advanced TIPR through various architectural, learning, and data-centric innovations, with state-of-the-art approaches systematically addressing the core cross-modal alignment challenge and dataset sparsity using robust contrastive objectives and generative data augmentation.

1. Dual-Transformer Architectures and Cross-modal Contrastive Learning

Recent TIPR systems rely predominantly on dual-stream transformer architectures, with each modality (images and text) encoded by independent yet architecturally parallel transformer blocks, typically Vision Transformers (ViT) for images and BERT-style transformers for text (Wu et al., 2023). Both encoders have $L=12$ layers with hidden dimension $D=768$. The text encoder ingests up to 64 tokens (with a prepended [SEM] token), and the final state of the [SEM] token is used as the sentence embedding $f_T$. For the visual stream, RGB images (resized to $384 \times 128$) are sliced into overlapping $16 \times 16$ patches (stride 12), each linearly embedded and combined with positional encodings; the [SEM] token is prepended, and its output $z_L^0$ serves as the image embedding $f_I$.
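
A minimal dual-encoder sketch using stock Hugging Face backbones is shown below. Note that the model described above uses overlapping patches (stride 12), a $384 \times 128$ input resolution, and a [SEM] token, whereas off-the-shelf ViT-Base uses non-overlapping patches and a [CLS] token, so this is only an approximation of the encoders, not the exact architecture.

```python
import torch
from transformers import BertModel, BertTokenizer, ViTModel

# Approximate dual-stream encoders (ViT-Base + BERT-Base, 12 layers, D=768).
# The paper's encoders use overlapping patches and a [SEM] token; stock ViT
# does not, so treat this as an illustrative stand-in.
text_encoder = BertModel.from_pretrained("bert-base-uncased")
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
image_encoder = ViTModel.from_pretrained("google/vit-base-patch16-224")

@torch.no_grad()
def encode_text(sentences):
    tokens = tokenizer(sentences, padding=True, truncation=True,
                       max_length=64, return_tensors="pt")
    out = text_encoder(**tokens)
    return out.last_hidden_state[:, 0]      # first-token state as f_T

@torch.no_grad()
def encode_image(pixel_values):             # (B, 3, H, W), ViT-preprocessed
    out = image_encoder(pixel_values=pixel_values)
    return out.last_hidden_state[:, 0]      # [CLS]-token state as f_I
```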

At inference, cross-modal similarity is measured by the cosine similarity $S(s, x) = \frac{f_T \cdot f_I}{\|f_T\|\,\|f_I\|}$. The retrieval task reduces to ranking all gallery images for a given query sentence by $S(s, x)$.
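
Ranking then amounts to a single cosine-similarity computation between the query embedding and all precomputed gallery embeddings. A short sketch (variable names are illustrative):

```python
import torch
import torch.nn.functional as F

def rank_gallery(f_T: torch.Tensor, gallery_f_I: torch.Tensor):
    """Rank all gallery images for one text query by cosine similarity S(s, x).

    f_T:         (D,) embedding of the query sentence
    gallery_f_I: (M, D) embeddings of the M gallery images
    Returns indices sorted from best to worst match, plus the scores.
    """
    sims = F.cosine_similarity(f_T.unsqueeze(0), gallery_f_I, dim=-1)  # (M,)
    order = torch.argsort(sims, descending=True)
    return order, sims
```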

Supervision is provided via a symmetric InfoNCE contrastive loss with temperature $\tau$. For a batch of $N$ matched pairs $(s_j, x_j)$:

$$\mathcal{L}_{t2i}(j) = -\log \frac{\exp(f_T^j \cdot f_I^j / \tau)}{\sum_{k=1}^{N} \exp(f_T^j \cdot f_I^k / \tau)}, \qquad \mathcal{L}_{i2t}(j) = -\log \frac{\exp(f_I^j \cdot f_T^j / \tau)}{\sum_{k=1}^{N} \exp(f_I^j \cdot f_T^k / \tau)},$$

$$\mathcal{L}_{\mathrm{contrast}} = \sum_{j=1}^{N} \left[ \mathcal{L}_{t2i}(j) + \mathcal{L}_{i2t}(j) \right].$$

A low $\tau$ (e.g., $\tau = 0.005$) sharpens the loss, causing the gradient to focus on the hardest negatives, which mirrors hard triplet loss behavior.
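
This loss maps directly onto a cross-entropy over the batch similarity matrix. A compact PyTorch sketch (assuming, as the cosine-similarity inference rule suggests, that embeddings are L2-normalized before the dot products, which the text does not state explicitly):

```python
import torch
import torch.nn.functional as F

def symmetric_infonce(f_T: torch.Tensor, f_I: torch.Tensor, tau: float = 0.005):
    """Symmetric InfoNCE for N matched (text, image) pairs; diagonal = positives."""
    f_T = F.normalize(f_T, dim=-1)               # assumption: L2-normalized embeddings
    f_I = F.normalize(f_I, dim=-1)
    logits = f_T @ f_I.t() / tau                 # (N, N) scaled similarity matrix
    targets = torch.arange(f_T.size(0), device=f_T.device)
    loss_t2i = F.cross_entropy(logits, targets, reduction="sum")
    loss_i2t = F.cross_entropy(logits.t(), targets, reduction="sum")
    return loss_t2i + loss_i2t                   # summed over the batch, as in L_contrast
```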

2. Proximity Data Generation (PDG): Automated Data Augmentation

A central innovation to address the data insufficiency problem is the Proximity Data Generation (PDG) module, which augments cross-modal training with semantically plausible but diverse text-image pairs (Wu et al., 2023). PDG includes three orthogonal components:

2.1 Stable Diffusion-Based Text-Image Augmentation

A fine-tuned Stable Diffusion model, adapted to the pedestrian domain, reconstructs each training image from its textual description and then edits key visual attributes (e.g., clothing color) to synthesize new text-image pairs $(s', x')$. The workflow is as follows (a code sketch appears after the list):

  1. Null-text inversion: recover the latent noise vector and optimized text context that reconstruct image $x$ from text $s$.
  2. Parse $s$ for noun phrases; randomly change, e.g., a clothing-color term.
  3. Edit the prompt $s \to s'$ and generate a new image $x'$ with the text-to-image model, reusing the same latent noise.
  4. Add $(s', x')$ as a new aligned pair.
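
Below is a minimal orchestration sketch of this loop. The caption-editing step is implemented concretely with a fixed color vocabulary; `null_text_invert` and `generate_from_latents` are hypothetical placeholders for the fine-tuned Stable Diffusion pipeline with null-text inversion, not real library calls.

```python
import random
import re

# Hypothetical colour vocabulary; the actual method parses noun phrases from
# the caption rather than matching a fixed list.
COLORS = ["red", "blue", "green", "black", "white", "yellow", "purple", "grey"]

def edit_caption(caption: str) -> str:
    """Randomly swap one clothing-colour term in the caption (steps 2-3)."""
    present = [c for c in COLORS if re.search(rf"\b{c}\b", caption)]
    if not present:
        return caption                       # nothing editable; keep the original pair
    old = random.choice(present)
    new = random.choice([c for c in COLORS if c != old])
    return re.sub(rf"\b{old}\b", new, caption, count=1)

def null_text_invert(pipeline, image, caption):
    """Placeholder for step 1: DDIM inversion with null-text optimization."""
    raise NotImplementedError("hook up a fine-tuned Stable Diffusion pipeline here")

def generate_from_latents(pipeline, caption, latents, null_ctx):
    """Placeholder for step 3: regenerate the image from the edited prompt."""
    raise NotImplementedError("hook up a fine-tuned Stable Diffusion pipeline here")

def make_proximity_pair(image, caption, sd_pipeline):
    """One PDG text-image augmentation step; helper names are hypothetical."""
    latents, null_ctx = null_text_invert(sd_pipeline, image, caption)   # step 1
    new_caption = edit_caption(caption)                                 # steps 2-3
    new_image = generate_from_latents(sd_pipeline, new_caption,         # step 3
                                      latents, null_ctx)
    return new_caption, new_image                                       # step 4
```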

This process is fully automatic: no human curation or noise filtering is used, justified empirically by downstream ranking performance improvements.

2.2 Approximate Text Generation

Stochastic, conservative text perturbations are applied to the original captions to create $s'$, sampling among the following operations (a code sketch follows the list):

  • SDEL: delete random single words;
  • CDEL: delete random contiguous spans;
  • REPL: replace selected words via WordNet synonyms.
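
A simplified sketch of these perturbations. The exact word-selection rules are assumptions; REPL uses NLTK's WordNet, which requires `nltk.download('wordnet')` beforehand.

```python
import random
from nltk.corpus import wordnet

def approx_text(caption: str, sigma: float = 0.2) -> str:
    """Apply one randomly chosen perturbation (SDEL / CDEL / REPL) to ~sigma of the words."""
    words = caption.split()
    if not words:
        return caption
    k = max(1, int(sigma * len(words)))
    op = random.choice(["SDEL", "CDEL", "REPL"])
    if op == "SDEL":                                   # delete k random single words
        drop = set(random.sample(range(len(words)), k))
        words = [w for i, w in enumerate(words) if i not in drop]
    elif op == "CDEL":                                 # delete one contiguous span of k words
        start = random.randrange(len(words) - k + 1)
        words = words[:start] + words[start + k:]
    else:                                              # REPL: swap k words for WordNet synonyms
        for i in random.sample(range(len(words)), k):
            syns = {l.name().replace("_", " ")
                    for s in wordnet.synsets(words[i]) for l in s.lemmas()}
            syns.discard(words[i])
            if syns:
                words[i] = random.choice(sorted(syns))
    return " ".join(words)
```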

Each perturbation affects a fraction $\sigma = 0.2$ of the caption's words. For samples with a perturbed $s'$, a hinge-style regularization enforces

$$S(s, x) > S(s', x) > S(s^-, x)$$

where $s^-$ is a randomly sampled negative text.
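
The text states only this ordering constraint; one way to turn it into a differentiable penalty is a pairwise hinge, sketched below (the zero margin is an assumption; the term is weighted by $\mu = 0.1$ in the total loss):

```python
import torch

def proximity_hinge(s_pos, s_pert, s_neg, margin: float = 0.0):
    """Encourage S(s, x) > S(s', x) > S(s^-, x) for each sample in the batch.

    s_pos, s_pert, s_neg: (N,) similarities of the original, perturbed, and
    negative captions against the paired image. The margin is an assumption.
    """
    upper = torch.clamp(s_pert - s_pos + margin, min=0)   # violated if S(s',x) >= S(s,x)
    lower = torch.clamp(s_neg - s_pert + margin, min=0)   # violated if S(s^-,x) >= S(s',x)
    return (upper + lower).mean()
```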

2.3 Feature-Level Mixup

With probability 0.5 per batch, the first-layer hidden states of two paired image/text samples are linearly mixed: $\hat{z}_1^T = \lambda z_{1,1}^T + (1-\lambda)\, z_{1,2}^T, \quad \hat{z}_1^I = \lambda z_{1,1}^I + (1-\lambda)\, z_{1,2}^I$, with fixed $\lambda = 0.5$.

These mixed representations propagate through the stack and are treated as new positive pairs.
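
A minimal sketch of this mixup, assuming the first-layer hidden states of two training pairs are available as tensors:

```python
import torch

def feature_mixup(z1_text, z2_text, z1_img, z2_img, lam: float = 0.5, p: float = 0.5):
    """Mix first-layer text/image hidden states of two pairs with fixed lambda = 0.5.

    Applied with probability p per batch; returns None when no mixup is done.
    The mixed states continue through the remaining transformer layers and are
    treated as a new positive pair.
    """
    if torch.rand(1).item() > p:
        return None
    z_text = lam * z1_text + (1 - lam) * z2_text   # (B, seq_len_t, D)
    z_img = lam * z1_img + (1 - lam) * z2_img      # (B, seq_len_i, D)
    return z_text, z_img
```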

3. Empirical Evaluation and Ablation

All three augmentation strategies yield significant, complementary benefits. For CUHK-PEDES, compared to the baseline dual-Transformer + contrastive (Top-1 = 65.49%):

  • +Text-Image Gen: Top-1 → 67.85% (+2.36)
  • +Approx Text Gen: Top-1 → 67.52% (+2.03)
  • +Feature Mixup: Top-1 → 66.27% (+0.78)
  • Any combination yields further gains; using all three (full PDG): Top-1 → 69.47% (+3.98).

The number of PDG samples per identity was also swept: optimal performance (Top-1 = 70.37%) is reached at five synthetic pairs per identity, and diminishing returns beyond that point suggest a trade-off between diversity and over-saturation.

4. Implementation Details and Hyperparameters

  • Image encoder: ViT-Base (12 layers, $D=768$), patch size $P=16$, stride $w=12$.
  • Text encoder: BERT-Base (12 layers, $D=768$), max length 64 tokens.
  • Training: 4× V100 GPUs, batch size 40 per GPU, Adam optimizer, initial learning rate $1 \times 10^{-4}$, 70 epochs, LR decayed by 0.1 every 20 epochs, 10-epoch warmup.
  • Contrastive loss temperature $\tau = 0.005$; regularizer weight $\mu = 0.1$; mixup $\lambda = 0.5$; approximate-text fraction $\sigma = 0.2$.
  • PDG: one (default) to five $(s', x')$ samples per identity; these settings are collected in the config sketch below.
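
For reference, the hyperparameters above can be grouped into a single config object. This is purely an illustrative organization, not code from the paper:

```python
from dataclasses import dataclass

@dataclass
class TIPRConfig:
    # Encoders
    hidden_dim: int = 768
    num_layers: int = 12
    patch_size: int = 16
    patch_stride: int = 12
    max_text_len: int = 64
    # Optimization
    num_gpus: int = 4
    batch_size_per_gpu: int = 40
    lr: float = 1e-4
    epochs: int = 70
    lr_decay_every: int = 20       # multiply LR by 0.1 every 20 epochs
    warmup_epochs: int = 10
    # Loss and augmentation
    tau: float = 0.005             # contrastive temperature
    mu: float = 0.1                # hinge-regularizer weight
    mixup_lambda: float = 0.5
    approx_text_sigma: float = 0.2
    pdg_pairs_per_id: int = 1      # swept from 1 (default) up to 5
```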

5. Benchmark Results and Comparative Performance

The method establishes new state-of-the-art on standard TIPR/TBPS tasks:

CUHK-PEDES (test set, 1,000 IDs, 3,074 images):

| Method | Top-1 | Top-5 | Top-10 | mAP |
|---|---|---|---|---|
| Previous best (IVT) | 65.59 | 83.11 | 89.21 | – |
| Ours (full PDG) | 69.47 | 87.13 | 92.13 | 60.56 |

ICFG-PEDES (test set, 1,000 IDs):

| Method | Top-1 | Top-5 | Top-10 |
|---|---|---|---|
| IVT | 56.04 | 73.60 | 80.22 |
| Ours | 57.69 | 75.79 | 82.67 |

These results represent +3.88% Top-1, +4.02% Top-5, and +2.92% Top-10 improvement over the previous state-of-the-art on CUHK-PEDES.

6. Theoretical and Practical Significance

The dual-transformer, hardness-aware contrastive learning strategy, augmented by on-the-fly PDG, represents a departure from the prior focus on elaborate local alignment modules or auxiliary supervisory heads. Notably, all local or generative augmentations introduced by the PDG module are directly integrated into training batches without explicit noise filtering or human judgement, and ablation experiments support their robustness (Wu et al., 2023).

From a practical standpoint, the implementation is straightforward:

  • No specialized cross-modal branches or detectors are required;
  • All PDG augmentations are implemented as batch preprocessing pipelines, compatible with distributed GPU training;
  • At inference, the system reduces to dual-encoder forward passes plus a single embedding similarity computation per gallery-image/query pair, maintaining compute scalability for large-scale datasets.

This combination of (i) dual "pure" Transformer streams, (ii) small-$\tau$ InfoNCE that focuses on hard negatives, and (iii) a threefold proximity data expansion pipeline delivers high retrieval accuracy, strong data efficiency, and reproducibility in the context of text-based person identification.
