TGDT: Token-Guided Dual Transformer
- The paper introduces a unified dual-branch transformer that learns both coarse-grained and fine-grained representations for efficient, high-accuracy image-text retrieval, trained with a Consistent Multimodal Contrastive (CMC) loss.
- It employs two homogeneous transformer encoders to extract global and token-level features, coupled with a two-stage inference mechanism that combines precomputed global similarity with selective local re-ranking.
- Empirical analysis on benchmarks like Flickr30K and MS-COCO demonstrates that TGDT achieves state-of-the-art retrieval accuracy while drastically reducing inference time compared to traditional models.
The Token-Guided Dual Transformer (TGDT) is a unified dual-branch transformer architecture designed for efficient, high-accuracy image-text retrieval. By simultaneously learning coarse-grained (global) and fine-grained (local) representations for image and text modalities, TGDT enables both broad semantic understanding and precise regional-word alignment in retrieval settings. The framework introduces Consistent Multimodal Contrastive (CMC) training objectives to enforce intra- and inter-modal semantic consistency, and leverages an optimized two-stage inference method to achieve state-of-the-art retrieval accuracy with significantly reduced computational cost (Liu et al., 2023).
1. Architectural Framework
TGDT employs two homogeneous transformer encoder branches, one for images and one for text, each producing both global and token-level (local) embeddings. The image branch receives preprocessed tokens comprising a whole-image pooled global feature $v_{cls}$ and a set of regional features $\{v_i\}_{i=1}^{36}$ extracted from the top-36 proposals of a Faster R-CNN detector. The text branch ingests the [CLS] token from BERT as a global descriptor $t_{cls}$ and word-level embeddings $\{t_j\}_{j=1}^{L}$. Both branches utilize four-layer transformer encoders to generate output tokens: $\tilde{v}_{cls}, \tilde{v}_1, \dots, \tilde{v}_{36}$ for images and $\tilde{t}_{cls}, \tilde{t}_1, \dots, \tilde{t}_L$ for text, where $\tilde{v}_{cls}$ and $\tilde{t}_{cls}$ represent learned global embeddings, and $\tilde{v}_i$ and $\tilde{t}_j$ denote local regional and word embeddings, respectively.
TGDT integrates coarse-grained global retrieval, using cosine similarity between the global descriptors $\tilde{v}_{cls}$ and $\tilde{t}_{cls}$, with fine-grained local retrieval, which aligns each word token with the best-matching region token under cosine similarity. Both schemes are embedded in a joint optimization framework so that representation learning at the global and local scales is mutually reinforcing.
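The token layout of the two branches can be sketched as follows. This is a minimal illustration, not the paper's code: `encoder_stub` is a hypothetical stand-in for the four-layer transformer encoders (a real implementation would use trained attention layers), and the 256-dimensional feature size is an arbitrary choice.

```python
import numpy as np

def encoder_stub(tokens):
    # Placeholder for a 4-layer transformer encoder: a fixed linear map,
    # so the sketch stays self-contained and runnable.
    rng = np.random.default_rng(42)
    d = tokens.shape[-1]
    W = rng.normal(size=(d, d)) / np.sqrt(d)
    return tokens @ W

def image_branch(global_feat, region_feats):
    # Input sequence: [pooled global token, 36 region tokens]; the output
    # splits back into one global embedding and the local region embeddings.
    tokens = np.vstack([global_feat[None, :], region_feats])
    out = encoder_stub(tokens)
    return out[0], out[1:]          # v_cls_tilde, {v_i_tilde}

def text_branch(cls_feat, word_feats):
    # Same layout for text: [BERT [CLS] token, L word tokens].
    tokens = np.vstack([cls_feat[None, :], word_feats])
    out = encoder_stub(tokens)
    return out[0], out[1:]          # t_cls_tilde, {t_j_tilde}
```

The key structural point is that each branch emits its global and local embeddings jointly, so both granularities share one encoder.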
2. Mathematical Formulation
The embedding functions are defined as:

$$f_v: (v_{cls}, \{v_i\}_{i=1}^{36}) \mapsto (\tilde{v}_{cls}, \{\tilde{v}_i\}_{i=1}^{36}), \qquad f_t: (t_{cls}, \{t_j\}_{j=1}^{L}) \mapsto (\tilde{t}_{cls}, \{\tilde{t}_j\}_{j=1}^{L}).$$

Global similarity is computed via cosine similarity between the global image and text tokens:

$$S_g(I, T) = \cos(\tilde{v}_{cls}, \tilde{t}_{cls}) = \frac{\tilde{v}_{cls}^{\top} \tilde{t}_{cls}}{\lVert \tilde{v}_{cls} \rVert \, \lVert \tilde{t}_{cls} \rVert}.$$

Local similarity aligns each word token with its best-matching region token and averages over words:

$$S_l(I, T) = \frac{1}{L} \sum_{j=1}^{L} \max_{1 \le i \le 36} \cos(\tilde{v}_i, \tilde{t}_j).$$

During inference, a mixed similarity score is used for re-ranking:

$$S_m(I, T) = \alpha \, S_g(I, T) + (1 - \alpha) \, S_l(I, T),$$

with the mixing weight $\alpha$ set empirically in the studied implementation.
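A minimal NumPy sketch of the three scores follows; the function names and the default $\alpha = 0.5$ are illustrative assumptions, not the paper's released code.

```python
import numpy as np

def cosine(a, b):
    # Pairwise cosine similarity between rows of a and rows of b.
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return a @ b.T

def global_similarity(v_cls, t_cls):
    # S_g: cosine similarity between the two global tokens.
    return float(cosine(v_cls[None, :], t_cls[None, :])[0, 0])

def local_similarity(regions, words):
    # S_l: match each word to its best-aligned region, then average.
    sim = cosine(words, regions)          # (n_words, n_regions)
    return float(sim.max(axis=1).mean())

def mixed_similarity(v_cls, t_cls, regions, words, alpha=0.5):
    # S_m = alpha * S_g + (1 - alpha) * S_l; alpha=0.5 is an assumed default.
    return alpha * global_similarity(v_cls, t_cls) + \
           (1 - alpha) * local_similarity(regions, words)
```

Note that $S_g$ needs only the two precomputed global vectors, while $S_l$ requires all token-level embeddings; this asymmetry is what the two-stage inference below exploits.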
3. Consistent Multimodal Contrastive Loss
TGDT is trained with the Consistent Multimodal Contrastive (CMC) loss, which combines an inter-modal triplet loss and an intra-modal consistency term. This enforces that paired image-text representations are close in the joint embedding space, while mirroring the relative distances of positive and negative pairs across both modalities.
The inter-modal (triplet) loss for a modality-specific similarity measure $S$ is:

$$\mathcal{L}_{inter}(S) = \left[ \gamma - S(I, T) + S(I, \hat{T}) \right]_+ + \left[ \gamma - S(I, T) + S(\hat{I}, T) \right]_+,$$

where $\hat{T}$ and $\hat{I}$ represent hard negative examples, $[\cdot]_+ = \max(\cdot, 0)$, and $\gamma$ is a margin.
The intra-modal consistency term penalizes disagreement between the two modalities on the distance from anchors to their negatives:

$$\mathcal{L}_{intra}(S) = \left[ \left| S(I, \hat{I}) - S(T, \hat{T}) \right| - \epsilon \right]_+,$$

with a slack $\epsilon$.
The full CMC loss per similarity measure $S$ is:

$$\mathcal{L}_{CMC}(S) = \mathcal{L}_{inter}(S) + \mathcal{L}_{intra}(S).$$

The overall TGDT loss combines both global and local CMC losses:

$$\mathcal{L} = \mathcal{L}_{CMC}(S_g) + \mathcal{L}_{CMC}(S_l).$$
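The loss structure can be sketched for a single training pair as below. This is a hedged illustration: the scalar-input form, the intra-modal consistency term written as an absolute-difference hinge, and the default `margin` and `eps` values are assumptions made for the sketch, not values taken from the paper.

```python
def cmc_loss(s_pos, s_i2t_neg, s_t2i_neg, s_img_intra, s_txt_intra,
             margin=0.2, eps=0.05):
    """Sketch of the CMC loss for one similarity measure S.

    s_pos       : S(I, T) for the matched pair
    s_i2t_neg   : S(I, T_hat), hardest negative text for image I
    s_t2i_neg   : S(I_hat, T), hardest negative image for text T
    s_img_intra : S(I, I_hat), intra-modal image similarity
    s_txt_intra : S(T, T_hat), intra-modal text similarity
    margin, eps : triplet margin and consistency slack (assumed values)
    """
    hinge = lambda x: max(x, 0.0)
    # Inter-modal triplet loss with hard negatives in both directions.
    l_inter = (hinge(margin - s_pos + s_i2t_neg)
               + hinge(margin - s_pos + s_t2i_neg))
    # Intra-modal consistency: both modalities should agree, up to eps,
    # on how far the negatives sit from the anchors.
    l_intra = hinge(abs(s_img_intra - s_txt_intra) - eps)
    return l_inter + l_intra
```

In training, this loss would be evaluated once with the global similarity $S_g$ and once with the local similarity $S_l$, and the two results summed.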
4. Training and Optimization Details
For image preprocessing, TGDT extracts 36 object proposals per image using Faster R-CNN with 2048-dimensional bottom-up features, augmented with a globally pooled representation. Texts are tokenized and embedded using BERT, yielding 768-dimensional embeddings for the [CLS] and word tokens. Training employs the Adam optimizer with a constant learning rate, a batch size of 40, and 30 epochs. No curriculum learning is used; the global and local loss terms are optimized concurrently from initialization.
5. Two-Stage Inference Mechanism
TGDT's inference proceeds in two stages to optimize both efficiency and precision:
- Global Retrieval: Precompute and store the global embeddings of all images ($\tilde{v}_{cls}$) and texts ($\tilde{t}_{cls}$). For a given query, compute the cosine similarity $S_g$ against the full candidate set and select the top-$k$ matches for a small $k$.
- Local Re-ranking: For these $k$ candidates only, compute the local similarity score $S_l$ and the mixed similarity score $S_m = \alpha S_g + (1 - \alpha) S_l$. Re-rank the candidates by $S_m$ to generate the final retrieval list.
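The two stages can be sketched end to end as follows; the function name, the candidate-pool size `k`, and the mixing weight `alpha` are illustrative assumptions.

```python
import numpy as np

def two_stage_retrieval(query_global, cand_globals, query_words, cand_regions,
                        k=20, alpha=0.5):
    """Two-stage retrieval sketch: global top-k, then local re-ranking.

    query_global : (d,) global embedding of the text query
    cand_globals : (N, d) precomputed global image embeddings
    query_words  : (L, d) word-level embeddings of the query
    cand_regions : list of N region-embedding arrays, each (R_i, d)
    """
    def norm(x):
        return x / np.linalg.norm(x, axis=-1, keepdims=True)

    # Stage 1: coarse global retrieval over the full candidate set.
    s_g = norm(cand_globals) @ norm(query_global)          # (N,) cosine scores
    top_k = np.argsort(-s_g)[:k]

    # Stage 2: fine-grained re-ranking of the top-k candidates only.
    w = norm(query_words)
    s_m = {}
    for i in top_k:
        r = norm(cand_regions[i])
        s_l = (w @ r.T).max(axis=1).mean()                 # word -> best region
        s_m[i] = alpha * s_g[i] + (1 - alpha) * s_l
    return sorted(s_m, key=s_m.get, reverse=True)
```

Because the expensive token-level alignment runs over only $k$ candidates rather than all $N$, the per-query cost is dominated by one matrix-vector product against the precomputed global index.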
This approach achieves the search efficiency of global retrieval across the full dataset, while leveraging local alignment precision where it is most impactful. The result is state-of-the-art accuracy with an order-of-magnitude reduction in inference time compared to contemporary cross-attention-based fine-grained retrieval models.
6. Empirical Performance Analysis
TGDT demonstrates substantial empirical gains on standard image-text retrieval benchmarks. On Flickr30K (1K images, 5 captions each):
| Method | Text→Image R@1 | R@5 | R@10 | Image→Text R@1 | R@5 | R@10 |
|---|---|---|---|---|---|---|
| TGDT-G | 55.6% | 83.1% | 89.4% | 70.3% | 91.4% | 95.5% |
| TGDT-L | 61.3% | 86.0% | 91.4% | 76.8% | 93.2% | 96.4% |
| TGDT-GL | 66.7% | 92.2% | 97.0% | 79.6% | 96.9% | 99.0% |
TGDT-GL matches or surpasses the accuracy of recent state-of-the-art approaches. On the MS-COCO 1K and 5K test sets, analogous improvements are reported. Inference time is reported as ≈12 s for TGDT-G and ≈47 s for TGDT-GL, in contrast to ≈300–650 s for other high-performing cross-attention models; TGDT-GL thus achieves near state-of-the-art retrieval accuracy at roughly 1/6 to 1/14 of the inference time required by these alternatives.
7. Significance and Context
TGDT systematically bridges global and local cross-modal representation learning, offering a unified architecture that simultaneously leverages semantic breadth and detail. By enforcing semantic distance consistency across and within modalities under the CMC loss, TGDT mitigates trade-offs observed in prior coarse- or fine-grained retrieval paradigms. Its two-stage inference, exploiting precomputed global representations and selective local alignment, provides significant computational efficiency gains without sacrificing precision. These properties position TGDT as a reference architecture for scalable, accurate multimodal retrieval tasks and motivate further exploration of token-guided contrastive learning in cross-modal systems (Liu et al., 2023).