
TGDT: Token-Guided Dual Transformer

Updated 1 February 2026
  • The paper introduces a unified dual-branch transformer that learns both coarse-grained and fine-grained representations for efficient, high-accuracy image-text retrieval, trained with a Consistent Multimodal Contrastive (CMC) loss.
  • It employs two homogeneous transformer encoders to extract global and token-level features, coupled with a two-stage inference mechanism that combines precomputed global similarity with selective local re-ranking.
  • Empirical analysis on benchmarks like Flickr30K and MS-COCO demonstrates that TGDT achieves state-of-the-art retrieval accuracy while drastically reducing inference time compared to traditional models.

The Token-Guided Dual Transformer (TGDT) is a unified dual-branch transformer architecture designed for efficient, high-accuracy image-text retrieval. By simultaneously learning coarse-grained (global) and fine-grained (local) representations for image and text modalities, TGDT enables both broad semantic understanding and precise regional-word alignment in retrieval settings. The framework introduces Consistent Multimodal Contrastive (CMC) training objectives to enforce intra- and inter-modal semantic consistency, and leverages an optimized two-stage inference method to achieve state-of-the-art retrieval accuracy with significantly reduced computational cost (Liu et al., 2023).

1. Architectural Framework

TGDT employs two homogeneous transformer encoder branches, one for images and one for text, each producing both global and token-level (local) embeddings. The image branch receives preprocessed tokens comprising a whole-image pooled global feature ($v_0$) and a set of regional features ($v_1, \ldots, v_r$) extracted from the top-36 proposals of a Faster R-CNN detector. The text branch ingests the [CLS] token from BERT as a global descriptor ($l_0$) and word-level embeddings ($l_1, \ldots, l_w$). Both branches utilize four-layer transformer encoders to generate output tokens: $\{i_0, \ldots, i_r\}$ for images and $\{t_0, \ldots, t_w\}$ for text, where $i_0$ and $t_0$ represent learned global embeddings, and $i_{1..r}$ and $t_{1..w}$ denote local regional and word embeddings, respectively.
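The token layout of the two branches can be sketched with placeholder features. The 36-proposal count and the [CLS]-plus-words structure follow the paper; the shared width $d$, the caption length, and the random features are illustrative stand-ins for real encoder inputs:

```python
import numpy as np

d = 768          # shared embedding width (illustrative)
r, w = 36, 12    # 36 region proposals; a 12-word caption as an example

# Image branch input: whole-image pooled feature v0 plus r regional features v1..vr.
v0 = np.random.randn(1, d)
regions = np.random.randn(r, d)
V = np.concatenate([v0, regions], axis=0)   # shape (1 + r, d)

# Text branch input: BERT [CLS] embedding l0 plus w word embeddings l1..lw.
l0 = np.random.randn(1, d)
words = np.random.randn(w, d)
L = np.concatenate([l0, words], axis=0)     # shape (1 + w, d)

# Each four-layer transformer branch maps its inputs to output tokens of the
# same shape: i0 / t0 are the learned global embeddings, the rest are local.
print(V.shape, L.shape)
```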

TGDT integrates coarse-grained global retrieval, using cosine similarity between the global descriptors ($i_0$, $t_0$), and fine-grained local retrieval, aligning each word token with its best-matching region token via cosine similarity. Both schemes are embedded in a joint optimization framework to mutually reinforce representation learning across global and local scales.

2. Mathematical Formulation

The embedding functions are defined as: $I = f_I(V) = \mathrm{ITR}(V),\quad V = \{v_0,\ldots,v_r\},\qquad T = f_T(L) = \mathrm{TTR}(L),\quad L = \{l_0,\ldots,l_w\}$, where $\mathrm{ITR}$ and $\mathrm{TTR}$ denote the image and text transformer branches, respectively.

Global similarity is computed via cosine similarity between the global image and text tokens: $S_g(I, T) = \frac{\langle i_0, t_0\rangle}{\|i_0\|\,\|t_0\|}$

Local similarity is computed as: $M_{ij}(I,T) = \frac{\langle i_i, t_j\rangle}{\|i_i\|\,\|t_j\|}, \quad S_l(I, T) = \frac{1}{w}\sum_{j=1}^{w} \max_{1 \leq i \leq r} M_{ij}(I,T)$

During inference, a mixed similarity score is used for re-ranking: $S_{gl}(I, T) = (1-\theta)\,S_g(I, T) + \theta\,S_l(I, T)$, with $\theta = 0.5$ in the studied implementation.
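The three similarity scores can be sketched in numpy. The token matrices `I` and `T` below are random placeholders; in TGDT they would be the output tokens of the two encoder branches, with row 0 holding the global embedding:

```python
import numpy as np

def cosine(a, b):
    # Cosine similarity between two vectors.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def global_sim(I, T):
    # S_g: cosine similarity between the global tokens i0 and t0.
    return cosine(I[0], T[0])

def local_sim(I, T):
    # S_l: each word token t_j is matched to its best region token i_i,
    # and the per-word maxima are averaged over the w words.
    regions, words = I[1:], T[1:]
    Rn = regions / np.linalg.norm(regions, axis=1, keepdims=True)
    Wn = words / np.linalg.norm(words, axis=1, keepdims=True)
    M = Rn @ Wn.T                      # M[i, j] = cos(i_i, t_j)
    return float(M.max(axis=0).mean())

def mixed_sim(I, T, theta=0.5):
    # S_gl: convex combination used for re-ranking at inference.
    return (1 - theta) * global_sim(I, T) + theta * local_sim(I, T)

rng = np.random.default_rng(0)
I = rng.standard_normal((37, 64))      # 1 global + 36 region tokens
T = rng.standard_normal((13, 64))      # 1 global + 12 word tokens
print(round(mixed_sim(I, T), 4))
```

Setting `theta=0` recovers pure global retrieval, `theta=1` pure local matching.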

3. Consistent Multimodal Contrastive Loss

TGDT is trained with the Consistent Multimodal Contrastive (CMC) loss, which combines an inter-modal triplet loss and an intra-modal consistency term. This enforces that paired image-text representations are close in the joint embedding space, while mirroring the relative distances of positive and negative pairs across both modalities.

The inter-modal (triplet) loss for a modality-specific similarity measure $S$ is: $L_r(I, T) = \max\bigl(0, \delta - S(I, T) + S(I, T_{l^-})\bigr) + \max\bigl(0, \delta - S(I, T) + S(I_{v^-}, T)\bigr)$, where $T_{l^-}$ and $I_{v^-}$ represent hard negative examples and $\delta = 0.2$ is the margin.

The intra-modal consistency term is: $L_a(I, T) = \max\bigl(0,\,|S(I,I_{l^-}) - S(T,T_{l^-})| - \sigma\bigr) + \max\bigl(0,\,|S(I,I_{v^-}) - S(T,T_{v^-})| - \sigma\bigr)$, with slack $\sigma = 0.3$.

The full CMC loss per similarity measure $S$ is $L_{\mathrm{cmc}}(S) = L_r(S) + L_a(S)$, and the overall TGDT loss combines both global and local CMC losses: $L_{\mathrm{TGDT}} = L_{\mathrm{cmc}}(S_g) + L_{\mathrm{cmc}}(S_l)$.
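For a single pair with one hard negative per modality, the CMC loss can be sketched as below. The similarity function is a stand-in for either $S_g$ or $S_l$ (here plain cosine over vectors), and the reading of $I_{l^-}$ / $T_{v^-}$ as the counterparts paired with the hard-negative caption and image, respectively, is an assumption made for this sketch:

```python
import numpy as np

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def cmc_loss(S, I, T, T_ln, I_ln, I_vn, T_vn, delta=0.2, sigma=0.3):
    # L_r: inter-modal triplet loss with hard negatives T_ln (text) and I_vn
    # (image); the matched pair must beat each negative by margin delta.
    L_r = (max(0.0, delta - S(I, T) + S(I, T_ln))
           + max(0.0, delta - S(I, T) + S(I_vn, T)))
    # L_a: intra-modal consistency -- image-side and text-side distances to
    # each negative pair must agree within slack sigma.
    L_a = (max(0.0, abs(S(I, I_ln) - S(T, T_ln)) - sigma)
           + max(0.0, abs(S(I, I_vn) - S(T, T_vn)) - sigma))
    return L_r + L_a

rng = np.random.default_rng(1)
vecs = {k: rng.standard_normal(64) for k in "I T T_ln I_ln I_vn T_vn".split()}
loss = cmc_loss(cos, **vecs)
print(round(loss, 4))
```

In training, the same function would be evaluated once with the global similarity and once with the local similarity, and the two results summed.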

4. Training and Optimization Details

For image preprocessing, TGDT extracts 36 object proposals per image using Faster R-CNN with 2048-dimensional bottom-up features, augmented with a global pooled representation. Texts are tokenized and embedded using BERT, yielding 768-dimensional embeddings for the [CLS] and word tokens. Training employs the Adam optimizer ($\beta_1 = 0.9$, $\beta_2 = 0.999$), a constant learning rate of $1 \times 10^{-6}$, a batch size of 40, and 30 epochs. No curriculum learning is used; global and local loss terms are optimized concurrently from initialization.
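Collected in one place, the reported hyperparameters look as follows (a plain configuration sketch; the field names are illustrative, the values are those stated above):

```python
# Hyperparameters reported for TGDT training (field names are illustrative).
TGDT_TRAIN_CONFIG = {
    "region_proposals": 36,       # Faster R-CNN top-36 boxes per image
    "region_feature_dim": 2048,   # bottom-up visual features
    "text_embedding_dim": 768,    # BERT [CLS] / word embeddings
    "optimizer": "Adam",
    "betas": (0.9, 0.999),
    "learning_rate": 1e-6,        # constant, no schedule
    "batch_size": 40,
    "epochs": 30,
    "margin_delta": 0.2,          # triplet margin in L_r
    "slack_sigma": 0.3,           # consistency slack in L_a
    "theta": 0.5,                 # global/local mixing weight at inference
}
```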

5. Two-Stage Inference Mechanism

TGDT's inference proceeds in two stages to optimize both efficiency and precision:

  1. Global Retrieval: Precompute and store all image ($i_0$) and text ($t_0$) global embeddings. For a given query, compute the cosine similarity ($S_g$) against the full candidate set and select the top-$K$ matches (typically $K = 100$).
  2. Local Re-ranking: For these $K$ candidates only, compute the local similarity score ($S_l$) and the mixed similarity score ($S_{gl}$) with $\theta = 0.5$. Re-rank candidates by $S_{gl}$ to generate the final retrieval list.
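The two stages can be sketched end-to-end with numpy. Random embeddings stand in for the precomputed encoder outputs, and the gallery size, token width, and caption length are illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, K, theta = 1000, 64, 100, 0.5

# Offline: precompute L2-normalised global embeddings for the image gallery,
# plus the local token sets (36 region tokens per gallery image).
gallery_g = rng.standard_normal((n, d))
gallery_g /= np.linalg.norm(gallery_g, axis=1, keepdims=True)
gallery_l = rng.standard_normal((n, 36, d))

def retrieve(query_g, query_l):
    # Stage 1: global retrieval -- cosine similarity over the full gallery,
    # keep only the top-K candidates.
    q = query_g / np.linalg.norm(query_g)
    S_g = gallery_g @ q
    topk = np.argsort(-S_g)[:K]

    # Stage 2: local re-ranking of the K candidates with the mixed score S_gl.
    qn = query_l / np.linalg.norm(query_l, axis=1, keepdims=True)
    scores = []
    for idx in topk:
        gn = gallery_l[idx] / np.linalg.norm(gallery_l[idx], axis=1, keepdims=True)
        M = gn @ qn.T                    # region x word cosine matrix
        S_l = M.max(axis=0).mean()       # best region per word, averaged
        scores.append((1 - theta) * S_g[idx] + theta * S_l)
    return topk[np.argsort(-np.array(scores))]

# Text query: one global embedding plus 12 word tokens.
ranking = retrieve(rng.standard_normal(d), rng.standard_normal((12, d)))
print(ranking[:5])
```

The expensive token-level matching touches only $K$ of the $n$ gallery items, which is where the reported speedup over full cross-attention models comes from.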

This approach achieves the search efficiency of global retrieval across the full dataset, while leveraging local alignment precision where it is most impactful. The result is state-of-the-art accuracy with an order-of-magnitude reduction in inference time compared to contemporary cross-attention-based fine-grained retrieval models.

6. Empirical Performance Analysis

TGDT demonstrates substantial empirical gains on standard image-text retrieval benchmarks. On Flickr30K (1K images, 5 captions each):

| Method | Text→Image R@1 | R@5 | R@10 | Image→Text R@1 | R@5 | R@10 |
| --- | --- | --- | --- | --- | --- | --- |
| TGDT-G | 55.6% | 83.1% | 89.4% | 70.3% | 91.4% | 95.5% |
| TGDT-L | 61.3% | 86.0% | 91.4% | 76.8% | 93.2% | 96.4% |
| TGDT-GL | 66.7% | 92.2% | 97.0% | 79.6% | 96.9% | 99.0% |

TGDT-GL matches or surpasses the accuracy of recent state-of-the-art approaches. On MS-COCO 1K and 5K test sets, analogous improvements are reported. Inference time for TGDT-G is reported as ≈12 s, and TGDT-GL as ≈47 s, in contrast to ≈300–650 s for other high-performing cross-attention models. This indicates TGDT-GL achieves near state-of-the-art retrieval accuracy at ~1/10–1/15 of the inference time required by alternatives.

7. Significance and Context

TGDT systematically bridges global and local cross-modal representation learning, offering a unified architecture that simultaneously leverages semantic breadth and detail. By enforcing semantic distance consistency across and within modalities under the CMC loss, TGDT mitigates trade-offs observed in prior coarse- or fine-grained retrieval paradigms. Its two-stage inference, exploiting precomputed global representations and selective local alignment, provides significant computational efficiency gains without sacrificing precision. These properties position TGDT as a reference architecture for scalable, accurate multimodal retrieval tasks and motivate further exploration of token-guided contrastive learning in cross-modal systems (Liu et al., 2023).
