TransFG: Transformer-Based FGVC

Updated 12 March 2026
  • TransFG is a transformer-based architecture for fine-grained visual classification that uses self-attention to localize discriminative object parts.
  • It integrates overlapping patch splitting, multi-head attention, and top-K patch selection to enhance part discovery and relational modeling.
  • Trained with combined cross-entropy and contrastive losses, TransFG improves accuracy over the ViT-B_16 baseline by 0.6 to 3.0 percentage points across five benchmarks.

TransFG is a transformer-based architecture for fine-grained visual classification (FGVC) that introduces a self-attention guided pipeline for discriminative part localization and relational modeling. By combining multi-layer, multi-head attention with a contrastive loss, TransFG reported state-of-the-art accuracy on most of the FGVC benchmarks it was evaluated on, while remaining architecturally simple and end-to-end trainable (He et al., 2021).

1. Baseline: Vision Transformer for Fine-Grained Recognition

TransFG builds on the Vision Transformer (ViT) framework, specifically the ViT-B_16 variant. An input image $I \in \mathbb{R}^{H \times W \times C}$ (e.g., $448 \times 448 \times 3$) is split into $N$ patches of size $P \times P$ with stride $S$. Standard ViT uses non-overlapping patches ($S = P$); TransFG uses overlapping patches with $P=16$ and $S=12$. The number of patches is $N = N_H \cdot N_W$, where $N_H = \lfloor (H-P+S)/S \rfloor$ and $N_W = \lfloor (W-P+S)/S \rfloor$. Each flattened patch $x_p^i \in \mathbb{R}^{P^2 C}$ is embedded via a linear projection into $\mathbb{R}^D$; a learned [CLS] token is prepended, and positional embeddings $E_{\text{pos}} \in \mathbb{R}^{(N+1)\times D}$ are added to the resulting sequence.
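The patch-count formula can be checked with a few lines of arithmetic. The sketch below is illustrative only; the function name `num_patches` is a hypothetical helper, not something taken from the TransFG code.

```python
def num_patches(height: int, width: int, patch: int, stride: int) -> int:
    """Number of sliding-window patches, N = N_H * N_W."""
    n_h = (height - patch + stride) // stride  # floor((H - P + S) / S)
    n_w = (width - patch + stride) // stride   # floor((W - P + S) / S)
    return n_h * n_w


# Overlapping split with P=16, S=12 on a 448x448 image.
overlapping = num_patches(448, 448, patch=16, stride=12)
# Non-overlapping split (stride equal to the patch size) for comparison.
non_overlapping = num_patches(448, 448, patch=16, stride=16)

print(overlapping, non_overlapping)  # 1369 784
```

With a stride of 12 the sequence grows from 784 to 1369 patch tokens (a 37 by 37 grid), so the finer granularity comes at the cost of a longer sequence for self-attention.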

The transformer encoder consists of $L$ stacked blocks, each with multi-head self-attention (MSA) and MLP modules, using pre-norm and residual connections. ViT-B_16 uses $L=12$ layers, hidden size $D=768$, $H=12$ attention heads (from here on, $H$ denotes the head count rather than the image height), and an MLP size of $3072$. The [CLS] token output of the final layer is passed to a linear head for classification. Baseline ViT-B_16 fine-tuned for FGVC achieves the following accuracies:

| Dataset | ViT-B_16 (%) |
|---|---|
| CUB-200-2011 | 90.3 |
| Stanford Cars | 93.7 |
| Stanford Dogs | 91.7 |
| NABirds | 89.9 |
| iNat2017 | 68.7 |
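To make the ViT-B_16 dimensions quoted above concrete, the following sketch derives the per-head width and a rough encoder parameter count from them. It assumes the standard transformer block layout (four $D \times D$ attention projections with biases, a two-layer MLP with biases, and two LayerNorms per block); `ViTConfig` and `block_params` are hypothetical names used only for illustration, and the count excludes the patch embedding, positional embeddings, and classification head.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ViTConfig:
    layers: int = 12   # L
    hidden: int = 768  # D
    heads: int = 12    # H (attention heads)
    mlp: int = 3072    # MLP hidden size


def block_params(cfg: ViTConfig) -> int:
    """Parameters in one encoder block (assumed standard layout)."""
    d, m = cfg.hidden, cfg.mlp
    attention = 4 * (d * d + d)            # Q, K, V, output projections + biases
    mlp = (d * m + m) + (m * d + d)        # two linear layers + biases
    layer_norms = 2 * (2 * d)              # scale and shift for two LayerNorms
    return attention + mlp + layer_norms


cfg = ViTConfig()
head_dim = cfg.hidden // cfg.heads
encoder_params = cfg.layers * block_params(cfg)

print(f"per-head dimension: {head_dim}")
print(f"encoder parameters: {encoder_params / 1e6:.1f}M")
```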

2. Part Selection Module: Discriminative Patch Localization

The core innovation in TransFG is the Part Selection Module (PSM), which uses the transformer's own attention weights to discover fine-grained discriminative parts.

  • Attention Extraction: At each of the first $L-1$ layers and $H$ attention heads, the attention matrix $A_l^h \in \mathbb{R}^{(N+1)\times(N+1)}$ is processed; the [CLS] token's attention to all image patches is recorded as $a_l^h \in \mathbb{R}^N$ with $(a_l^h)_i = A_l^h[0, i]$ for $i = 1, \ldots, N$ (index $0$ is the [CLS] token itself).
  • Integrated Attention Map: The model aggregates attention across all layers and heads into $A \in \mathbb{R}^N$ using $A_i = \sum_{l=1}^{L-1}\sum_{h=1}^{H} (a_l^h)_i$. This can be reshaped into an attention heatmap over spatial locations.
  • Top-K Patch Selection: The indices $\{i_1,\ldots,i_K\}$ of the largest $K$ values in $A$ identify the most discriminative patches; $K$ is set equal to the number of heads ($K=12$).
  • Local Sequence Construction: The local token sequence $Z_{\text{local}} = [Z_{L-1}^0; Z_{L-1}^{i_1};\ldots; Z_{L-1}^{i_K}]$ is formed by concatenating the [CLS] token and the selected patch tokens immediately before the final transformer layer.

A single transformer block operates on this sequence, explicitly modeling interactions among the global representation ([CLS]) and selected discriminative parts.
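The selection steps described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: `select_parts` is a hypothetical helper, attention is passed as nested lists holding only the [CLS]-to-patch weights, patches are numbered from 0 in the code (so patch `i` sits at token position `i + 1`), and string labels stand in for the $D$-dimensional token vectors. A practical version would use batched tensor operations.

```python
def select_parts(cls_attention, tokens, k):
    """Pick the k most-attended patches and build the local sequence.

    cls_attention[l][h][i]: attention from [CLS] to patch i at layer l, head h
                            (only the first L-1 layers are passed in).
    tokens: the N+1 tokens entering the final layer; tokens[0] is [CLS].
    """
    n = len(tokens) - 1
    # Integrated attention: sum over layers and heads for each patch.
    integrated = [0.0] * n
    for layer in cls_attention:
        for head in layer:
            for i, weight in enumerate(head):
                integrated[i] += weight
    # Indices of the k largest integrated attention values.
    top = sorted(range(n), key=lambda i: integrated[i], reverse=True)[:k]
    # Local sequence: [CLS] followed by the selected patch tokens.
    local = [tokens[0]] + [tokens[i + 1] for i in top]
    return integrated, top, local


# Toy example: 2 layers, 2 heads, 5 patches (made-up weights).
toy_attention = [
    [[0.1, 0.5, 0.1, 0.2, 0.1], [0.1, 0.1, 0.1, 0.6, 0.1]],
    [[0.2, 0.4, 0.1, 0.2, 0.1], [0.1, 0.2, 0.1, 0.5, 0.1]],
]
toy_tokens = ["cls", "p0", "p1", "p2", "p3", "p4"]

integrated, top, local = select_parts(toy_attention, toy_tokens, k=2)
print(top)    # [3, 1]
print(local)  # ['cls', 'p3', 'p1']
```

The returned `local` sequence is what the final transformer block would consume in place of the full token sequence.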

3. Loss Function: Contrastive Learning for Class Disambiguation

TransFG augments the cross-entropy objective with a batch-wise contrastive loss designed to separate inter-class representations:

  • For $B$ samples in a batch, let $z_i$ be the L2-normalized [CLS] features, and $y_i$ the corresponding class labels.
  • Cosine similarity is defined as $\text{Sim}(z_i, z_j) = z_i^\top z_j$.
  • The contrastive loss is

$$L_{\text{contrast}} = \frac{1}{B^2} \sum_{i=1}^{B}\left[ \sum_{j: y_j = y_i}\bigl(1 - \text{Sim}(z_i, z_j)\bigr) + \sum_{j: y_j \neq y_i}\max\bigl(\text{Sim}(z_i, z_j) - \alpha, 0\bigr) \right]$$

with margin $\alpha=0.4$.

  • The total training loss is $L = L_{\text{cross-entropy}} + L_{\text{contrast}}$.
  • Positive pairs are samples sharing a class label; negative pairs are those from different classes, with only "hard negatives" (similarity above $\alpha$) contributing.

Ablation studies indicate that the margin $\alpha=0.4$ yields the best accuracy on CUB-200-2011.
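The contrastive term can be written out directly from its definition. The sketch below is a plain-Python illustration, assuming features are given as lists of floats; `l2_normalize` and `contrastive_loss` are hypothetical helper names, and a training implementation would compute the same quantity with batched matrix operations.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def contrastive_loss(features, labels, alpha=0.4):
    """Batch-wise contrastive loss with margin alpha on negative pairs."""
    z = [l2_normalize(f) for f in features]
    b = len(z)
    total = 0.0
    for i in range(b):
        for j in range(b):
            sim = sum(p * q for p, q in zip(z[i], z[j]))  # cosine similarity
            if labels[i] == labels[j]:
                total += 1.0 - sim              # pull same-class pairs together
            else:
                total += max(sim - alpha, 0.0)  # penalize only hard negatives
    return total / (b * b)


# Toy batch: two samples of class 0 and one of class 1.
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
labels = [0, 0, 1]
loss = contrastive_loss(features, labels)
print(f"{loss:.4f}")
```

Note that the sum over same-class pairs includes $j = i$, which contributes zero because a normalized vector has similarity 1 with itself. Negative pairs whose similarity is below the margin also contribute nothing, so the gradient comes only from same-class pairs that are far apart and different-class pairs that are too close.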

4. Training Pipeline and Implementation

  • Data Augmentation: All images are resized to $448 \times 448$ (or $304 \times 304$ for iNat2017) with standard augmentations (random crop, horizontal flip, color jitter), using canonical per-channel normalization. No additional part-specific augmentation is applied.
  • Optimization: Models are trained with SGD (momentum $0.9$, batch size $16$), cosine-annealing learning rate schedule, initial LR $0.03$ ($0.003$ for Stanford Dogs, $0.01$ for iNat2017), and canonical ViT-B_16 weight decay and dropout ($0.1$ each).
  • Pretraining and Hardware: The ViT-B_16 backbone is initialized from ImageNet-21k pretrained weights. Training is performed on 4 NVIDIA Tesla V100 GPUs using PyTorch and NVIDIA Apex for mixed precision.
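For reference, a cosine-annealing schedule starting from the initial learning rate of $0.03$ can be sketched as follows. The total step count, the zero final learning rate, and the absence of warmup are assumptions made for illustration; the text above does not specify them, and `cosine_lr` is a hypothetical helper.

```python
import math


def cosine_lr(step, total_steps, base_lr=0.03, min_lr=0.0):
    """Cosine-annealed learning rate, decaying from base_lr to min_lr."""
    progress = min(step, total_steps) / total_steps
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


TOTAL_STEPS = 10_000  # assumed value, for illustration only

for step in (0, 2_500, 5_000, 7_500, 10_000):
    print(step, round(cosine_lr(step, TOTAL_STEPS), 5))
```

The rate stays near its initial value early on, falls fastest around the midpoint (where it is half the initial value), and flattens out again as training ends.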

5. Empirical Results and Benchmarking

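TransFG outperforms the ViT-B_16 baseline on all five FGVC benchmarks and exceeds the listed prior state of the art on four of them; on Stanford Cars it trails API-Net (94.8% vs. 95.3%):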

| Dataset | Prior SOTA (method/backbone) | Prior SOTA (%) | ViT-B_16 (%) | TransFG (%) |
|---|---|---|---|---|
| CUB-200-2011 | PMG/ResNet-50 | 89.6 | 90.3 | 91.7 |
| Stanford Cars | API-Net/ResNet-50 | 95.3 | 93.7 | 94.8 |
| Stanford Dogs | API-Net/DenseNet-161 | 90.3 | 91.7 | 92.3 |
| NABirds | FixSENet-154 | 89.2 | 89.9 | 90.8 |
| iNat2017 | IncResNetV2 | 67.3 | 68.7 | 71.7 |
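The margins implied by the table can be tabulated with a short script. The numbers are copied from the table above; `margins` is a hypothetical helper used only for this illustration.

```python
# (prior SOTA %, ViT-B_16 %, TransFG %) per dataset, copied from the table.
results = {
    "CUB-200-2011": (89.6, 90.3, 91.7),
    "Stanford Cars": (95.3, 93.7, 94.8),
    "Stanford Dogs": (90.3, 91.7, 92.3),
    "NABirds": (89.2, 89.9, 90.8),
    "iNat2017": (67.3, 68.7, 71.7),
}


def margins(table):
    """Return (gain over ViT-B_16, gain over prior SOTA) in percentage points."""
    return {
        name: (round(transfg - vit, 1), round(transfg - prior, 1))
        for name, (prior, vit, transfg) in table.items()
    }


gains = margins(results)
for name, (over_vit, over_prior) in gains.items():
    print(f"{name:14s} vs ViT-B_16 {over_vit:+.1f}   vs prior SOTA {over_prior:+.1f}")
```

The gain over the ViT-B_16 baseline ranges from 0.6 points (Stanford Dogs) to 3.0 points (iNat2017), and the only negative margin against prior work is on Stanford Cars.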

Ablation studies on CUB-200-2011 show incremental gains for overlapping patch splits ($+0.2\%$), addition of the Part Selection Module ($+0.5\%$), and the contrastive loss ($+0.5\%$). The default $K=H=12$ gives robust performance; empirically, $K\in[8,16]$ produces similar results.

6. Part Localization and Qualitative Insights

Integrated attention maps generated by TransFG consistently highlight genuinely discriminative object parts. Visualizations on CUB focus on the head, wings, and tail of birds; on Stanford Dogs, attention is directed to the ears, eyes, and muzzle; on Stanford Cars, discriminative areas such as headlights, grilles, and rooflines are selected. On NABirds, relevant parts are sharply distinguished from background clutter. The explicit attention-driven part selection and relational modeling are shown to enhance both interpretability and discrimination, often localizing object parts that constitute only small fractions of the image.
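To visualize such maps, the length-$N$ integrated attention vector has to be laid back onto the patch grid, and each selected patch index mapped to a pixel region. The sketch below assumes patches are ordered row by row (row-major), which is an assumption about the patch extraction order rather than something stated above; `to_heatmap` and `patch_box` are hypothetical helpers.

```python
def to_heatmap(scores, n_h, n_w):
    """Reshape a flat list of N = n_h * n_w patch scores into an n_h x n_w grid."""
    if len(scores) != n_h * n_w:
        raise ValueError("score count does not match grid size")
    return [scores[row * n_w:(row + 1) * n_w] for row in range(n_h)]


def patch_box(index, n_w, patch=16, stride=12):
    """Pixel box (top, left, bottom, right) covered by a patch, row-major order."""
    row, col = divmod(index, n_w)
    top, left = row * stride, col * stride
    return (top, left, top + patch, left + patch)


# A 448x448 image with P=16, S=12 yields a 37x37 patch grid.
grid = to_heatmap(list(range(37 * 37)), 37, 37)
print(len(grid), len(grid[0]))   # 37 37
print(patch_box(0, 37))          # (0, 0, 16, 16)
print(patch_box(1368, 37))       # (432, 432, 448, 448)
```

Each box spans only 16 by 16 pixels of a 448 by 448 input, which illustrates how small the selected regions are relative to the whole image.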

7. Architectural Summary and Significance

TransFG combines (1) overlapping patch splitting for finer granularity, (2) integrated self-attention over multiple layers and heads to generate part heatmaps, (3) explicit selection and relational modeling of top-K discriminative patches using an additional transformer layer, and (4) a batch-wise contrastive loss with a fixed margin. This design improves on the ViT-B_16 baseline by 0.6 to 3.0 percentage points across the five benchmarks and exceeds the prior CNN-based state of the art on four of them (all except Stanford Cars), without complex region proposal or multi-stage pipelines (He et al., 2021).
