TransFG: Transformer-Based FGVC

Updated 12 March 2026

TransFG is a transformer-based architecture for fine-grained visual classification that uses self-attention to localize discriminative object parts.
It integrates overlapping patch splitting, multi-head attention, and top-K patch selection to enhance part discovery and relational modeling.
By combining contrastive and cross-entropy losses, TransFG improves classification accuracy by up to 3.0 percentage points over a baseline ViT on multiple benchmarks.

TransFG is a transformer-based architecture for fine-grained visual classification (FGVC) that introduces a self-attention-guided pipeline for discriminative part localization and relational modeling. By leveraging multi-layer, multi-head attention and a contrastive loss strategy, TransFG exceeds the prior state of the art on four of the five FGVC benchmarks reported below while maintaining architectural simplicity and end-to-end trainability (He et al., 2021).

1. Baseline: Vision Transformer for Fine-Grained Recognition

TransFG builds on the Vision Transformer (ViT) framework, specifically the ViT-B_16 variant. An input image $I \in \mathbb{R}^{H \times W \times C}$ (e.g., $448 \times 448 \times 3$) is split into $N$ patches of size $P \times P$ with a sliding window of stride $S$. The original ViT uses non-overlapping patches ($S=P$); TransFG uses overlapping patches with $P=16$, $S=12$. The number of patches is $N = N_H \cdot N_W$, where $N_H = \lfloor (H-P+S)/S \rfloor$ and $N_W = \lfloor (W-P+S)/S \rfloor$. Each flattened patch $x_p^i \in \mathbb{R}^{P^2 C}$ is embedded via a linear projection into $\mathbb{R}^D$; a learned [CLS] token is prepended to the sequence, and positional encodings $E_{\text{pos}} \in \mathbb{R}^{(N+1)\times D}$ are added.
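
As a quick check of the patch-count formula, the following minimal sketch evaluates $N_H$, $N_W$, and $N$ for the settings quoted above. The function name `patch_grid` is illustrative and not taken from the TransFG code base.

```python
def patch_grid(height, width, patch=16, stride=12):
    """Return (N_H, N_W, N) for sliding-window patch splitting.

    Implements N_H = floor((H - P + S) / S) and N_W = floor((W - P + S) / S).
    """
    n_h = (height - patch + stride) // stride
    n_w = (width - patch + stride) // stride
    return n_h, n_w, n_h * n_w


# Overlapping split with P=16, S=12 on a 448x448 image.
print(patch_grid(448, 448))             # (37, 37, 1369)
# Non-overlapping split (S = P = 16) on the same image.
print(patch_grid(448, 448, stride=16))  # (28, 28, 784)
```

With the overlapping split, the encoder therefore sees 1369 patch tokens plus the [CLS] token instead of 784 plus [CLS], which is where the additional compute of the overlapping variant comes from.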

The transformer encoder consists of $L$ stacked blocks, each with multi-head self-attention (MSA) and MLP modules, using pre-norm and residual connections. For ViT-B_16, $L=12$ layers, hidden size $D=768$, $H=12$ attention heads (from here on $H$ denotes the number of heads, not the image height), and an MLP size of $3072$ are typical. The final [CLS] token output after $L$ layers is used for global classification via a linear head. Baseline ViT-B_16 fine-tuned for FGVC achieves the following accuracies:

Dataset	ViT-B_16 (%)
CUB-200-2011	90.3
Stanford Cars	93.7
Stanford Dogs	91.7
NABirds	89.9
iNat2017	68.7

2. Part Selection Module: Discriminative Patch Localization

The core innovation in TransFG is the Part Selection Module (PSM), which uses the transformer's own attention weights to discover fine-grained parts.

Attention Extraction: At each of the first $L-1$ layers and each of the $H$ attention heads, the [CLS] row of the attention matrix $A_l^h \in \mathbb{R}^{(N+1)\times(N+1)}$ is extracted, giving the [CLS] token's attention to every image patch: $a_l^h \in \mathbb{R}^N$ with $(a_l^h)_i = A_l^h[0, i]$ for $i = 1,\ldots,N$.
Integrated Attention Map: The model aggregates attention across all layers and heads into $A \in \mathbb{R}^N$ using $A_i = \sum_{l=1}^{L-1}\sum_{h=1}^{H} (a_l^h)_i$. This can be reshaped into an $N_H \times N_W$ attention heatmap over spatial locations.
Top-K Patch Selection: The indices $\{i_1,\ldots,i_K\}$ of the largest $K$ values in $A$ identify the most discriminative patches; $K$ is set equal to the number of heads ($K=12$).
Local Sequence Construction: The local token sequence $Z_{\text{local}} = [Z_{L-1}^0; Z_{L-1}^{i_1};\ldots; Z_{L-1}^{i_K}]$ is formed by concatenating the [CLS] token and the selected patch tokens immediately before the final transformer layer.

The final transformer block operates on this shortened sequence of $K+1$ tokens, explicitly modeling interactions among the global representation ([CLS]) and the selected discriminative parts.
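
The four steps above can be sketched in a few lines of plain Python. This is a minimal illustration of the summation-based formulation given in this section, not the authors' implementation: the function `select_parts`, its argument layout, and the toy numbers are assumptions made for the example. Patch indices are 0-based here, so patch `i` corresponds to token `i + 1` in the sequence.

```python
import heapq


def select_parts(cls_attention, tokens, k):
    """Pick the top-k patches by summed [CLS] attention and build Z_local.

    cls_attention: nested list indexed [layer][head][patch]; each innermost
        list is the [CLS] row of one attention matrix with the
        [CLS]-to-[CLS] entry already dropped (hypothetical layout).
    tokens: hidden states entering the final block; tokens[0] is [CLS]
        and tokens[i + 1] belongs to patch i.
    """
    num_patches = len(tokens) - 1
    # Integrated attention map: sum over all layers and heads.
    integrated = [0.0] * num_patches
    for layer in cls_attention:
        for head in layer:
            for i, weight in enumerate(head):
                integrated[i] += weight
    # Indices of the k largest integrated attention values, largest first.
    top = heapq.nlargest(k, range(num_patches), key=integrated.__getitem__)
    # Local sequence: [CLS] followed by the selected patch tokens.
    local_sequence = [tokens[0]] + [tokens[i + 1] for i in top]
    return integrated, top, local_sequence


# Toy input: 2 layers, 2 heads, 4 patches; tokens are labels for readability.
attn = [
    [[0.1, 0.4, 0.3, 0.2], [0.2, 0.2, 0.5, 0.1]],
    [[0.1, 0.6, 0.2, 0.1], [0.3, 0.1, 0.4, 0.2]],
]
tokens = ["cls", "p0", "p1", "p2", "p3"]
integrated, top, local_sequence = select_parts(attn, tokens, k=2)
print(top)             # [2, 1]
print(local_sequence)  # ['cls', 'p2', 'p1']
```

In a real model the tokens would be $D$-dimensional vectors and the resulting sequence of $K+1$ tokens would be passed to the final transformer block.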

3. Loss Function: Contrastive Learning for Class Disambiguation

TransFG augments the cross-entropy objective with a batch-wise contrastive loss designed to pull same-class representations together and push different-class representations apart:

For $B$ samples in a batch, let $z_i$ be the L2-normalized [CLS] features, and $y_i$ the corresponding class labels.
Cosine similarity is defined as $\text{Sim}(z_i, z_j) = z_i^\top z_j$.
The contrastive loss is

$L_{\text{contrast}} = \frac{1}{B^2} \sum_{i=1}^B\left[ \sum_{j: y_j = y_i}(1 - \text{Sim}(z_i, z_j)) + \sum_{j: y_j \neq y_i}\max(\text{Sim}(z_i, z_j) - \alpha, 0) \right]$

with margin $\alpha=0.4$.

The total training loss is $L_{\text{total}} = L_{\text{cross-entropy}} + L_{\text{contrast}}$ (written with a subscript to avoid confusion with the number of layers $L$).
Positive pairs are samples sharing a class label; negative pairs are those from different classes, with only "hard negatives" (similarity above $\alpha$) contributing.
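
The contrastive term can be written out directly from the formula above. The following is a minimal pure-Python sketch for small batches; the helper names `l2_normalize` and `contrastive_loss` are illustrative, and a practical implementation would use batched tensor operations instead of nested loops.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def contrastive_loss(features, labels, alpha=0.4):
    """Batch-wise contrastive loss with margin alpha.

    Same-class pairs contribute 1 - Sim(z_i, z_j); different-class pairs
    contribute max(Sim(z_i, z_j) - alpha, 0). The sum is divided by B^2.
    """
    z = [l2_normalize(f) for f in features]
    b = len(z)
    total = 0.0
    for i in range(b):
        for j in range(b):
            sim = sum(p * q for p, q in zip(z[i], z[j]))
            if labels[i] == labels[j]:
                total += 1.0 - sim
            else:
                total += max(sim - alpha, 0.0)
    return total / (b * b)


# Toy batch: two orthogonal samples of class 0 and one sample of class 1.
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
labels = [0, 0, 1]
loss = contrastive_loss(features, labels)
print(round(loss, 4))  # 0.3587
```

In this toy batch the two class-0 samples are orthogonal, so each ordered positive pair contributes 1, while each of the four ordered negative pairs has similarity of about 0.707 and contributes about 0.307 after the margin is subtracted. Negatives with similarity at or below $\alpha$ contribute nothing.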

Ablation studies indicate that the margin $\alpha=0.4$ yields the best accuracy on CUB-200-2011.

4. Training Pipeline and Implementation

Data Augmentation: All images are resized to $448 \times 448$ (or $304 \times 304$ for iNat2017) with standard augmentations (random crop, horizontal flip, color jitter), using canonical per-channel normalization. No additional part-specific augmentation is applied.
Optimization: Models are trained with SGD (momentum $0.9$, batch size $16$), cosine-annealing learning rate schedule, initial LR $0.03$ ($0.003$ for Stanford Dogs, $0.01$ for iNat2017), and canonical ViT-B_16 weight decay and dropout ($0.1$ each).
Pretraining and Hardware: The ViT-B_16 backbone is initialized from ImageNet-21k pretrained weights. Training is performed on 4 NVIDIA Tesla V100 GPUs using PyTorch and NVIDIA Apex for mixed precision.
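
To make the learning-rate settings concrete, here is a minimal sketch of a standard cosine-annealing schedule applied to the initial rates listed above. The total number of training steps is not stated in this section, so `TOTAL_STEPS` is a hypothetical placeholder, and the sketch omits any warmup phase; the function name `cosine_lr` is likewise illustrative.

```python
import math

# Initial learning rates from the text; "default" covers the other datasets.
BASE_LR = {"default": 0.03, "Stanford Dogs": 0.003, "iNat2017": 0.01}

# Hypothetical schedule length; the section does not state the step count.
TOTAL_STEPS = 10_000


def cosine_lr(step, total_steps=TOTAL_STEPS, base_lr=0.03, min_lr=0.0):
    """Cosine annealing from base_lr at step 0 down to min_lr at total_steps."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


for step in (0, TOTAL_STEPS // 2, TOTAL_STEPS):
    print(step, cosine_lr(step, base_lr=BASE_LR["default"]))
```

The rate starts at the initial value, reaches half of it at the midpoint of training, and decays to the minimum at the end.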

5. Empirical Results and Benchmarking

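TransFG improves on the ViT-B_16 baseline on all five FGVC benchmarks and exceeds the prior state of the art on four of them; on Stanford Cars it remains below API-Net (94.8% vs. 95.3%):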

Dataset	Prior SOTA (%)	ViT-B_16 (%)	TransFG (%)
CUB-200-2011	PMG/ResNet-50: 89.6	90.3	91.7
Stanford Cars	API-Net/ResNet-50: 95.3	93.7	94.8
Stanford Dogs	API-Net/DenseNet-161: 90.3	91.7	92.3
NABirds	FixSENet-154: 89.2	89.9	90.8
iNat2017	IncResNetV2: 67.3	68.7	71.7
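
The margins implied by the table can be recomputed with a short script. The numbers are copied from the rows above; the dictionary layout and the function name `gains` are only illustrative.

```python
# Accuracies (%) from the table: (prior SOTA, ViT-B_16, TransFG).
RESULTS = {
    "CUB-200-2011": (89.6, 90.3, 91.7),
    "Stanford Cars": (95.3, 93.7, 94.8),
    "Stanford Dogs": (90.3, 91.7, 92.3),
    "NABirds": (89.2, 89.9, 90.8),
    "iNat2017": (67.3, 68.7, 71.7),
}


def gains(results):
    """Return {dataset: (gain over ViT-B_16, gain over prior SOTA)} in points."""
    return {
        name: (round(transfg - vit, 1), round(transfg - prior, 1))
        for name, (prior, vit, transfg) in results.items()
    }


GAINS = gains(RESULTS)
for name, (over_vit, over_prior) in GAINS.items():
    print(f"{name:<14} vs ViT-B_16: {over_vit:+.1f}   vs prior SOTA: {over_prior:+.1f}")
```

The gain over ViT-B_16 ranges from 0.6 points (Stanford Dogs) to 3.0 points (iNat2017), and the only negative margin against prior work is the 0.5-point gap on Stanford Cars.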

Ablation studies on CUB-200-2011 show incremental gains for overlapping patch splits ($+0.2\%$), addition of the Part Selection Module ($+0.5\%$), and the contrastive loss ($+0.5\%$). The default $K=H=12$ gives robust performance; empirically, $K\in[8,16]$ produces similar results.

6. Part Localization and Qualitative Insights

Integrated attention maps generated by TransFG consistently highlight genuinely discriminative object parts. Visualizations on CUB focus on the head, wings, and tail of birds; on Stanford Dogs, attention is directed to the ears, eyes, and muzzle; on Stanford Cars, discriminative areas such as headlights, grilles, and rooflines are selected. On NABirds, relevant parts are sharply distinguished from background clutter. The explicit attention-driven part selection and relational modeling are shown to enhance both interpretability and discrimination, often localizing object parts that constitute only small fractions of the image.

7. Architectural Summary and Significance

TransFG combines (1) overlapping patch splitting for finer granularity, (2) integration of [CLS] self-attention over multiple layers and heads to generate part heatmaps, (3) explicit selection of the top-K discriminative patches, whose relations are modeled by the final transformer layer, and (4) a batch-wise contrastive loss with a fixed margin. On the five benchmarks reported above, this design improves on baseline ViT-B_16 by $0.6$ to $3.0$ percentage points and exceeds the prior CNN-based SOTA on every dataset except Stanford Cars, without complex region proposal or multi-stage pipelines (He et al., 2021).

References (1)

TransFG: A Transformer Architecture for Fine-grained Recognition (2021)
