TransFG: Transformer-Based FGVC
- TransFG is a transformer-based architecture for fine-grained visual classification that uses self-attention to localize discriminative object parts.
- It integrates overlapping patch splitting, multi-head attention, and top-K patch selection to enhance part discovery and relational modeling.
- By combining contrastive and cross-entropy losses, TransFG improves classification accuracy by up to 3.0 percentage points over the baseline ViT (on iNat2017) across five benchmarks.
TransFG is a transformer-based architecture for fine-grained visual classification (FGVC) that introduces a self-attention-guided pipeline for discriminative part localization and relational modeling. By leveraging multi-layer, multi-head attention and a contrastive loss, TransFG achieves state-of-the-art performance on several FGVC benchmarks while remaining architecturally simple and end-to-end trainable (He et al., 2021).
1. Baseline: Vision Transformer for Fine-Grained Recognition
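TransFG builds on the Vision Transformer (ViT) framework, specifically the ViT-B_16 variant. An input image of resolution $H \times W$ (e.g., $448 \times 448$) is split into patches of size $P$. Standard ViT uses non-overlapping patches; TransFG instead slides the patch window with stride $S < P$ so that adjacent patches overlap (in TransFG, $P = 16$, $S = 12$). The number of patches is $N = N_H \times N_W$, where $N_H = \lfloor (H - P + S)/S \rfloor$ and $N_W = \lfloor (W - P + S)/S \rfloor$. Each patch is embedded via a linear projection, a learned [CLS] token is prepended to the sequence, and positional embeddings $E_{pos}$ are added.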
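As a quick check of the patch count, the sketch below evaluates the sliding-window count per axis, $\lfloor (H - P + S)/S \rfloor$, assuming a $448 \times 448$ input with patch size 16 and stride 12. The function name `num_patches` is illustrative and not taken from the paper's code.

```python
def num_patches(height, width, patch=16, stride=12):
    """Count sliding-window patches: floor((dim - patch + stride) / stride) per axis."""
    n_h = (height - patch + stride) // stride
    n_w = (width - patch + stride) // stride
    return n_h * n_w


overlapping = num_patches(448, 448, patch=16, stride=12)      # 37 * 37 patches
non_overlapping = num_patches(448, 448, patch=16, stride=16)  # 28 * 28 patches
print(overlapping, non_overlapping)  # prints "1369 784"
```

Under these assumptions, overlapping splitting lengthens the token sequence from 784 to 1369 patches (plus [CLS]), so the finer granularity comes at a higher self-attention cost.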
The transformer encoder consists of $L$ stacked blocks, each with multi-head self-attention (MSA) and MLP modules, using pre-norm and residual connections. For ViT-B_16, $L = 12$ layers, hidden size $D = 768$, $12$ heads, and an MLP size of $3072$ are typical. The final [CLS] token output after $L$ layers is used for global classification via a linear head. Baseline ViT-B_16 fine-tuned for FGVC achieves the following accuracies:
| Dataset | ViT-B_16 (%) |
|---|---|
| CUB-200-2011 | 90.3 |
| Stanford Cars | 93.7 |
| Stanford Dogs | 91.7 |
| NABirds | 89.9 |
| iNat2017 | 68.7 |
2. Part Selection Module: Discriminative Patch Localization
The core innovation in TransFG is the Part Selection Module (PSM), which operationalizes transformer attention weights for fine-grained part discovery.
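- Attention Extraction: At each of the first $L-1$ layers and each of the $K$ attention heads, the attention matrix is processed; the [CLS] token’s attention to all image patches in layer $l$ and head $i$ is recorded as $a_l^i$, with $l \in \{1, \dots, L-1\}$ and $i \in \{1, \dots, K\}$.
- Integrated Attention Map: The model aggregates attention across all of these layers into $a_{final}$ using $a_{final} = \prod_{l=1}^{L-1} a_l$, where $a_l = [a_l^1, \dots, a_l^K]$ collects the heads of layer $l$. This can be reshaped into an attention heatmap over spatial locations.
- Top-K Patch Selection: The indices $A_1, \dots, A_K$ of the $K$ largest values in $a_{final}$ identify the most discriminative patches; $K$ is set equal to the number of heads ($K = 12$).
- Local Sequence Construction: The local token sequence $z_{local} = [z_{L-1}^0; z_{L-1}^{A_1}; \dots; z_{L-1}^{A_K}]$ is formed by concatenating the [CLS] token $z_{L-1}^0$ and the selected patch tokens immediately before the final transformer layer.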
A single transformer block operates on this sequence, explicitly modeling interactions among the global representation ([CLS]) and selected discriminative parts.
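The selection steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: attention is integrated per head by chaining matrix products over layers (with later layers multiplied on the left), and one patch index is taken per head, which yields as many indices as there are heads. The helper names `matmul`, `select_part_indices`, and `build_local_sequence` are hypothetical.

```python
def matmul(a, b):
    """Multiply two square matrices stored as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def select_part_indices(attn):
    """Pick one patch token index per head from integrated attention.

    attn[l][h] is the (N+1) x (N+1) attention matrix of layer l, head h,
    covering the first L-1 layers. Token 0 is [CLS].
    """
    num_heads = len(attn[0])
    selected = []
    for h in range(num_heads):
        # Integrate attention over layers by chaining matrix products
        # (later layers multiply on the left; this ordering is an assumption).
        joint = attn[0][h]
        for layer in attn[1:]:
            joint = matmul(layer[h], joint)
        cls_to_patches = joint[0][1:]  # [CLS] row without the [CLS] column
        best = max(range(len(cls_to_patches)), key=cls_to_patches.__getitem__)
        selected.append(best + 1)  # shift back to token indexing
    return selected


def build_local_sequence(tokens, selected):
    """Concatenate the [CLS] token with the selected patch tokens."""
    return [tokens[0]] + [tokens[i] for i in selected]


# Toy example: 2 layers, 2 heads, [CLS] plus 2 patches.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
head_a = [[0.1, 0.2, 0.7], [0.3, 0.4, 0.3], [0.2, 0.2, 0.6]]
head_b = [[0.1, 0.8, 0.1], [0.5, 0.25, 0.25], [0.3, 0.3, 0.4]]
attn = [[identity, identity], [head_a, head_b]]
tokens = ["cls", "patch_1", "patch_2"]

selected = select_part_indices(attn)
local_sequence = build_local_sequence(tokens, selected)
print(selected, local_sequence)
```

In the toy example each head attends most strongly to a different patch, so the local sequence holds [CLS] followed by both patches. With per-head selection, two heads may pick the same patch, in which case that token appears twice.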
3. Loss Function: Contrastive Learning for Class Disambiguation
TransFG augments the cross-entropy objective with a batch-wise contrastive loss designed to separate inter-class representations:
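- For $B$ samples in a batch, let $z_i$ be the L2-normalized [CLS] features, and $y_i$ the corresponding class labels.
- Cosine similarity is defined as $\mathrm{Sim}(z_i, z_j) = z_i^\top z_j$, which equals the cosine because the features are normalized.
- The contrastive loss is
$$\mathcal{L}_{con} = \frac{1}{B^2} \sum_{i=1}^{B} \left[ \sum_{j:\, y_i = y_j} \bigl(1 - \mathrm{Sim}(z_i, z_j)\bigr) + \sum_{j:\, y_i \neq y_j} \max\bigl(\mathrm{Sim}(z_i, z_j) - \alpha,\, 0\bigr) \right]$$
with margin $\alpha = 0.4$.
- The total training loss is $\mathcal{L} = \mathcal{L}_{cross}(y, y') + \mathcal{L}_{con}(z)$, where $y'$ denotes the predicted labels.
- Positive pairs are samples sharing a class label; negative pairs are those from different classes, with only "hard negatives" (similarity above $\alpha$) contributing.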
Ablation studies indicate that the margin $\alpha = 0.4$ yields the best accuracy on CUB-200-2011.
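A minimal pure-Python sketch of the batch-wise contrastive term follows. It assumes the loss averages, over all $B^2$ pairs, $1 - \mathrm{Sim}$ for same-class pairs and $\max(\mathrm{Sim} - \alpha, 0)$ for different-class pairs, with a margin of 0.4. The function names are illustrative; a real training loop would compute this on tensors in a deep learning framework.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def contrastive_loss(features, labels, alpha=0.4):
    """Batch-wise contrastive loss over all ordered pairs (i, j)."""
    z = [l2_normalize(f) for f in features]
    batch = len(z)
    total = 0.0
    for i in range(batch):
        for j in range(batch):
            sim = sum(a * b for a, b in zip(z[i], z[j]))  # cosine similarity
            if labels[i] == labels[j]:
                total += 1.0 - sim             # pull same-class features together
            else:
                total += max(sim - alpha, 0.0)  # penalize only hard negatives
    return total / (batch * batch)


# Two identical features with different labels: each cross pair has
# similarity 1 and contributes 1 - 0.4 = 0.6, giving (0.6 + 0.6) / 4.
collapsed = contrastive_loss([[1.0, 0.0], [1.0, 0.0]], [0, 1])
# Orthogonal features for different classes fall below the margin.
separated = contrastive_loss([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]], [0, 0, 1])
print(collapsed, separated)
```

The margin means that negatives whose similarity is already below $\alpha$ contribute nothing, so easy negatives do not dominate the gradient.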
4. Training Pipeline and Implementation
- Data Augmentation: All images are resized to $448 \times 448$ (or $304 \times 304$ for iNat2017) with standard augmentations (random crop, horizontal flip, color jitter), using canonical per-channel normalization. No additional part-specific augmentation is applied.
- Optimization: Models are trained with SGD (momentum $0.9$, batch size $16$), cosine-annealing learning rate schedule, initial LR $0.03$ ($0.003$ for Stanford Dogs, $0.01$ for iNat2017), and canonical ViT-B_16 weight decay and dropout ($0.1$ each).
- Pretraining and Hardware: The ViT-B_16 backbone is initialized from ImageNet-21k pretrained weights. Training is performed on 4 NVIDIA Tesla V100 GPUs using PyTorch and NVIDIA Apex for mixed precision.
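The learning-rate settings above can be illustrated with a small cosine-annealing sketch. It assumes the plain schedule $\eta_t = \tfrac{1}{2}\eta_0\,(1 + \cos(\pi t / T))$ without warmup; the total step count is a hypothetical argument, since the text does not state it, and `BASE_LR` and `cosine_lr` are illustrative names.

```python
import math

# Initial learning rates as listed above; other datasets use the default.
BASE_LR = {"default": 0.03, "Stanford Dogs": 0.003, "iNat2017": 0.01}


def cosine_lr(step, total_steps, dataset="default"):
    """Cosine-annealed learning rate at a given step (no warmup assumed)."""
    base = BASE_LR.get(dataset, BASE_LR["default"])
    return 0.5 * base * (1.0 + math.cos(math.pi * step / total_steps))


# total_steps=100 is a placeholder value chosen only for illustration.
for step in (0, 50, 100):
    print(step, cosine_lr(step, total_steps=100, dataset="Stanford Dogs"))
```

The rate starts at the dataset's initial value, passes through half of it at the midpoint, and decays to zero at the final step.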
5. Empirical Results and Benchmarking
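TransFG outperforms the ViT-B_16 baseline on all five FGVC benchmarks and exceeds the listed prior state of the art on four of them; on Stanford Cars, API-Net with ResNet-50 remains higher (95.3 vs. 94.8):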
| Dataset | Prior SOTA (%) | ViT-B_16 (%) | TransFG (%) |
|---|---|---|---|
| CUB-200-2011 | PMG/ResNet-50: 89.6 | 90.3 | 91.7 |
| Stanford Cars | API-Net/ResNet-50: 95.3 | 93.7 | 94.8 |
| Stanford Dogs | API-Net/DenseNet-161: 90.3 | 91.7 | 92.3 |
| NABirds | FixSENet-154: 89.2 | 89.9 | 90.8 |
| iNat2017 | IncResNetV2: 67.3 | 68.7 | 71.7 |
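The margins implied by the table can be tabulated directly. The snippet below copies the table's numbers and computes the gain of TransFG over ViT-B_16 and over the listed prior result for each dataset; the variable names are illustrative.

```python
# (prior SOTA, ViT-B_16, TransFG) top-1 accuracy in percent, copied from the table.
RESULTS = {
    "CUB-200-2011": (89.6, 90.3, 91.7),
    "Stanford Cars": (95.3, 93.7, 94.8),
    "Stanford Dogs": (90.3, 91.7, 92.3),
    "NABirds": (89.2, 89.9, 90.8),
    "iNat2017": (67.3, 68.7, 71.7),
}

# Differences in percentage points, rounded to one decimal place.
gain_over_vit = {name: round(t - v, 1) for name, (_, v, t) in RESULTS.items()}
gain_over_prior = {name: round(t - p, 1) for name, (p, _, t) in RESULTS.items()}

for name in RESULTS:
    print(f"{name}: {gain_over_vit[name]:+.1f} vs ViT, "
          f"{gain_over_prior[name]:+.1f} vs prior SOTA")
```

The gains over ViT-B_16 range from 0.6 points (Stanford Dogs) to 3.0 points (iNat2017), and Stanford Cars is the only dataset where the listed prior result remains ahead.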
Ablation studies on CUB-200-2011 show incremental gains from overlapping patch splitting, the addition of the Part Selection Module, and the contrastive loss. The default settings give robust performance, and nearby hyperparameter values produce similar results.
6. Part Localization and Qualitative Insights
Integrated attention maps generated by TransFG consistently highlight genuinely discriminative object parts. Visualizations on CUB focus on the head, wings, and tail of birds; on Stanford Dogs, attention is directed to the ears, eyes, and muzzle; on Stanford Cars, discriminative areas such as headlights, grilles, and rooflines are selected. On NABirds, relevant parts are sharply distinguished from background clutter. The explicit attention-driven part selection and relational modeling are shown to enhance both interpretability and discrimination, often localizing object parts that constitute only small fractions of the image.
7. Architectural Summary and Significance
TransFG combines (1) overlapping patch splitting for finer granularity, (2) integrated self-attention over multiple layers and heads to generate part heatmaps, (3) explicit selection and relational modeling of top-K discriminative patches using an additional transformer layer, and (4) a batch-wise contrastive loss with a fixed margin. This design yields improvements of 0.6 to 3.0 percentage points over baseline ViT across the five benchmarks and surpasses the listed prior CNN-based SOTA on four of them, without complex region proposal or multi-stage pipelines (He et al., 2021).