H3Former: Hypergraph-Based Visual Classification
- The paper introduces H3Former, a framework that leverages hypergraph-based semantic token aggregation and a hyperbolic hierarchical contrastive loss (HHCL) to distinguish subtle inter-class differences in fine-grained visual tasks.
- It employs a Swin-B Transformer backbone with multi-scale context generation and a Semantic-Aware Aggregation Module (SAAM) to capture high-order semantic dependencies from regional features.
- Experimental evaluations on benchmark FGVC datasets demonstrate that integrating SAAM and HHCL significantly boosts classification accuracy, outperforming existing approaches.
H3Former is a framework designed for fine-grained visual classification (FGVC), targeting the challenge of distinguishing subtle inter-class differences and managing large intra-class variation. The architecture introduces hypergraph-based semantic-aware token aggregation using multi-scale context and incorporates a hyperbolic hierarchical contrastive loss to enforce semantic structure during representation learning. H3Former achieves state-of-the-art classification performance on multiple benchmark FGVC datasets using a single Transformer backbone and standard augmentation protocols (Zhang et al., 13 Nov 2025).
1. Architectural Overview
H3Former operates on input images resized to 448×448, utilizing a Swin-B Transformer backbone pretrained on ImageNet (22K for CUB/NA-Birds/Flowers, 1K for Dogs). Feature tokens are extracted at each of the four hierarchically organized Swin stages:

$$X_s \in \mathbb{R}^{N_s \times C_s}, \qquad s = 1, \dots, 4,$$

where $N_s$ and $C_s$ are the number of spatial patches and the channel dimension at stage $s$, respectively.
The dataflow proceeds as follows:
- Multi-scale Context Generation Module (CGM): Each $X_s$ is aggregated into three context vectors:
- $c_s^{(1)}$, $c_s^{(2)}$, $c_s^{(3)}$, where $c_s^{(3)}$ is derived from self-attention.
- The twelve vectors (4 stages × 3 stats) are concatenated into a context matrix $Z$; a minimal sketch of the CGM follows this list.
- $Z$ is input to the Semantic-Aware Aggregation Module (SAAM), which builds and convolves a multi-scale weighted hypergraph over token features to yield region-level aggregates.
- A gated residual fusion combines the original and hypergraph-refined tokens, followed by global pooling and a classifier.
- Simultaneously, M region features are extracted for construction of a semantic hierarchy and computation of hyperbolic hierarchical contrastive loss (HHCL).
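A minimal PyTorch sketch of the CGM, assuming average, max, and attention-weighted pooling as the three per-stage statistics (the text specifies only that one statistic derives from self-attention); all module and variable names are illustrative:

```python
# Illustrative CGM sketch: three pooled statistics per Swin stage,
# projected to a common width and stacked into a 12-vector context matrix.
# The avg/max pooling choice and all names are assumptions.
import torch
import torch.nn as nn

class ContextGeneration(nn.Module):
    def __init__(self, stage_dims=(128, 256, 512, 1024), out_dim=1024):
        super().__init__()
        self.proj = nn.ModuleList(nn.Linear(d, out_dim) for d in stage_dims)
        self.attn = nn.ModuleList(nn.Linear(d, 1) for d in stage_dims)

    def forward(self, stage_tokens):
        # stage_tokens: list of 4 tensors, each (B, N_s, C_s)
        ctx = []
        for X, proj, attn in zip(stage_tokens, self.proj, self.attn):
            c1 = X.mean(dim=1)                      # average-pooled statistic
            c2 = X.max(dim=1).values                # max-pooled statistic
            w = torch.softmax(attn(X), dim=1)       # (B, N_s, 1) attention weights
            c3 = (w * X).sum(dim=1)                 # self-attention-derived statistic
            ctx += [proj(c) for c in (c1, c2, c3)]
        return torch.stack(ctx, dim=1)              # Z: (B, 12, out_dim)

# Swin-B token counts for a 448x448 input: 112^2, 56^2, 28^2, 14^2
tokens = [torch.randn(2, n * n, c)
          for n, c in [(112, 128), (56, 256), (28, 512), (14, 1024)]]
print(ContextGeneration()(tokens).shape)  # torch.Size([2, 12, 1024])
```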
2. Semantic-Aware Aggregation Module (SAAM)
2.1 Multi-scale Weighted Hypergraph Construction
The context matrix $Z$ is split into $G$ groups along the channel axis: $Z_g$ for $g = 1, \dots, G$. Each group is mapped by a shared MLP $\phi(\cdot)$, then combined with a learnable prototype $p_g$ to produce hyperedge prototypes (one per group, so $M = G$):

$$e_g = \phi(Z_g) + p_g, \qquad g = 1, \dots, M.$$

Final-stage tokens $X_4$ are projected to $\tilde{X} = X_4 W_p$ using a learned matrix $W_p$, and the token-hyperedge affinity is softmax-normalized over hyperedges:

$$A = \operatorname{softmax}\big(\tilde{X} E^{\top}\big) \in \mathbb{R}^{N_4 \times M}, \qquad E = [e_1; \dots; e_M].$$
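A sketch of the prototype construction and affinity computation; the MLP width, the projection dimension, and the one-edge-per-group assumption ($M = G$) are placeholders rather than the authors' exact design:

```python
import torch
import torch.nn as nn

class HyperedgePrototypes(nn.Module):
    # Assumes one hyperedge per context group (M = G); all names illustrative.
    def __init__(self, ctx_dim=1024, num_edges=12, token_dim=1024, proto_dim=256):
        super().__init__()
        self.mlp = nn.Sequential(                                     # shared MLP phi
            nn.Linear(ctx_dim, proto_dim), nn.GELU(), nn.Linear(proto_dim, proto_dim))
        self.base = nn.Parameter(torch.randn(num_edges, proto_dim))   # learnable prototypes p_g
        self.W_p = nn.Linear(token_dim, proto_dim, bias=False)        # token projection

    def forward(self, Z, X4):
        # Z: (B, G, ctx_dim) grouped context; X4: (B, N, token_dim) final-stage tokens
        E = self.mlp(Z) + self.base                        # (B, M, proto_dim) prototypes e_g
        T = self.W_p(X4)                                   # (B, N, proto_dim) projected tokens
        A = torch.softmax(T @ E.transpose(1, 2), dim=-1)   # (B, N, M) affinities
        return E, T, A

proto = HyperedgePrototypes()
E, T, A = proto(torch.randn(2, 12, 1024), torch.randn(2, 196, 1024))
print(A.shape, A.sum(-1)[0, 0])  # each token's affinities sum to 1
```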
2.2 Hypergraph Convolution and Residual Fusion
The hypergraph convolution propagates information via vertex-edge-vertex (V→E→V) message passing (a sketch follows this list):
- Aggregate to hyperedges: $H = A^{\top} \tilde{X}$, an affinity-weighted pooling of token features into each hyperedge.
- Propagate back to nodes: $\hat{X} = A H$.
- Residual combination with a learned gate $g$: $X' = g \odot \hat{X} + (1 - g) \odot \tilde{X}$.
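A sketch of the V→E→V step with the gated residual, continuing the notation above; the sigmoid-gate parameterization is an assumption:

```python
# V -> E -> V hypergraph convolution with gated residual fusion.
import torch
import torch.nn as nn

class HypergraphConv(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.gate = nn.Linear(2 * dim, dim)  # learned gate g from [token, message]

    def forward(self, T, A):
        # T: (B, N, dim) projected tokens; A: (B, N, M) affinities
        # V -> E: hyperedge features as affinity-weighted averages of tokens
        H = A.transpose(1, 2) @ T                       # (B, M, dim)
        H = H / (A.sum(dim=1).unsqueeze(-1) + 1e-6)     # normalize per hyperedge
        # E -> V: send hyperedge features back to the tokens
        msg = A @ H                                     # (B, N, dim)
        # Gated residual fusion of original and hypergraph-refined tokens
        g = torch.sigmoid(self.gate(torch.cat([T, msg], dim=-1)))
        return g * msg + (1 - g) * T                    # (B, N, dim)

T = torch.randn(2, 196, 256)
A = torch.softmax(torch.randn(2, 196, 12), dim=-1)
print(HypergraphConv()(T, A).shape)  # torch.Size([2, 196, 256])
```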
This mechanism enables high-order semantic dependencies to be captured across spatial regions for region-level discrimination.
3. Hyperbolic Hierarchical Contrastive Loss (HHCL)
3.1 Hyperbolic Embedding via the Lorentz Model
Region features $z \in \mathbb{R}^{d}$ (from SAAM) are embedded on the $d$-dimensional hyperboloid:

$$\mathbb{L}^{d} = \left\{ x \in \mathbb{R}^{d+1} : \langle x, x \rangle_{\mathcal{L}} = -1/c,\; x_0 > 0 \right\},$$

with Lorentzian inner product $\langle x, y \rangle_{\mathcal{L}} = -x_0 y_0 + \sum_{i=1}^{d} x_i y_i$. The exponential map from Euclidean $z$, lifted to the tangent vector $v = (0, z)$ at the origin $o = (1/\sqrt{c}, 0, \dots, 0)$, is:

$$\exp_{o}(v) = \cosh\!\big(\sqrt{c}\,\lVert z \rVert\big)\, o + \sinh\!\big(\sqrt{c}\,\lVert z \rVert\big)\, \frac{v}{\sqrt{c}\,\lVert z \rVert}.$$

The hyperbolic distance is defined as $d_{\mathcal{L}}(x, y) = \frac{1}{\sqrt{c}} \operatorname{arccosh}\!\big({-c}\, \langle x, y \rangle_{\mathcal{L}}\big)$.
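A minimal sketch of these Lorentz-model operations, with the curvature value as a placeholder:

```python
# Lorentz-model primitives used by HHCL: exponential map at the origin
# and the hyperbolic distance. c = 1.0 is a placeholder curvature.
import torch

def lorentz_inner(x, y):
    # <x, y>_L = -x0*y0 + sum_i xi*yi   (time-like first coordinate)
    return -x[..., 0] * y[..., 0] + (x[..., 1:] * y[..., 1:]).sum(-1)

def expmap_origin(z, c=1.0):
    # Map a Euclidean feature z in R^d onto the hyperboloid in R^{d+1}.
    sqrt_c = c ** 0.5
    norm = z.norm(dim=-1, keepdim=True).clamp_min(1e-8)
    x0 = torch.cosh(sqrt_c * norm) / sqrt_c
    xr = torch.sinh(sqrt_c * norm) * z / (sqrt_c * norm)
    return torch.cat([x0, xr], dim=-1)

def lorentz_dist(x, y, c=1.0):
    # d(x, y) = (1/sqrt(c)) * arccosh(-c * <x, y>_L)
    inner = (-c * lorentz_inner(x, y)).clamp_min(1.0 + 1e-7)
    return torch.acosh(inner) / (c ** 0.5)

z = torch.randn(4, 16)            # Euclidean region features
x = expmap_origin(z)              # points on the hyperboloid
print(lorentz_dist(x[:1], x))     # distances to the first point
```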
3.2 Semantic Hierarchy and Contrastive Losses
Region features are recursively merged using a similarity-based agglomeration operator to form a semantic tree of $L$ levels. The joint distance between features combines the Euclidean and hyperbolic terms:

$$d(z_i, z_j) = d_{\mathrm{E}}(z_i, z_j) + \lambda\, d_{\mathcal{L}}\big(\exp_o(z_i), \exp_o(z_j)\big).$$
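A greedy illustration of the agglomeration, assuming cosine similarity as a stand-in for the (negative) joint distance and level sizes matching the 16, 8, 4, 1 ratios reported in Section 7; the pairing strategy and names are placeholders:

```python
# Greedy agglomeration: repeatedly merge the most similar pair of region
# features (by averaging) until the target count for a tree level is reached.
import torch
import torch.nn.functional as F

def agglomerate(feats, target):
    # feats: (M, d) region features; merge greedily until `target` remain.
    feats = list(feats)
    while len(feats) > target:
        Z = F.normalize(torch.stack(feats), dim=-1)
        sim = Z @ Z.t()
        sim.fill_diagonal_(float('-inf'))           # ignore self-similarity
        i, j = divmod(int(sim.argmax()), sim.size(1))
        merged = 0.5 * (feats[i] + feats[j])
        for k in sorted((i, j), reverse=True):
            feats.pop(k)
        feats.append(merged)
    return torch.stack(feats)

regions = torch.randn(16, 256)
tree = [regions]
for m in (8, 4, 1):
    tree.append(agglomerate(tree[-1], m))
print([t.shape[0] for t in tree])  # [16, 8, 4, 1]
```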
At each tree level $l$, a supervised contrastive loss is applied:

$$\mathcal{L}_{\mathrm{con}}^{(l)} = -\frac{1}{B} \sum_{i=1}^{B} \frac{1}{|P(i)|} \sum_{p \in P(i)} \log \frac{\exp\big(-d(z_i, z_p)/\tau\big)}{\sum_{a \neq i} \exp\big(-d(z_i, z_a)/\tau\big)},$$

where $P(i)$ denotes the set of positive instances for anchor $i$ (same class) and $\tau$ is a temperature parameter.
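A sketch of this per-level contrastive term operating on precomputed pairwise joint distances; the batch construction is illustrative:

```python
# Supervised contrastive loss over distances: positives share a class label,
# distances replace dot-product similarities.
import torch

def sup_con_from_dist(dist, labels, tau=0.1):
    # dist: (B, B) pairwise joint distances; labels: (B,) class ids
    B = dist.size(0)
    eye = torch.eye(B, dtype=torch.bool)
    logits = (-dist / tau).masked_fill(eye, float('-inf'))  # exclude self-pairs
    log_prob = logits - torch.logsumexp(logits, dim=1, keepdim=True)
    pos = (labels[:, None] == labels[None, :]) & ~eye       # P(i): same-class pairs
    n_pos = pos.sum(dim=1).clamp_min(1)
    # average log-probability over positives, negated, averaged over anchors
    return -(log_prob.masked_fill(~pos, 0.0).sum(dim=1) / n_pos).mean()

d = torch.rand(8, 8); d = 0.5 * (d + d.t()); d.fill_diagonal_(0.0)
y = torch.tensor([0, 0, 1, 1, 2, 2, 3, 3])
print(sup_con_from_dist(d, y))
```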
A hyperbolic partial-order loss $\mathcal{L}_{\mathrm{po}}$ enforces locality between parent and child nodes in the semantic tree.
The full HHCL is $\mathcal{L}_{\mathrm{HHCL}} = \sum_{l=1}^{L} \mathcal{L}_{\mathrm{con}}^{(l)} + \beta\, \mathcal{L}_{\mathrm{po}}$. The final objective is $\mathcal{L} = \mathcal{L}_{\mathrm{CE}} + \mu\, \mathcal{L}_{\mathrm{HHCL}}$, combining the classification loss with the hierarchy-based structure loss.
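A one-function sketch of this combination; `beta` and `mu` are placeholder weights, not the paper's values:

```python
# Final objective: L = L_CE + mu * (sum_l L_con^(l) + beta * L_po).
import torch
import torch.nn.functional as F

def total_loss(logits, targets, con_losses, po_loss, beta=1.0, mu=1.0):
    hhcl = sum(con_losses) + beta * po_loss          # full HHCL term
    return F.cross_entropy(logits, targets) + mu * hhcl

print(total_loss(torch.randn(8, 200), torch.randint(0, 200, (8,)),
                 [torch.tensor(0.5)] * 4, torch.tensor(0.1)))
```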
4. Experimental Evaluation
H3Former is evaluated on four FGVC benchmarks: CUB-200-2011, NA-Birds, Stanford-Dogs, and Oxford-Flowers-101. Input images are tokenized with stride 14. The Swin-B backbone is configured with embedding dimensions {128, 256, 512, 1024} and {4, 8, 16, 32} attention heads across 12 layers. SAAM uses $M$ hyperedges with prototype dimension $d_p$; HHCL uses Lorentz curvature $c$, temperature $\tau$, balancing weights $\lambda$, $\beta$, $\mu$, and $L = 4$ hierarchical levels (see Section 7).
Top-1 Accuracy Comparisons on CUB-200-2011
| Method | CUB-200-2011 |
|---|---|
| TransFG (Swin-B) | 91.7% |
| IELT (ViT-B) | 91.8% |
| SR-GNN (Xception) | 91.9% |
| H³Former | 92.7% |
Similar performance gains (+0.7% to +4.7%) are reported on the other datasets, with H3Former achieving 99.7% on Flowers-101.
Ablation Results (Top-1 Accuracy)
| Model Variant | CUB-200-2011 | Stanford-Dogs |
|---|---|---|
| w/o SAAM & HHCL | 90.9% | 91.1% |
| +SAAM only | 92.5% | 95.2% |
| +HHCL only | 91.2% | 92.6% |
| SAAM + HHCL | 92.7% | 95.8% |
The ablations indicate that SAAM accounts for most of the improvement, with HHCL providing a smaller but complementary gain; combining both modules yields the best accuracy on both datasets.
5. Visualization and Qualitative Insights
Hyperedge visualizations: By projecting affinity values back onto the spatial layout, heatmaps demonstrate that different hyperedges consistently attend to semantically meaningful parts (e.g., beak, wing, tail, feet of birds) across instances, evidencing robust part discovery even under occlusion.
t-SNE visualizations: When only HHCL is applied, embedding clusters show improved compactness. With only SAAM, discrimination between regions is enhanced. Using both modules yields well-separated, compact clusters that align with class boundaries, signifying improved representation learning for fine-grained discrimination.
6. Technical Innovations and Significance
H3Former’s core innovation lies in two interlocking mechanisms:
- Semantic-Aware Aggregation Module (SAAM): A dynamic, multi-scale, weighted hypergraph mechanism that consolidates token representations into region-level features through high-order message passing, enabling richer semantic contextualization.
- Hyperbolic Hierarchical Contrastive Loss (HHCL): A dual-space (Euclidean and Lorentz hyperbolic) contrastive approach that leverages a semantic part hierarchy to simultaneously increase inter-class separation, enforce intra-class consistency, and preserve part–whole relationships.
This approach circumvents the limitations of previous feature selection and region proposal pipelines, offering a framework for token-to-region aggregation that is both semantically expressive and computationally tractable. The design achieves state-of-the-art results on established FGVC datasets using standard backbones and data augmentation (Zhang et al., 13 Nov 2025).
7. Hyperparameter Choices and Implementation Details
Key implementation hyperparameters:
- Backbone: Swin-B Transformer, {128,256,512,1024}-dim embeddings, {4,8,16,32} heads, 12 layers.
- SAAM: $M$ hyperedges, prototype dimension $d_p$, 2-layer shared MLP.
- HHCL: Lorentz curvature $c$, balancing parameters $\lambda$, $\beta$, $\mu$, temperature $\tau$, hierarchy depth $L = 4$ (level ratios 16, 8, 4, 1).
Ablations confirm that these settings maximize accuracy. The method runs on a single backbone and requires no explicit part or region annotations, making it efficient to deploy in practice (Zhang et al., 13 Nov 2025).