H3Former: Hypergraph-Based Visual Classification

Updated 9 January 2026

The paper introduces H3Former, a framework that leverages hypergraph-based semantic token aggregation and hyperbolic hierarchical contrastive loss to distinguish subtle inter-class differences in fine-grained visual tasks.
It employs a Swin-B Transformer backbone with multi-scale context generation and a Semantic-Aware Aggregation Module (SAAM) to capture high-order semantic dependencies from regional features.
Experimental evaluations on benchmark FGVC datasets demonstrate that integrating SAAM and HHCL significantly boosts classification accuracy, outperforming existing approaches.

H3Former is a framework designed for fine-grained visual classification (FGVC), targeting the challenge of distinguishing subtle inter-class differences and managing large intra-class variation. The architecture introduces hypergraph-based semantic-aware token aggregation using multi-scale context and incorporates a hyperbolic hierarchical contrastive loss to enforce semantic structure during representation learning. H3Former achieves state-of-the-art classification performance on multiple benchmark FGVC datasets using a single Transformer backbone and standard augmentation protocols (Zhang et al., 13 Nov 2025).

1. Architectural Overview

H3Former operates on input images resized to 448×448, utilizing a Swin-B Transformer backbone pretrained on ImageNet (22K for CUB/NA-Birds/Flowers, 1K for Dogs). Feature tokens are extracted at each of the four hierarchically organized Swin stages:

$X_s \in \mathbb{R}^{N_s \times C_s},\quad s = 1,\ldots,4$

where $N_s$ and $C_s$ are the number of spatial patches and channel dimension at stage $s$ , respectively.

The dataflow proceeds as follows:

Multi-scale Context Generation Module (CGM): Each $X_s$ $X_{s}$ is aggregated into three context vectors:
- $f_s^{avg} = \text{Linear}(\text{AvgPool}(X_s))$
- $f_s^{max} = \text{Linear}(\text{MaxPool}(X_s))$
- $f_s^{attn} = \text{Linear}\left(\sum_i V_s[i] X_s[i]\right)$ , where $V_s$ is derived from self-attention.
The twelve vectors ($4$ stages $\times$ $3$ stats) are concatenated:

$F = [f_1^{avg}, f_1^{max}, f_1^{attn}, \ldots, f_4^{attn}] \in \mathbb{R}^{12 \times C}$

$F$ is input to the Semantic-Aware Aggregation Module (SAAM), which builds and convolves a multi-scale weighted hypergraph over token features to yield region-level aggregates.
A gated residual fusion combines the original and hypergraph-refined tokens, followed by global pooling and a classifier.
Simultaneously, M region features are extracted for construction of a semantic hierarchy and computation of hyperbolic hierarchical contrastive loss (HHCL).

2. Semantic-Aware Aggregation Module (SAAM)

2.1 Multi-scale Weighted Hypergraph Construction

The context matrix $F$ is split into $M$ groups along the channel axis: $F_{(m)} \in \mathbb{R}^{(12/M) \times C}$ for $m = 1, \ldots, M$ . Each group is mapped by a shared MLP $\phi(\cdot)$ , then combined with a learnable prototype $P_m$ to produce hyperedge prototypes:

$K_m = \phi(F_{(m)}) + P_m, \quad K_m \in \mathbb{R}^{d_k}$

Final-stage tokens $X \in \mathbb{R}^{N \times C}$ are projected to $Q \in \mathbb{R}^{N \times d_k}$ using a learned matrix $W_q$ , and token-hyperedge affinity is softmax-normalized:

$A_{i, m} = \frac{\exp(Q_i^\top K_m / \sqrt{d_k})}{\sum_{m'=1}^M \exp(Q_i^\top K_{m'} / \sqrt{d_k})}, \quad A \in \mathbb{R}^{N \times M}$

2.2 Hypergraph Convolution and Residual Fusion

The hypergraph convolution propagates information via a vertex-edge-vertex (V→E→V) message passing:

Aggregate to hyperedges: $H_e = (A^\top X) W_e \in \mathbb{R}^{M \times C}$
Propagate back to nodes: $X' = (A H_e) W_v \in \mathbb{R}^{N \times C}$
Residual combination with a learned gate $g \in \mathbb{R}^N$ : $\hat{X} = X + (g \odot X')$

This mechanism enables high-order semantic dependencies to be captured across spatial regions for region-level discrimination.

3. Hyperbolic Hierarchical Contrastive Loss (HHCL)

3.1 Hyperbolic Embedding via the Lorentz Model

Region features (from SAAM) are embedded on the $d$ -dimensional hyperboloid:

$\mathbb{L}^d = \{ x \in \mathbb{R}^{d+1} : \langle x, x \rangle_\mathbb{L} = -1, x_0 > 0 \}$

with Lorentzian inner product $\langle x, y \rangle_\mathbb{L} = -x_0 y_0 + x_s^\top y_s$ . The exponential map from Euclidean $z$ is:

$\text{exp}_0(z) = (\sqrt{1 + \|z\|^2}, z)$

The hyperbolic distance is defined as $d_\mathbb{L}(x, y) = \operatorname{arcosh}(-\langle x, y \rangle_\mathbb{L})$ .

3.2 Semantic Hierarchy and Contrastive Losses

Region features are recursively merged using a similarity-based agglomeration operator to form a semantic tree of $L$ levels. The joint distance between features $z_i, z_j$ is:

$D_{i,j} = \|z_i - z_j\|_2 + \lambda \cdot d_\mathbb{L}(\text{exp}_0(z_i), \text{exp}_0(z_j))$

At each tree level, a supervised contrastive loss is applied:

$\mathcal{L}_{con} = -\sum_i \frac{1}{|P(i)|} \sum_{p \in P(i)} \log \frac{\exp(-D_{i,p}/\tau)}{\sum_{a\neq i} \exp(-D_{i,a}/\tau)}$

where $P(i)$ denotes positive instances (same class), $\tau$ is a temperature parameter.

A hyperbolic partial-order loss enforces locality between parents and children in the semantic tree:

$\mathcal{L}_{hpop} = \frac{1}{L-1} \sum_{\ell=1}^{L-1} \mathrm{ReLU}[d_\mathbb{L}(\mathrm{exp}_0(H_i^{\ell+1}), \mathrm{exp}_0(H_i^\ell))]$

The full HHCL is $\mathcal{L}_{HHCL} = \mathcal{L}_{con} + \beta \mathcal{L}_{hpop}$ . The final objective is $\mathcal{L}_{total} = \mathcal{L}_{CE} + \alpha \mathcal{L}_{HHCL}$ , combining classification and hierarchy-based structure loss.

4. Experimental Evaluation

H3Former is evaluated on four FGVC benchmarks: CUB-200-2011, NA-Birds, Stanford-Dogs, and Oxford-Flowers-101. Images are processed into $14 \times 14$ tokens (stride 14). The Swin-B backbone is configured with dimensions $\{128,256,512,1024\}$ , heads $\{4,8,16,32\}$ across 12 layers. SAAM uses $M=16$ hyperedges, prototype dimension $d_k=64$ ; HHCL uses curvature $\kappa=0.1$ , $\lambda=1.0$ , $\tau=0.1$ , $\beta=0.1$ , and $L=4$ hierarchical levels.

Top-1 Accuracy Comparisons on CUB-200-2011

Method	CUB-200-2011
TransFG (Swin-B)	91.7%
IELT (ViT-B)	91.8%
SR-GNN (Xception)	91.9%
H³Former	92.7%

Similar performance gains (+0.7% to +4.7%) are reported on the other datasets, with H3Former achieving 99.7% on Flowers-101.

Ablation Results

Model Variant	CUB	Dogs
w/o SAAM & HHCL	90.9%	91.1%
+SAAM only	92.5%	95.2%
+HHCL only	91.2%	92.6%
SAAM + HHCL	92.7%	95.8%

The ablations highlight the contribution of both SAAM and HHCL to classification accuracy.

5. Visualization and Qualitative Insights

Hyperedge visualizations: By projecting affinity values $A_{:,m}$ back onto the spatial layout, heatmaps demonstrate that different hyperedges consistently attend to semantically meaningful parts (e.g., beak, wing, tail, feet of birds) across instances, evidencing robust part discovery even under occlusion.

t-SNE visualizations: When only HHCL is applied, embedding clusters show improved compactness. With only SAAM, discrimination between regions is enhanced. Using both modules yields well-separated, compact clusters that align with class boundaries, signifying improved representation learning for fine-grained discrimination.

6. Technical Innovations and Significance

H3Former’s core innovation lies in two interlocking mechanisms:

Semantic-Aware Aggregation Module (SAAM): A dynamic, multi-scale, weighted hypergraph mechanism that consolidates token representations into region-level features through high-order message passing, enabling richer semantic contextualization.
Hyperbolic Hierarchical Contrastive Loss (HHCL): A dual-space (Euclidean and Lorentz hyperbolic) contrastive approach that leverages a semantic part hierarchy to simultaneously increase inter-class separation, enforce intra-class consistency, and preserve part–whole relationships.

This approach circumvents the limitations of previous feature selection and region proposal pipelines, offering a framework for token-to-region aggregation that is both semantically expressive and computationally tractable. The design achieves state-of-the-art results on established FGVC datasets using standard backbones and data augmentation (Zhang et al., 13 Nov 2025).

7. Hyperparameter Choices and Implementation Details

Key implementation hyperparameters:

Backbone: Swin-B Transformer, {128,256,512,1024}-dim embeddings, {4,8,16,32} heads, 12 layers.
SAAM: $M=16$ hyperedges, prototype dimension $d_k=64$ , MLP $\phi(\cdot)$ with 2 layers.
HHCL: Lorentz curvature $\kappa = 0.1$ , balancing parameters $\lambda = 1.0$ , $\tau = 0.1$ , $\beta = 0.1$ , hierarchy depth $L=4$ (ratios 16, 8, 4, 1).

Performance is maximized with these settings, as confirmed via ablation. The entire method is compatible with a single backbone and requires no explicit part or region annotations, indicating efficiency in practical deployment (Zhang et al., 13 Nov 2025).

PDF Markdown Chat (Pro)

References (1)

H3Former: Hypergraph-based Semantic-Aware Aggregation via Hyperbolic Hierarchical Contrastive Loss for Fine-Grained Visual Classification (2025)

Whiteboard

Generate a whiteboard explanation of this topic.

Topic to Video (Beta)

Generate a video overview of this topic.

Follow Topic

Get notified by email when new papers are published related to H3Former.