H3Former: Hypergraph-Based Visual Classification

Updated 9 January 2026
  • The paper introduces H3Former, a framework that leverages hypergraph-based semantic token aggregation and hyperbolic hierarchical contrastive loss to distinguish subtle inter-class differences in fine-grained visual tasks.
  • It employs a Swin-B Transformer backbone with multi-scale context generation and a Semantic-Aware Aggregation Module (SAAM) to capture high-order semantic dependencies from regional features.
  • Experimental evaluations on benchmark FGVC datasets demonstrate that integrating SAAM and HHCL significantly boosts classification accuracy, outperforming existing approaches.

H3Former is a framework designed for fine-grained visual classification (FGVC), targeting the challenge of distinguishing subtle inter-class differences and managing large intra-class variation. The architecture introduces hypergraph-based semantic-aware token aggregation using multi-scale context and incorporates a hyperbolic hierarchical contrastive loss to enforce semantic structure during representation learning. H3Former achieves state-of-the-art classification performance on multiple benchmark FGVC datasets using a single Transformer backbone and standard augmentation protocols (Zhang et al., 13 Nov 2025).

1. Architectural Overview

H3Former operates on input images resized to 448×448 and uses a Swin-B Transformer backbone pretrained on ImageNet (ImageNet-22K for CUB, NA-Birds, and Flowers; ImageNet-1K for Dogs). Feature tokens are extracted at each of the four hierarchically organized Swin stages:

$$X_s \in \mathbb{R}^{N_s \times C_s},\quad s = 1,\ldots,4$$

where $N_s$ and $C_s$ are the number of spatial patches and the channel dimension at stage $s$, respectively.

The dataflow proceeds as follows:

  1. Multi-scale Context Generation Module (CGM): Each $X_s$ is aggregated into three context vectors:
    • $f_s^{avg} = \text{Linear}(\text{AvgPool}(X_s))$
    • $f_s^{max} = \text{Linear}(\text{MaxPool}(X_s))$
    • $f_s^{attn} = \text{Linear}\left(\sum_i V_s[i] X_s[i]\right)$, where $V_s$ is derived from self-attention.
  2. The twelve vectors ($4$ stages $\times$ $3$ statistics) are concatenated:

$$F = [f_1^{avg}, f_1^{max}, f_1^{attn}, \ldots, f_4^{attn}] \in \mathbb{R}^{12 \times C}$$

  3. $F$ is input to the Semantic-Aware Aggregation Module (SAAM), which builds and convolves a multi-scale weighted hypergraph over token features to yield region-level aggregates.
  4. A gated residual fusion combines the original and hypergraph-refined tokens, followed by global pooling and a classifier.
  5. Simultaneously, $M$ region features are extracted for construction of a semantic hierarchy and computation of the hyperbolic hierarchical contrastive loss (HHCL).
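The first two steps of the dataflow can be sketched in NumPy as follows. This is an illustrative sketch, not the authors' implementation: the stage shapes are toy values, the projection matrices are random, and the attention weights $V_s$ are replaced by a softmax over a hypothetical per-token score (`w_score`), since the source only states that $V_s$ is derived from self-attention.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - x.max())
    return e / e.sum()

def context_vectors(X_s, W_avg, W_max, W_attn, w_score):
    """Return [f_avg, f_max, f_attn] for one stage.

    X_s: (N_s, C_s) stage tokens; W_*: (C_s, C) linear projections.
    w_score: (C_s,) hypothetical scoring vector standing in for V_s.
    """
    f_avg = X_s.mean(axis=0) @ W_avg                      # average pooling, then linear
    f_max = X_s.max(axis=0) @ W_max                       # max pooling, then linear
    V_s = softmax(X_s @ w_score)                          # (N_s,) weights summing to 1
    f_attn = (V_s[:, None] * X_s).sum(axis=0) @ W_attn    # weighted token sum, then linear
    return [f_avg, f_max, f_attn]

rng = np.random.default_rng(0)
C = 32                                                    # shared context dimension (toy value)
stage_shapes = [(256, 16), (64, 32), (16, 64), (4, 128)]  # toy (N_s, C_s) per stage

rows = []
for N_s, C_s in stage_shapes:
    X_s = rng.standard_normal((N_s, C_s))
    W_avg, W_max, W_attn = (rng.standard_normal((C_s, C)) for _ in range(3))
    w_score = rng.standard_normal(C_s)
    rows.extend(context_vectors(X_s, W_avg, W_max, W_attn, w_score))

F = np.stack(rows)   # order: f_1^avg, f_1^max, f_1^attn, ..., f_4^attn
print(F.shape)       # (12, 32)
```

Projecting every pooled vector to a common width $C$ is what allows the four stages, which have different channel dimensions, to be stacked into a single $12 \times C$ matrix.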

2. Semantic-Aware Aggregation Module (SAAM)

2.1 Multi-scale Weighted Hypergraph Construction

The context matrix $F$ is split into $M$ groups along the channel axis: $F_{(m)} \in \mathbb{R}^{(12/M) \times C}$ for $m = 1, \ldots, M$. Each group is mapped by a shared MLP $\phi(\cdot)$, then combined with a learnable prototype $P_m$ to produce hyperedge prototypes:

$$K_m = \phi(F_{(m)}) + P_m, \quad K_m \in \mathbb{R}^{d_k}$$

Final-stage tokens $X \in \mathbb{R}^{N \times C}$ are projected to $Q \in \mathbb{R}^{N \times d_k}$ using a learned matrix $W_q$, and the token-hyperedge affinity is softmax-normalized:

$$A_{i, m} = \frac{\exp(Q_i^\top K_m / \sqrt{d_k})}{\sum_{m'=1}^M \exp(Q_i^\top K_{m'} / \sqrt{d_k})}, \quad A \in \mathbb{R}^{N \times M}$$
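As a minimal sketch of the affinity computation, the following NumPy snippet evaluates $A$ for random inputs. The hyperedge prototypes $K$ are taken as given (random here) rather than produced by $\phi(\cdot)$ and $P_m$, and the function name is an illustrative choice; the sizes $N=196$, $C=1024$, $M=16$, and $d_k=64$ follow the values reported in the evaluation setup.

```python
import numpy as np

def token_hyperedge_affinity(X, W_q, K):
    """Softmax-normalized affinity between tokens and hyperedge prototypes.

    X: (N, C) final-stage tokens, W_q: (C, d_k) query projection,
    K: (M, d_k) hyperedge prototypes. Returns A: (N, M) with rows summing to 1.
    """
    Q = X @ W_q                                      # (N, d_k)
    logits = Q @ K.T / np.sqrt(K.shape[1])           # scaled dot products, (N, M)
    logits = logits - logits.max(axis=1, keepdims=True)  # stabilize the exponentials
    A = np.exp(logits)
    return A / A.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
N, C, M, d_k = 196, 1024, 16, 64
X = rng.standard_normal((N, C))
W_q = rng.standard_normal((C, d_k)) / np.sqrt(C)     # random stand-in for the learned matrix
K = rng.standard_normal((M, d_k))                    # random stand-in for the prototypes
A = token_hyperedge_affinity(X, W_q, K)
print(A.shape)  # (196, 16)
```

Because the softmax runs over the hyperedge index, each token distributes a total weight of one across the $M$ hyperedges, which makes $A$ a soft incidence matrix.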

2.2 Hypergraph Convolution and Residual Fusion

The hypergraph convolution propagates information via a vertex-edge-vertex (V→E→V) message passing:

  1. Aggregate to hyperedges: $H_e = (A^\top X) W_e \in \mathbb{R}^{M \times C}$
  2. Propagate back to nodes: $X' = (A H_e) W_v \in \mathbb{R}^{N \times C}$
  3. Residual combination with a learned gate $g \in \mathbb{R}^N$: $\hat{X} = X + (g \odot X')$

This mechanism enables high-order semantic dependencies to be captured across spatial regions for region-level discrimination.
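The three steps of the V→E→V pass translate almost directly into matrix products. The sketch below assumes square weight matrices $W_e, W_v \in \mathbb{R}^{C \times C}$ (implied by the stated output shapes), broadcasts the per-token gate $g$ across channels, and uses a sigmoid to produce gate values, which is an assumption since the source does not specify how $g$ is computed.

```python
import numpy as np

def hypergraph_conv(X, A, W_e, W_v, g):
    """One vertex -> hyperedge -> vertex pass with a gated residual.

    X: (N, C) tokens, A: (N, M) affinities, W_e and W_v: (C, C), g: (N,) gate.
    """
    H_e = (A.T @ X) @ W_e          # aggregate tokens into hyperedge features, (M, C)
    X_prime = (A @ H_e) @ W_v      # scatter hyperedge features back to tokens, (N, C)
    return X + g[:, None] * X_prime  # per-token gate broadcast over channels

rng = np.random.default_rng(0)
N, C, M = 196, 64, 16              # C is reduced from 1024 to keep the toy example light
X = rng.standard_normal((N, C))
logits = rng.standard_normal((N, M))
A = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
W_e = rng.standard_normal((C, C)) / np.sqrt(C)
W_v = rng.standard_normal((C, C)) / np.sqrt(C)
g = 1.0 / (1.0 + np.exp(-rng.standard_normal(N)))  # sigmoid gate (assumed form)

X_hat = hypergraph_conv(X, A, W_e, W_v, g)
print(X_hat.shape)  # (196, 64)
```

Setting the gate to zero recovers the input tokens unchanged, so the gate controls how much hypergraph-refined context each token absorbs.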

3. Hyperbolic Hierarchical Contrastive Loss (HHCL)

3.1 Hyperbolic Embedding via the Lorentz Model

Region features (from SAAM) are embedded on the $d$-dimensional hyperboloid:

$$\mathbb{L}^d = \{ x \in \mathbb{R}^{d+1} : \langle x, x \rangle_\mathbb{L} = -1,\ x_0 > 0 \}$$

with Lorentzian inner product $\langle x, y \rangle_\mathbb{L} = -x_0 y_0 + x_s^\top y_s$. The exponential map from Euclidean $z$ is:

$$\text{exp}_0(z) = (\sqrt{1 + \|z\|^2}, z)$$

The hyperbolic distance is defined as $d_\mathbb{L}(x, y) = \operatorname{arcosh}(-\langle x, y \rangle_\mathbb{L})$.
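The three formulas above can be checked numerically with a short NumPy sketch. It implements the map exactly as written, $(\sqrt{1 + \|z\|^2}, z)$, and uses the unit-curvature constraint $\langle x, x \rangle_\mathbb{L} = -1$; the function names are illustrative, and the clipping before `arccosh` is a numerical safeguard added here, not something stated in the source.

```python
import numpy as np

def lift(z):
    """Map Euclidean z (..., d) onto the hyperboloid as (sqrt(1 + |z|^2), z)."""
    x0 = np.sqrt(1.0 + np.sum(z * z, axis=-1, keepdims=True))
    return np.concatenate([x0, z], axis=-1)

def lorentz_inner(x, y):
    """Lorentzian inner product: -x_0 y_0 + <x_s, y_s>."""
    return -x[..., 0] * y[..., 0] + np.sum(x[..., 1:] * y[..., 1:], axis=-1)

def lorentz_distance(x, y):
    """arcosh(-<x, y>_L); the clip guards against arguments just below 1 from rounding."""
    return np.arccosh(np.clip(-lorentz_inner(x, y), 1.0, None))

origin = lift(np.zeros(2))                    # (1, 0, 0)
p = lift(np.array([np.sinh(1.0), 0.0]))       # (cosh 1, sinh 1, 0)

print(lorentz_inner(p, p))                    # close to -1: p lies on the hyperboloid
print(lorentz_distance(origin, p))            # close to 1, since arcosh(cosh 1) = 1
```

The second check illustrates the geometry: a Euclidean vector of norm $\sinh r$ is mapped to a point at hyperbolic distance $r$ from the origin of the hyperboloid.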

3.2 Semantic Hierarchy and Contrastive Losses

Region features are recursively merged using a similarity-based agglomeration operator to form a semantic tree of $L$ levels. The joint distance between features $z_i, z_j$ is:

$$D_{i,j} = \|z_i - z_j\|_2 + \lambda \cdot d_\mathbb{L}(\text{exp}_0(z_i), \text{exp}_0(z_j))$$

At each tree level, a supervised contrastive loss is applied:

$$\mathcal{L}_{con} = -\sum_i \frac{1}{|P(i)|} \sum_{p \in P(i)} \log \frac{\exp(-D_{i,p}/\tau)}{\sum_{a\neq i} \exp(-D_{i,a}/\tau)}$$

where $P(i)$ denotes the set of positive instances for anchor $i$ (same class) and $\tau$ is a temperature parameter.
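A minimal NumPy sketch of the joint distance and the supervised contrastive term at a single tree level is given below. It follows the formulas as written (sum over anchors, unit curvature), and it makes one assumption not stated in the source: anchors with no positives are skipped. Function names and the toy data are illustrative.

```python
import numpy as np

def joint_distance(Z, lam=1.0):
    """D[i, j] = ||z_i - z_j||_2 + lam * d_L(exp_0(z_i), exp_0(z_j)) for rows of Z."""
    diff = Z[:, None, :] - Z[None, :, :]
    d_euc = np.sqrt((diff ** 2).sum(axis=-1))
    x0 = np.sqrt(1.0 + (Z ** 2).sum(axis=-1))        # time coordinate of each lifted point
    inner = -np.outer(x0, x0) + Z @ Z.T              # pairwise Lorentzian inner products
    d_hyp = np.arccosh(np.clip(-inner, 1.0, None))   # clip guards against rounding below 1
    return d_euc + lam * d_hyp

def sup_con_loss(Z, labels, lam=1.0, tau=0.1):
    """Supervised contrastive loss on joint distances; anchors without positives are skipped."""
    n = len(labels)
    logits = -joint_distance(Z, lam) / tau
    not_self = ~np.eye(n, dtype=bool)
    masked = np.where(not_self, logits, -np.inf)     # denominator excludes a = i
    m = masked.max(axis=1, keepdims=True)
    log_denom = m[:, 0] + np.log(np.exp(masked - m).sum(axis=1))
    log_prob = logits - log_denom[:, None]
    pos = (labels[:, None] == labels[None, :]) & not_self
    n_pos = pos.sum(axis=1)
    has_pos = n_pos > 0
    pos_sum = np.where(pos, log_prob, 0.0).sum(axis=1)
    return -(pos_sum[has_pos] / n_pos[has_pos]).sum()

# Two tight clusters: the loss is small when labels match the clusters.
Z = np.array([[0.0, 0.0], [0.1, 0.0], [3.0, 3.0], [3.1, 3.0]])
loss_good = sup_con_loss(Z, np.array([0, 0, 1, 1]))
loss_bad = sup_con_loss(Z, np.array([0, 1, 0, 1]))
print(loss_good < loss_bad)  # True
```

Each positive also appears in its anchor's denominator, so every log-ratio is non-positive and the loss is non-negative; it approaches zero when positives are much closer than all other samples.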

A hyperbolic partial-order loss enforces locality between parents and children in the semantic tree:

$$\mathcal{L}_{hpop} = \frac{1}{L-1} \sum_{\ell=1}^{L-1} \mathrm{ReLU}[d_\mathbb{L}(\mathrm{exp}_0(H_i^{\ell+1}), \mathrm{exp}_0(H_i^\ell))]$$

The full HHCL is $\mathcal{L}_{HHCL} = \mathcal{L}_{con} + \beta \mathcal{L}_{hpop}$. The final objective is $\mathcal{L}_{total} = \mathcal{L}_{CE} + \alpha \mathcal{L}_{HHCL}$, combining the classification loss with the hierarchy-based structure loss.
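The partial-order term and the way the losses combine can be sketched as below. The snippet evaluates $\mathcal{L}_{hpop}$ for a single chain of features $H^1, \ldots, H^L$ (one vector per tree level), which is an interpretation of the index $i$ in the formula; the value of $\alpha$ is not given in the source, so the one used in the example is an arbitrary placeholder.

```python
import numpy as np

def lorentz_distance_from_euclidean(u, v):
    """d_L(exp_0(u), exp_0(v)) with exp_0(z) = (sqrt(1 + |z|^2), z)."""
    u0 = np.sqrt(1.0 + u @ u)
    v0 = np.sqrt(1.0 + v @ v)
    inner = -u0 * v0 + u @ v
    return float(np.arccosh(max(-inner, 1.0)))   # max guards against rounding below 1

def hpop_loss(chain):
    """Mean ReLU'd distance between consecutive levels of one parent-child chain."""
    L = len(chain)
    terms = [max(lorentz_distance_from_euclidean(chain[l + 1], chain[l]), 0.0)
             for l in range(L - 1)]
    return sum(terms) / (L - 1)

def total_loss(l_ce, l_con, l_hpop, alpha, beta=0.1):
    """L_total = L_CE + alpha * (L_con + beta * L_hpop)."""
    return l_ce + alpha * (l_con + beta * l_hpop)

rng = np.random.default_rng(0)
chain = [rng.standard_normal(8) * 0.1 for _ in range(4)]   # L = 4 levels, toy features
l_hpop = hpop_loss(chain)
print(total_loss(l_ce=1.0, l_con=2.0, l_hpop=3.0, alpha=0.5))  # alpha=0.5 is a placeholder
```

Since the distance is already non-negative, the ReLU leaves it unchanged in this form; the term shrinks as child features move closer to their parents in hyperbolic space.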

4. Experimental Evaluation

H3Former is evaluated on four FGVC benchmarks: CUB-200-2011, NA-Birds, Stanford-Dogs, and Oxford-Flowers-101. Images are processed into $14 \times 14$ tokens (stride 14). The Swin-B backbone is configured with dimensions $\{128,256,512,1024\}$ and heads $\{4,8,16,32\}$ across 12 layers. SAAM uses $M=16$ hyperedges and prototype dimension $d_k=64$; HHCL uses curvature $\kappa=0.1$, $\lambda=1.0$, $\tau=0.1$, $\beta=0.1$, and $L=4$ hierarchical levels.

Top-1 Accuracy Comparisons on CUB-200-2011

| Method | CUB-200-2011 |
| --- | --- |
| TransFG (Swin-B) | 91.7% |
| IELT (ViT-B) | 91.8% |
| SR-GNN (Xception) | 91.9% |
| H3Former | 92.7% |

Similar performance gains (+0.7% to +4.7%) are reported on the other datasets, with H3Former achieving 99.7% on Flowers-101.

Ablation Results

| Model Variant | CUB | Dogs |
| --- | --- | --- |
| w/o SAAM & HHCL | 90.9% | 91.1% |
| +SAAM only | 92.5% | 95.2% |
| +HHCL only | 91.2% | 92.6% |
| SAAM + HHCL | 92.7% | 95.8% |

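The ablations show that both modules contribute to classification accuracy. Relative to the baseline, SAAM alone adds 1.6 points on CUB and 4.1 on Dogs, HHCL alone adds 0.3 and 1.5, and the combination adds 1.8 and 4.7.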

5. Visualization and Qualitative Insights

Hyperedge visualizations: By projecting affinity values $A_{:,m}$ back onto the spatial layout, heatmaps demonstrate that different hyperedges consistently attend to semantically meaningful parts (e.g., beak, wing, tail, feet of birds) across instances, evidencing robust part discovery even under occlusion.
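A hedged sketch of how such a heatmap could be produced from an affinity matrix: one column of $A$ is reshaped to the $14 \times 14$ token grid, rescaled to $[0, 1]$, and upsampled to the $448 \times 448$ input resolution by block repetition. The row-major token ordering, the min-max rescaling, and the nearest-neighbor upsampling are assumptions made for illustration; the source does not describe the visualization procedure in this detail.

```python
import numpy as np

def hyperedge_heatmap(A, m, grid=(14, 14)):
    """Reshape column m of A (N, M) to the token grid and rescale to [0, 1]."""
    h, w = grid
    if A.shape[0] != h * w:
        raise ValueError("number of tokens does not match the grid size")
    heat = A[:, m].reshape(h, w)          # assumes row-major token order
    lo, hi = heat.min(), heat.max()
    if hi == lo:
        return np.zeros_like(heat)
    return (heat - lo) / (hi - lo)

def upsample_nearest(heat, factor):
    """Repeat each cell into a factor x factor block (nearest-neighbor upsampling)."""
    return np.kron(heat, np.ones((factor, factor)))

rng = np.random.default_rng(0)
logits = rng.standard_normal((196, 16))   # random stand-in for real affinities
A = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

heat = hyperedge_heatmap(A, m=3)
overlay = upsample_nearest(heat, 448 // 14)
print(heat.shape, overlay.shape)  # (14, 14) (448, 448)
```

The resulting array can be alpha-blended over the input image with any plotting library to inspect which region a given hyperedge attends to.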

t-SNE visualizations: When only HHCL is applied, embedding clusters show improved compactness. With only SAAM, discrimination between regions is enhanced. Using both modules yields well-separated, compact clusters that align with class boundaries, signifying improved representation learning for fine-grained discrimination.

6. Technical Innovations and Significance

H3Former’s core innovation lies in two interlocking mechanisms:

  1. Semantic-Aware Aggregation Module (SAAM): A dynamic, multi-scale, weighted hypergraph mechanism that consolidates token representations into region-level features through high-order message passing, enabling richer semantic contextualization.
  2. Hyperbolic Hierarchical Contrastive Loss (HHCL): A dual-space (Euclidean and Lorentz hyperbolic) contrastive approach that leverages a semantic part hierarchy to simultaneously increase inter-class separation, enforce intra-class consistency, and preserve part–whole relationships.

This approach circumvents the limitations of previous feature selection and region proposal pipelines, offering a framework for token-to-region aggregation that is both semantically expressive and computationally tractable. The design achieves state-of-the-art results on established FGVC datasets using standard backbones and data augmentation (Zhang et al., 13 Nov 2025).

7. Hyperparameter Choices and Implementation Details

Key implementation hyperparameters:

  • Backbone: Swin-B Transformer, {128,256,512,1024}-dim embeddings, {4,8,16,32} heads, 12 layers.
  • SAAM: $M=16$ hyperedges, prototype dimension $d_k=64$, MLP $\phi(\cdot)$ with 2 layers.
  • HHCL: Lorentz curvature $\kappa = 0.1$, hyperbolic distance weight $\lambda = 1.0$, temperature $\tau = 0.1$, partial-order loss weight $\beta = 0.1$, hierarchy depth $L=4$ (ratios 16, 8, 4, 1).

Performance is maximized with these settings, as confirmed via ablation. The entire method is compatible with a single backbone and requires no explicit part or region annotations, indicating efficiency in practical deployment (Zhang et al., 13 Nov 2025).
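For reference, the reported settings can be gathered in a small configuration object. The sketch below uses a frozen Python dataclass; the class and field names are hypothetical and chosen for illustration, while the values are the ones listed above. The weight $\alpha$ is omitted because the source does not report its value.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class H3FormerConfig:
    """Hyperparameters as reported in the summary; names are illustrative."""
    image_size: int = 448
    embed_dims: tuple = (128, 256, 512, 1024)   # Swin-B stage widths
    num_heads: tuple = (4, 8, 16, 32)           # attention heads per stage
    num_hyperedges: int = 16                    # M
    prototype_dim: int = 64                     # d_k
    phi_layers: int = 2                         # depth of the MLP phi
    curvature: float = 0.1                      # kappa
    lam: float = 1.0                            # weight of the hyperbolic distance
    tau: float = 0.1                            # contrastive temperature
    beta: float = 0.1                           # weight of the partial-order loss
    hierarchy_levels: int = 4                   # L
    hierarchy_ratios: tuple = (16, 8, 4, 1)

cfg = H3FormerConfig()
head_dims = [d // h for d, h in zip(cfg.embed_dims, cfg.num_heads)]
print(head_dims)  # [32, 32, 32, 32]
```

The per-head width works out to 32 at every stage, and the number of hierarchy ratios matches the stated depth $L=4$, which gives two quick consistency checks on the listed values.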
