
HiSCE: Hierarchy-Sibling Smoothed Cross-Entropy

Updated 31 December 2025
  • HiSCE is a hierarchy-aware loss that applies sibling-sensitive smoothing to promote taxonomic consistency in vision-language models.
  • It integrates seamlessly with fine-tuning frameworks using low-rank adaptations, enabling robust predictions across multi-level classifications.
  • Empirical results show improved Full-Path Accuracy and reduced Tree-based Inconsistency Error on benchmarks like CUB-200-2011 and FGVC-Aircraft.

Hierarchy-Sibling Smoothed Cross-Entropy (HiSCE) is a hierarchy-aware objective formulated for fine-tuning vision-language models (VLMs) under structured taxonomies, particularly in scenarios where class labels possess multi-level granularity such as order, family, and species. The HiSCE loss, proposed by the authors of “Hierarchy-Aware Fine-Tuning of Vision-LLMs” (Li et al., 25 Dec 2025), introduces sibling-sensitive smoothing into cross-entropy, encouraging robust and taxonomically consistent predictions while remaining compatible with efficient parameter adaptation schemes.

1. Formal Construction and Notation

Let $L$ denote the depth of the taxonomy and $\{y^{(1)}, y^{(2)}, \ldots, y^{(L)}\}$ the path of ground-truth labels per example, where $y^{(l)} \in \{1, \ldots, C_l\}$ indicates the correct class at taxonomy level $l$. For each class $i$ at level $l$, its sibling set is defined as $S(i) = \{\, j \in \{1,\ldots, C_l\} \mid j \ne i,\ \text{parent}(j)=\text{parent}(i) \,\}$. Model predictions at level $l$ are $p^{(l)} = \text{Softmax}(z^{(l)})$ over class scores $z^{(l)}$ (image–text cosine similarities in multimodal VLMs).
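As a concrete illustration, the sibling sets can be derived directly from a parent-index map. The sketch below assumes the taxonomy at one level is given as a Python list `parent`, where `parent[i]` is the index of class `i`'s parent at the level above; the function name and data layout are illustrative, not taken from the paper.

```python
# Minimal sketch: deriving sibling sets S(i) at a single taxonomy level from a
# hypothetical parent-index list; indices and layout are illustrative.
from collections import defaultdict

def sibling_sets(parent):
    """Return {class index: set of sibling indices} for one taxonomy level."""
    children = defaultdict(set)
    for cls, par in enumerate(parent):
        children[par].add(cls)
    # S(i) excludes i itself, per the definition above.
    return {cls: children[parent[cls]] - {cls} for cls in range(len(parent))}

# Toy level with 5 classes under 2 parents: classes 0-2 share parent 0, classes 3-4 share parent 1.
parent = [0, 0, 0, 1, 1]
print(sibling_sets(parent))  # {0: {1, 2}, 1: {0, 2}, 2: {0, 1}, 3: {4}, 4: {3}}
```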

HiSCE operationalizes a categorical smoothing matrix $T^{(l)}$ for each level $l$:

$$T^{(l)}_{ij} =
\begin{cases}
1 - \epsilon_l, & \text{if } j = i, \\
\dfrac{\epsilon_l}{|S(i)|}, & \text{if } j \in S(i), \\
0, & \text{otherwise}.
\end{cases}$$
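The matrix $T^{(l)}$ can be materialized as a dense $C_l \times C_l$ tensor; the following is a minimal PyTorch sketch assuming the same `parent` map as above. The fallback to a one-hot target for classes without siblings is an assumption, since the definition leaves that case implicit.

```python
# Hypothetical construction of the smoothing matrix T^{(l)}; `parent` and `eps`
# follow the notation above, and the no-sibling fallback is an assumption.
import torch

def smoothing_matrix(parent, eps):
    C = len(parent)
    parent = torch.as_tensor(parent)
    same_parent = parent.unsqueeze(0) == parent.unsqueeze(1)  # [i, j]: i and j share a parent
    eye = torch.eye(C, dtype=torch.bool)
    sib = same_parent & ~eye                                  # sibling indicator, j in S(i)
    n_sib = sib.sum(dim=1, keepdim=True).clamp(min=1)         # |S(i)|, guarded against zero
    T = (1.0 - eps) * eye + eps * sib / n_sib                 # diagonal mass and sibling mass
    no_sib = sib.sum(dim=1) == 0
    T[no_sib] = eye[no_sib].float()                           # no-sibling classes keep a one-hot target (assumption)
    return T

T = smoothing_matrix([0, 0, 0, 1, 1], eps=0.1)
print(T.sum(dim=1))  # tensor([1., 1., 1., 1., 1.]) -- each row is a valid distribution
```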

Given a label $y^{(l)}=i$, the smoothed target distribution $\tilde y^{(l)} := T^{(l)}_{i,:}$ is used for cross-entropy at level $l$:

$$\mathcal{L}_{\text{HiSCE}}^{(l)} = -\sum_{c=1}^{C_l} \tilde y^{(l)}_c \log p^{(l)}_c,$$

and the total HiSCE loss is the sum over taxonomy levels:

$$\mathcal{L}_{\text{HiSCE}} = \sum_{l=1}^{L} \mathcal{L}_{\text{HiSCE}}^{(l)}.$$

Normalization is explicit: $(1-\epsilon_l) + |S(i)|\cdot\frac{\epsilon_l}{|S(i)|} = 1$ for each smoothed target distribution.
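Putting the pieces together, a compact sketch of the full objective could look as follows, assuming per-level logits $z^{(l)}$ (e.g. temperature-scaled image–text cosine similarities) and reusing `smoothing_matrix` from the sketch above; names, shapes, and the toy taxonomy are illustrative.

```python
# Illustrative end-to-end HiSCE computation (reuses smoothing_matrix from above).
import torch
import torch.nn.functional as F

def hisce_loss(logits_per_level, labels_per_level, T_per_level):
    """logits_per_level[l]: (batch, C_l); labels_per_level[l]: (batch,); T_per_level[l]: (C_l, C_l)."""
    total = 0.0
    for z, y, T in zip(logits_per_level, labels_per_level, T_per_level):
        log_p = F.log_softmax(z, dim=-1)                      # log p^(l)
        y_tilde = T[y]                                        # smoothed targets: rows T^(l)_{y,:}
        total = total - (y_tilde * log_p).sum(dim=-1).mean()  # per-level smoothed cross-entropy
    return total

# Toy two-level taxonomy: 2 coarse classes under one root, 5 fine classes with parents [0, 0, 0, 1, 1].
T_coarse = smoothing_matrix([0, 0], eps=0.1)
T_fine = smoothing_matrix([0, 0, 0, 1, 1], eps=0.1)
logits = [torch.randn(4, 2), torch.randn(4, 5)]
labels = [torch.tensor([0, 1, 0, 1]), torch.tensor([2, 4, 1, 3])]
print(hisce_loss(logits, labels, [T_coarse, T_fine]))
```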

2. Motivation and Rationale

Standard cross-entropy penalizes any deviation from the ground-truth class equally, driving the model to concentrate probability mass exclusively on that class. In hierarchical classification this induces overconfidence, particularly among fine-grained sibling categories, and frequently yields taxonomically inconsistent predictions such as selecting a leaf whose parent node is mismatched. By redistributing a small probability mass $\epsilon_l$ uniformly over siblings, HiSCE softens intra-level decision boundaries (“horizontal” smoothing), reflecting the semantic and visual proximity of sibling classes. The effect is to penalize “close” sibling misclassifications less severely than predictions outside the correct branch, directly reducing tree-structure inconsistency metrics while also enabling more robust uncertainty modeling for ambiguous visual inputs.

3. Integration with Hierarchy-Aware Fine-Tuning Frameworks

HiSCE is integrated into a multi-term fine-tuning objective, working synergistically with both standard cross-entropy at the leaf level ($\mathcal{L}_{\text{CE}}$) and a Tree-Path KL divergence ($\mathcal{L}_{\text{TP-KL}}$), which enforces pathwise vertical coherence:

$$\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{CE}} + \lambda_1\,\mathcal{L}_{\text{TP-KL}} + \lambda_2\,\mathcal{L}_{\text{HiSCE}}$$
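As a minimal sketch, the combined objective reduces to a weighted sum of the three scalar loss terms; the internals of the TP-KL term are not reconstructed here, and the default weights simply mirror the hyperparameter discussion in Section 5.

```python
# Hedged sketch: each term is assumed to be a scalar loss tensor computed elsewhere.
import torch

def total_loss(ce: torch.Tensor, tp_kl: torch.Tensor, hisce: torch.Tensor,
               lam1: float = 1.0, lam2: float = 1.0) -> torch.Tensor:
    # L_total = L_CE + lambda_1 * L_TP-KL + lambda_2 * L_HiSCE
    return ce + lam1 * tp_kl + lam2 * hisce
```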

All computations occur in the shared embedding space of frozen CLIP backbones adapted with LoRA modules. Only the low-rank LoRA matrices ($\sim$4.4M parameters) and layer norm weights are updated, achieving computational efficiency and low resource requirements for adaptation. HiSCE’s design ensures compatibility with lightweight parameter update protocols prevalent in scalable model fine-tuning.
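For illustration, a LoRA adaptation of a CLIP backbone along these lines could be set up with Hugging Face `peft`; the model ID, rank, and target modules below are assumptions for the sketch rather than the authors' reported configuration.

```python
# Hypothetical LoRA setup on a frozen CLIP backbone; hyperparameters are illustrative.
from transformers import CLIPModel
from peft import LoraConfig, get_peft_model

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
lora_cfg = LoraConfig(
    r=8,                                   # low-rank dimension (assumed)
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # attention projections in both encoders
)
model = get_peft_model(model, lora_cfg)    # base weights stay frozen; LoRA matrices are trainable

# Additionally unfreeze layer-norm weights, as described above.
for name, param in model.named_parameters():
    if "layernorm" in name.lower().replace("_", ""):
        param.requires_grad = True

model.print_trainable_parameters()         # only LoRA matrices and layer norms are updated
```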

4. Empirical Evaluation and Observed Gains

Empirical studies across four benchmark datasets (CUB-200-2011, FGVC-Aircraft, Butterfly-200, and ChestX-ray14) demonstrate that replacing or augmenting standard cross-entropy with HiSCE ($\epsilon_l = 0.1$, $\lambda_2 = 1.0$) yields substantial improvements in Full-Path Accuracy (FPA) and reductions in Tree-based Inconsistency Error (TICE). For example, on CUB-200-2011, FPA improves from 50.2 to 63.1 and TICE decreases from 21.9 to 10.8; comparable gains are observed across all reported benchmarks. Joint ablation with the TP-KL loss shows that the combination offers the best overall hierarchy-aware performance (Li et al., 25 Dec 2025), suggesting that both vertical and horizontal consistency are essential for taxonomy adaptation. Representative results are summarized below.

| Dataset | Metric | CE Baseline | HiSCE | CE+HiSCE+TP-KL (best) |
|---|---|---|---|---|
| CUB-200-2011 | FPA | 50.2 | 63.1 | [see article Table 9] |
| CUB-200-2011 | TICE | 21.9 | 10.8 | [see article Table 9] |
| FGVC-Aircraft | FPA | 38.3 | 57.0 | [see article Table 9] |
| FGVC-Aircraft | TICE | 17.9 | 11.7 | [see article Table 9] |

A plausible implication is that HiSCE directly mitigates taxonomic path inconsistencies associated with naïve cross-entropy objectives.
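The two metrics can be reconstructed for illustration: FPA is read here as the fraction of examples whose entire predicted path matches the ground truth, and TICE as the fraction of predictions whose path is not a valid parent–child chain in the taxonomy. Both readings are assumptions, so the exact definitions should be taken from the paper.

```python
# Hypothetical reconstructions of Full-Path Accuracy (FPA) and Tree-based
# Inconsistency Error (TICE); the definitions below are assumptions.
import numpy as np

def full_path_accuracy(preds_per_level, labels_per_level):
    """preds/labels: lists of (N,) int arrays, one per taxonomy level (coarse to fine)."""
    correct = np.ones_like(labels_per_level[0], dtype=bool)
    for p, y in zip(preds_per_level, labels_per_level):
        correct &= (p == y)                 # a path counts only if every level is right
    return correct.mean()

def tree_inconsistency_error(preds_per_level, parents_per_level):
    """parents_per_level[l][c] = parent class (at level l-1) of class c at level l, for l >= 1."""
    inconsistent = np.zeros_like(preds_per_level[0], dtype=bool)
    for l in range(1, len(preds_per_level)):
        expected_parent = parents_per_level[l][preds_per_level[l]]
        inconsistent |= (expected_parent != preds_per_level[l - 1])
    return inconsistent.mean()
```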

5. Hyperparameterization and Practical Considerations

HiSCE requires selection of a smoothing factor $\epsilon_l$ per taxonomy level, with recommended values in $[0.05, 0.2]$ and a default of $\epsilon_l = 0.1$. The loss weight $\lambda_2$ should be set to 1.0 or tuned on a small validation split for optimal performance. Balancing $\lambda_2$ against $\lambda_1$ (the TP-KL weight) can be done with a few Optuna trials; empirical observations indicate that a 1:1 ratio yields stable improvements across datasets. The methodology is robust to typical hyperparameter choices, suggesting broad applicability in hierarchical domains.
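A hedged sketch of the validation-based tuning described above, using Optuna; `train_and_validate` is a placeholder for the user's own fine-tuning routine, and the search ranges are assumptions.

```python
# Hypothetical Optuna search over eps, lambda_1, lambda_2.
import optuna

def train_and_validate(eps, lam1, lam2):
    # Placeholder: fine-tune with these hyperparameters and return validation
    # Full-Path Accuracy. A constant keeps the sketch runnable; replace it.
    return 0.0

def objective(trial):
    eps = trial.suggest_float("eps", 0.05, 0.2)      # recommended range from above
    lam1 = trial.suggest_float("lambda1", 0.5, 2.0)  # TP-KL weight
    lam2 = trial.suggest_float("lambda2", 0.5, 2.0)  # HiSCE weight
    return train_and_validate(eps, lam1, lam2)

study = optuna.create_study(direction="maximize")    # maximize validation FPA
study.optimize(objective, n_trials=10)               # "a few Optuna trials"
print(study.best_params)
```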

6. Taxonomy-Aware Label Smoothing: Implications and Significance

HiSCE generalizes conventional label smoothing by redistributing probability mass only among sibling classes at each level, rather than uniformly across all classes. This taxonomy-aware formulation ensures the fine-tuning process respects the target label structure, yielding more robust and consistent hierarchical predictions while pairing naturally with vertical coherence objectives (e.g., TP-KL). The approach is compatible with large-scale vision-language model fine-tuning using LoRA, offering scalability and efficiency. These characteristics position HiSCE as a practical choice for structured prediction in hierarchical classification tasks where path-consistent output and efficient adaptation are critical requirements.

References

Li et al. “Hierarchy-Aware Fine-Tuning of Vision-LLMs,” 25 December 2025.
