
HiSCE: Hierarchy-Sibling Smoothed Cross-Entropy

Updated 31 December 2025
  • HiSCE is a hierarchy-aware loss that applies sibling-sensitive smoothing to promote taxonomic consistency in vision-language models.
  • It integrates seamlessly with fine-tuning frameworks using low-rank adaptations, enabling robust predictions across multi-level classifications.
  • Empirical results show improved Full-Path Accuracy and reduced Tree-based Inconsistency Error on benchmarks like CUB-200-2011 and FGVC-Aircraft.

Hierarchy-Sibling Smoothed Cross-Entropy (HiSCE) is a hierarchy-aware objective formulated for fine-tuning vision-language models (VLMs) under structured taxonomies, particularly in scenarios where class labels possess multi-level granularity such as order, family, and species. The HiSCE loss, proposed by the authors of “Hierarchy-Aware Fine-Tuning of Vision-LLMs” (Li et al., 25 Dec 2025), introduces sibling-sensitive smoothing into cross-entropy, encouraging robust and taxonomically consistent predictions while remaining compatible with efficient parameter adaptation schemes.

1. Formal Construction and Notation

Let $L$ denote the depth of the taxonomy and $\{y^{(1)}, y^{(2)}, \ldots, y^{(L)}\}$ the path of ground-truth labels per example, where $y^{(l)} \in \{1, \ldots, C_l\}$ indicates the correct class at taxonomy level $l$. For each class $i$ at level $l$, its sibling set is defined as $S(i) = \{\, j \in \{1,\ldots, C_l\} \mid j \ne i,\ \text{parent}(j)=\text{parent}(i) \,\}$. Model predictions at level $l$ are $p^{(l)} = \text{Softmax}(z^{(l)})$ over class scores $z^{(l)}$ (image–text cosine similarities in multimodal VLMs).
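As a concrete illustration, the sibling sets can be derived directly from a parent-index map. The sketch below assumes the taxonomy at one level is given as a Python list `parent`, where `parent[i]` is the index of class `i`'s parent at the level above; the function name and data layout are illustrative, not taken from the paper.

```python
# Minimal sketch: deriving sibling sets S(i) at a single taxonomy level from a
# hypothetical parent-index list; indices and layout are illustrative.
from collections import defaultdict

def sibling_sets(parent):
    """Return {class index: set of sibling indices} for one taxonomy level."""
    children = defaultdict(set)
    for cls, par in enumerate(parent):
        children[par].add(cls)
    # S(i) excludes i itself, per the definition above.
    return {cls: children[parent[cls]] - {cls} for cls in range(len(parent))}

# Toy level with 5 classes under 2 parents: classes 0-2 share parent 0, classes 3-4 share parent 1.
parent = [0, 0, 0, 1, 1]
print(sibling_sets(parent))  # {0: {1, 2}, 1: {0, 2}, 2: {0, 1}, 3: {4}, 4: {3}}
```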

HiSCE operationalizes a categorical smoothing matrix $T^{(l)}$ for each level $l$:

$$T^{(l)}_{ij} =
\begin{cases}
1 - \epsilon_l, & \text{if } j = i, \\
\dfrac{\epsilon_l}{|S(i)|}, & \text{if } j \in S(i), \\
0, & \text{otherwise}.
\end{cases}$$
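The matrix $T^{(l)}$ can be materialized as a dense $C_l \times C_l$ tensor; the following is a minimal PyTorch sketch assuming the same `parent` map as above. The fallback to a one-hot target for classes without siblings is an assumption, since the definition leaves that case implicit.

```python
# Hypothetical construction of the smoothing matrix T^{(l)}; `parent` and `eps`
# follow the notation above, and the no-sibling fallback is an assumption.
import torch

def smoothing_matrix(parent, eps):
    C = len(parent)
    parent = torch.as_tensor(parent)
    same_parent = parent.unsqueeze(0) == parent.unsqueeze(1)  # [i, j]: i and j share a parent
    eye = torch.eye(C, dtype=torch.bool)
    sib = same_parent & ~eye                                  # sibling indicator, j in S(i)
    n_sib = sib.sum(dim=1, keepdim=True).clamp(min=1)         # |S(i)|, guarded against zero
    T = (1.0 - eps) * eye + eps * sib / n_sib                 # diagonal mass and sibling mass
    no_sib = sib.sum(dim=1) == 0
    T[no_sib] = eye[no_sib].float()                           # no-sibling classes keep a one-hot target (assumption)
    return T

T = smoothing_matrix([0, 0, 0, 1, 1], eps=0.1)
print(T.sum(dim=1))  # tensor([1., 1., 1., 1., 1.]) -- each row is a valid distribution
```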

Given a label $y^{(l)}=i$, the smoothed target distribution $\tilde y^{(l)} := T^{(l)}_{i,:}$ is used for cross-entropy at level $l$:

$$\mathcal{L}_{\text{HiSCE}}^{(l)} = -\sum_{c=1}^{C_l} \tilde y^{(l)}_c \log p^{(l)}_c,$$

and the total HiSCE loss is the sum over taxonomy levels:

$$\mathcal{L}_{\text{HiSCE}} = \sum_{l=1}^{L} \mathcal{L}_{\text{HiSCE}}^{(l)}.$$

Normalization is explicit: $(1-\epsilon_l) + |S(i)|\cdot\frac{\epsilon_l}{|S(i)|} = 1$ for each smoothed target distribution.
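Putting the pieces together, a compact sketch of the full objective could look as follows, assuming per-level logits $z^{(l)}$ (e.g. temperature-scaled image–text cosine similarities) and reusing `smoothing_matrix` from the sketch above; names, shapes, and the toy taxonomy are illustrative.

```python
# Illustrative end-to-end HiSCE computation (reuses smoothing_matrix from above).
import torch
import torch.nn.functional as F

def hisce_loss(logits_per_level, labels_per_level, T_per_level):
    """logits_per_level[l]: (batch, C_l); labels_per_level[l]: (batch,); T_per_level[l]: (C_l, C_l)."""
    total = 0.0
    for z, y, T in zip(logits_per_level, labels_per_level, T_per_level):
        log_p = F.log_softmax(z, dim=-1)                      # log p^(l)
        y_tilde = T[y]                                        # smoothed targets: rows T^(l)_{y,:}
        total = total - (y_tilde * log_p).sum(dim=-1).mean()  # per-level smoothed cross-entropy
    return total

# Toy two-level taxonomy: 2 coarse classes under one root, 5 fine classes with parents [0, 0, 0, 1, 1].
T_coarse = smoothing_matrix([0, 0], eps=0.1)
T_fine = smoothing_matrix([0, 0, 0, 1, 1], eps=0.1)
logits = [torch.randn(4, 2), torch.randn(4, 5)]
labels = [torch.tensor([0, 1, 0, 1]), torch.tensor([2, 4, 1, 3])]
print(hisce_loss(logits, labels, [T_coarse, T_fine]))
```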

2. Motivation and Rationale

Standard cross-entropy penalizes any deviation from the ground-truth class equally, driving the model to concentrate probability mass exclusively on that class. In hierarchical classification this induces overconfidence, particularly among fine-grained sibling categories, and frequently yields taxonomically inconsistent predictions such as selecting a leaf whose parent node is mismatched. By redistributing a small probability mass $\epsilon_l$ uniformly over siblings, HiSCE softens intra-level decision boundaries (“horizontal” smoothing), reflecting the semantic and visual proximity of sibling classes. The effect is to penalize “close” sibling misclassifications less severely than predictions outside the correct branch, directly reducing tree-structure inconsistency metrics while also enabling more robust uncertainty modeling for ambiguous visual inputs.

3. Integration with Hierarchy-Aware Fine-Tuning Frameworks

HiSCE is integrated into a multi-term fine-tuning objective, working synergistically with both standard cross-entropy at the leaf level ($\mathcal{L}_{\text{CE}}$) and a Tree-Path KL divergence ($\mathcal{L}_{\text{TP-KL}}$), which enforces pathwise vertical coherence:

$$\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{CE}} + \lambda_1\,\mathcal{L}_{\text{TP-KL}} + \lambda_2\,\mathcal{L}_{\text{HiSCE}}$$
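As a minimal sketch, the combined objective reduces to a weighted sum of the three scalar loss terms; the internals of the TP-KL term are not reconstructed here, and the default weights simply mirror the hyperparameter discussion in Section 5.

```python
# Hedged sketch: each term is assumed to be a scalar loss tensor computed elsewhere.
import torch

def total_loss(ce: torch.Tensor, tp_kl: torch.Tensor, hisce: torch.Tensor,
               lam1: float = 1.0, lam2: float = 1.0) -> torch.Tensor:
    # L_total = L_CE + lambda_1 * L_TP-KL + lambda_2 * L_HiSCE
    return ce + lam1 * tp_kl + lam2 * hisce
```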

All computations occur in the shared embedding space of frozen CLIP backbones adapted with LoRA modules. Only the low-rank LoRA matrices ($\sim$4.4M parameters) and layer norm weights are updated, achieving computational efficiency and low resource requirements for adaptation. HiSCE’s design ensures compatibility with lightweight parameter update protocols prevalent in scalable model fine-tuning.
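For illustration, a LoRA adaptation of a CLIP backbone along these lines could be set up with Hugging Face `peft`; the model ID, rank, and target modules below are assumptions for the sketch rather than the authors' reported configuration.

```python
# Hypothetical LoRA setup on a frozen CLIP backbone; hyperparameters are illustrative.
from transformers import CLIPModel
from peft import LoraConfig, get_peft_model

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
lora_cfg = LoraConfig(
    r=8,                                   # low-rank dimension (assumed)
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # attention projections in both encoders
)
model = get_peft_model(model, lora_cfg)    # base weights stay frozen; LoRA matrices are trainable

# Additionally unfreeze layer-norm weights, as described above.
for name, param in model.named_parameters():
    if "layernorm" in name.lower().replace("_", ""):
        param.requires_grad = True

model.print_trainable_parameters()         # only LoRA matrices and layer norms are updated
```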

4. Empirical Evaluation and Observed Gains

Empirical studies across four benchmark datasets (CUB-200-2011, FGVC-Aircraft, Butterfly-200, and ChestX-ray14) demonstrate that replacing or augmenting standard cross-entropy with HiSCE ($\epsilon_l = 0.1$, $\lambda_2 = 1.0$) yields substantial improvements in Full-Path Accuracy (FPA) and reductions in Tree-based Inconsistency Error (TICE). For example, on CUB-200-2011, FPA improves from 50.2 to 63.1 and TICE decreases from 21.9 to 10.8; comparable gains are observed across all reported benchmarks. Joint ablation with the TP-KL loss shows that the combination offers the best overall hierarchy-aware performance (Li et al., 25 Dec 2025), suggesting that both vertical and horizontal consistency are essential for taxonomy adaptation. Representative results are summarized below.

| Dataset | Metric | CE Baseline | HiSCE | CE+HiSCE+TP-KL (best) |
|---|---|---|---|---|
| CUB-200-2011 | FPA | 50.2 | 63.1 | [see article Table 9] |
| CUB-200-2011 | TICE | 21.9 | 10.8 | [see article Table 9] |
| FGVC-Aircraft | FPA | 38.3 | 57.0 | [see article Table 9] |
| FGVC-Aircraft | TICE | 17.9 | 11.7 | [see article Table 9] |

A plausible implication is that HiSCE directly mitigates taxonomic path inconsistencies associated with naïve cross-entropy objectives.
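The two metrics can be reconstructed for illustration: FPA is read here as the fraction of examples whose entire predicted path matches the ground truth, and TICE as the fraction of predictions whose path is not a valid parent–child chain in the taxonomy. Both readings are assumptions, so the exact definitions should be taken from the paper.

```python
# Hypothetical reconstructions of Full-Path Accuracy (FPA) and Tree-based
# Inconsistency Error (TICE); the definitions below are assumptions.
import numpy as np

def full_path_accuracy(preds_per_level, labels_per_level):
    """preds/labels: lists of (N,) int arrays, one per taxonomy level (coarse to fine)."""
    correct = np.ones_like(labels_per_level[0], dtype=bool)
    for p, y in zip(preds_per_level, labels_per_level):
        correct &= (p == y)                 # a path counts only if every level is right
    return correct.mean()

def tree_inconsistency_error(preds_per_level, parents_per_level):
    """parents_per_level[l][c] = parent class (at level l-1) of class c at level l, for l >= 1."""
    inconsistent = np.zeros_like(preds_per_level[0], dtype=bool)
    for l in range(1, len(preds_per_level)):
        expected_parent = parents_per_level[l][preds_per_level[l]]
        inconsistent |= (expected_parent != preds_per_level[l - 1])
    return inconsistent.mean()
```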

5. Hyperparameterization and Practical Considerations

HiSCE requires selection of a smoothing factor $\epsilon_l$ per taxonomy level, with recommended values in $[0.05, 0.2]$ and a default of $\epsilon_l = 0.1$. The loss weight $\lambda_2$ should be set to 1.0 or tuned on a small validation split for optimal performance. Balancing $\lambda_2$ against $\lambda_1$ (the TP-KL weight) can be done with a few Optuna trials; empirical observations indicate that a 1:1 ratio yields stable improvements across datasets. The methodology is robust to typical hyperparameter choices, suggesting broad applicability in hierarchical domains.
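A hedged sketch of the validation-based tuning described above, using Optuna; `train_and_validate` is a placeholder for the user's own fine-tuning routine, and the search ranges are assumptions.

```python
# Hypothetical Optuna search over eps, lambda_1, lambda_2.
import optuna

def train_and_validate(eps, lam1, lam2):
    # Placeholder: fine-tune with these hyperparameters and return validation
    # Full-Path Accuracy. A constant keeps the sketch runnable; replace it.
    return 0.0

def objective(trial):
    eps = trial.suggest_float("eps", 0.05, 0.2)      # recommended range from above
    lam1 = trial.suggest_float("lambda1", 0.5, 2.0)  # TP-KL weight
    lam2 = trial.suggest_float("lambda2", 0.5, 2.0)  # HiSCE weight
    return train_and_validate(eps, lam1, lam2)

study = optuna.create_study(direction="maximize")    # maximize validation FPA
study.optimize(objective, n_trials=10)               # "a few Optuna trials"
print(study.best_params)
```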

6. Taxonomy-Aware Label Smoothing: Implications and Significance

HiSCE generalizes conventional label smoothing by redistributing probability mass only among sibling classes at each level, rather than uniformly across all classes. This taxonomy-aware formulation ensures the fine-tuning process respects the target label structure, yielding more robust and consistent hierarchical predictions while pairing naturally with vertical coherence objectives (e.g., TP-KL). The approach is compatible with large-scale vision-language model fine-tuning using LoRA, offering scalability and efficiency. These characteristics position HiSCE as a practical choice for structured prediction in hierarchical classification tasks where path-consistent output and efficient adaptation are critical requirements.

References

Li et al. “Hierarchy-Aware Fine-Tuning of Vision-LLMs,” 25 December 2025.
