Hierarchy-Aware Losses in Classification

Updated 4 May 2026

Hierarchy-aware losses are loss functions that incorporate the structure of class hierarchies to assign penalty weights based on semantic proximity.
They are applied in various domains such as deep classification, metric learning, and multi-label prediction to improve coarse-level accuracy and generalization.
Implementation strategies include modified cross-entropy, triplet losses, and rank-based methods that align predictions with taxonomic relationships.

Hierarchy-aware losses are loss functions designed to exploit label taxonomies by explicitly using the structure of class hierarchies during supervised learning. Such formulations contrast with traditional “flat” losses, which treat all misclassifications as equally severe, regardless of the semantic proximity of classes. Hierarchy-aware losses are prevalent in deep classification, metric learning, vision-language modeling, multi-label prediction, and sequential modeling, including scenarios where classes are organized as trees, DAGs, or ultrametric spaces. Incorporating hierarchy in the loss typically leads to models that produce semantically safer mistakes, improved coarse-class accuracy, and in some cases better generalization and robustness under data scarcity or label imbalance.

1. Taxonomic Motivation and Loss Design Principles

Hierarchy-aware losses operationalize the intuition that errors between closely related classes (i.e., those sharing recent ancestors in a taxonomy) should be penalized less than errors involving distant branches. Canonical examples are found in biological taxonomies, document classification (e.g., DMOZ, RCV1), and fine-grained recognition (e.g., CIFAR-100, iNaturalist). Broadly, these losses assign penalties based on the shortest path, LCA (Lowest Common Ancestor) depth, or ultrametric distance in the hierarchy, thereby respecting semantic proximity.
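The distance notions mentioned above can be made concrete with a small sketch. The taxonomy below is a hypothetical toy example (the labels and the `PARENT` map are assumptions for illustration, not taken from any cited dataset), and a node is treated as its own ancestor.

```python
# Toy taxonomy as a child -> parent map (hypothetical labels); the root has no entry.
PARENT = {
    "bulldog": "dog", "poodle": "dog", "tabby": "cat",
    "dog": "animal", "cat": "animal",
    "sedan": "car", "car": "vehicle",
    "animal": "entity", "vehicle": "entity",
}

def path_to_root(node):
    """Return [node, parent, grandparent, ..., root]."""
    path = [node]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path

def lca(a, b):
    """Lowest common ancestor of a and b (a node counts as its own ancestor)."""
    ancestors_of_a = set(path_to_root(a))
    for node in path_to_root(b):
        if node in ancestors_of_a:
            return node
    raise ValueError("nodes do not share a root")

def tree_distance(a, b):
    """Number of edges on the shortest path between a and b in the tree."""
    common = lca(a, b)
    return path_to_root(a).index(common) + path_to_root(b).index(common)
```

Under this toy map, `tree_distance("bulldog", "poodle")` is 2 while `tree_distance("bulldog", "sedan")` is 6, which is the kind of asymmetry a hierarchy-aware penalty is meant to reflect.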

Major design principles include:

Hierarchy-respecting error sensitivity: Loss severity reflects class proximity within the tree/DAG (e.g., bulldog → dog vs. bulldog → car) (Wu et al., 2017, Urbani et al., 2024).
Consistency across levels: The loss propagates error signals not just at the leaf node (finest class), but also along the root-to-leaf path, enforcing model consistency between superclass and subclass predictions (Urbani et al., 2024, Garg et al., 2022).
Proper scoring rule properties: Some constructions are proper scoring rules, ensuring that minimizing expected loss leads to Bayes-optimal class posteriors at all hierarchy levels (Urbani et al., 2024).
Applicability across paradigms: Hierarchy-aware objectives extend beyond multi-class classification to multi-label BCE (Vaswani et al., 2022), contrastive embedding learning (Tian et al., 2025, Nolasco et al., 2021), and RNN sequence models (Mujika et al., 2019).

2. Mathematical Formulations of Hierarchy-Aware Losses

Multiple hierarchy-aware loss families can be identified:

Hierarchical Cross-Entropy and Proper-Scoring-Rule Losses

Given a tree $T=(V,E)$ over $K$ fine classes and superclass weights $w_j$, the hierarchical loss accumulates a weighted log-loss term at each ancestor node of the true label:

$L_T[f,y] = -\sum_{j \in a(y)} w_j \log \left( \sum_{k \in v_j} f_k \right)$

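where $f_k$ is the predicted probability of leaf class $k$, $a(y)$ is the set of ancestors of leaf $y$, and $v_j$ is the set of leaves descending from node $j$ (Urbani et al., 2024). This generalizes the standard cross-entropy, which is recovered when $a(y)$ includes the leaf itself and only the leaf term carries nonzero weight, and allows for other weight schemes (e.g., exponentially decaying with depth).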
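A minimal sketch of this loss on a toy tree may help. The node names, descendant sets, and weights below are illustrative assumptions, and each leaf is treated as a member of its own ancestor set so that the leaf term is the usual cross-entropy.

```python
import math

# Hypothetical tree over 4 leaves: each node maps to the leaf indices it covers (v_j).
NODES = {
    "root": {0, 1, 2, 3},
    "animal": {0, 1},
    "vehicle": {2, 3},
    "leaf0": {0}, "leaf1": {1}, "leaf2": {2}, "leaf3": {3},
}
# Illustrative weights w_j; the root term is always log(1) = 0, so its weight is moot.
WEIGHTS = {
    "root": 0.0, "animal": 0.5, "vehicle": 0.5,
    "leaf0": 1.0, "leaf1": 1.0, "leaf2": 1.0, "leaf3": 1.0,
}

def hierarchical_ce(probs, y, nodes=NODES, weights=WEIGHTS):
    """-sum over ancestors j of y of w_j * log(probability mass under j)."""
    loss = 0.0
    for name, leaves in nodes.items():
        if y in leaves:  # node `name` is an ancestor of y (or y itself)
            mass = sum(probs[k] for k in leaves)
            loss -= weights[name] * math.log(mass)
    return loss

# Same flat cross-entropy (-log 0.1), but different hierarchical loss:
near_miss = hierarchical_ce([0.1, 0.7, 0.1, 0.1], y=0)  # mass stays under "animal"
far_miss = hierarchical_ce([0.1, 0.1, 0.7, 0.1], y=0)   # mass moves to "vehicle"
```

Here `near_miss` is smaller than `far_miss`, because the superclass term rewards probability mass that stays within the correct branch even when the leaf prediction is wrong.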

Jensen–Shannon and Geometric Consistency Losses

For each hierarchy level $\ell$, let $g^\ell$ be a classifier for that level, with level $\ell-1$ the next coarser level. Cross-level consistency is enforced via the Jensen–Shannon divergence between the coarse classifier output $g^{\ell-1}(x)$ and soft targets $\hat{p}^{\ell-1}(x)$ obtained by marginalizing the fine-level predictions $g^{\ell}(x)$:

$L_{\mathrm{JS}}^{\ell} = \mathrm{JSD}\left( g^{\ell-1}(x) \,\|\, \hat{p}^{\ell-1}(x) \right)$

where $\mathrm{JSD}$ is the symmetric JSD and $\hat{p}^{\ell-1}(x)$ is built from fine-level predictions (Garg et al., 2022). Additionally, a geometric consistency loss constrains classifier weights so that child class weight vectors align with their parent's prototype direction.
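The consistency term can be sketched in a few lines. This is an illustrative stand-alone version, not the cited implementation: the fine-to-coarse map `FINE_TO_COARSE` and the function names are assumptions, and the inputs are plain probability lists rather than network outputs.

```python
import math

# Hypothetical mapping: fine class index -> coarse class index.
FINE_TO_COARSE = [0, 0, 1, 1]

def marginalize(fine_probs, fine_to_coarse, n_coarse):
    """Sum fine-level probabilities into their coarse parents (soft targets)."""
    coarse = [0.0] * n_coarse
    for k, p in enumerate(fine_probs):
        coarse[fine_to_coarse[k]] += p
    return coarse

def kl(p, q):
    """KL(p || q) in nats; terms with p_i = 0 contribute zero."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jsd(p, q):
    """Symmetric Jensen-Shannon divergence, bounded above by log(2)."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

fine_pred = [0.6, 0.2, 0.1, 0.1]   # output of the fine-level head
coarse_pred = [0.5, 0.5]           # output of the coarse-level head
soft_target = marginalize(fine_pred, FINE_TO_COARSE, n_coarse=2)
consistency_loss = jsd(coarse_pred, soft_target)
```

The loss is zero exactly when the coarse head agrees with the marginalized fine head, which is the consistency condition described above.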

Hierarchical Penalty in Metric and Embedding Learning

Per-level classification heads and triplet losses: Hybrid loss functions combine per-node binary heads, per-level softmax heads, and generalized triplet losses, such that semantic closeness in the hierarchy translates to proximity in embedding space (Tian et al., 2025).
Rank-based losses: Embedding distances are aligned to the hierarchy by requiring that all pairs of representations obey a prescribed rank order reflecting tree-distance (Nolasco et al., 2021).
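The rank-ordering idea can be illustrated with a simplified penalty. This is a sketch of the general principle only, not the exact loss of Nolasco et al.: for each anchor, whenever one item is closer in the tree than another, any excess of its embedding distance over the other's is penalized, with no margin term. The function names and the use of Euclidean distance are assumptions.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def rank_violation_penalty(embeddings, tree_dist):
    """Mean violation over triples (anchor, i, j) with tree_dist[a][i] < tree_dist[a][j].

    A triple is violated when the embedding of i is farther from the anchor
    than the embedding of j, although i is closer in the hierarchy.
    """
    n = len(embeddings)
    total, count = 0.0, 0
    for a in range(n):
        for i in range(n):
            for j in range(n):
                if a == i or a == j or i == j:
                    continue
                if tree_dist[a][i] < tree_dist[a][j]:
                    d_near = euclidean(embeddings[a], embeddings[i])
                    d_far = euclidean(embeddings[a], embeddings[j])
                    total += max(0.0, d_near - d_far)
                    count += 1
    return total / count if count else 0.0

# Items 0 and 1 are siblings (tree distance 2); item 2 is in another branch (distance 4).
TREE_DIST = [[0, 2, 4], [2, 0, 4], [4, 4, 0]]
ordered = rank_violation_penalty([[0, 0], [1, 0], [5, 0]], TREE_DIST)
scrambled = rank_violation_penalty([[0, 0], [5, 0], [1, 0]], TREE_DIST)
```

The embedding that respects the tree ordering incurs no penalty, while the scrambled one does; a trainable version would apply the same comparison to differentiable distances.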

Severity-Weighted Losses for Multi-label Prediction

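CHAMP introduces a penalty modulation term $s_i$ proportional to the tree-distance from a predicted label $i$ to the nearest (or all) ground-truth labels, scaling the binary cross-entropy:

$L_{\mathrm{CHAMP}} = -\sum_{i} (1 + s_i) \left[ y_i \log p_i + (1 - y_i) \log (1 - p_i) \right]$

where $y_i \in \{0, 1\}$ indicates whether label $i$ belongs to the ground-truth set $Y$ and $p_i$ is its predicted probability, with $s_i$ increasing with distance to $Y$ (Vaswani et al., 2022).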
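A severity-weighted BCE of this kind might look as follows. The sketch makes several assumptions that are not confirmed by the text above: severity is the distance to the nearest ground-truth label normalized by the largest tree distance, the scaling factor is `1 + beta * s_i`, and `beta` is a hypothetical strength parameter. It should be read as an illustration of the idea rather than the exact CHAMP formulation.

```python
import math

def severity(label, truth, tree_dist, max_dist):
    """Normalized distance from `label` to the nearest ground-truth label (0 if correct)."""
    return min(tree_dist[label][t] for t in truth) / max_dist

def severity_weighted_bce(probs, truth, tree_dist, beta=1.0):
    """Binary cross-entropy per label, scaled by (1 + beta * severity).

    probs: predicted probability per label, each strictly between 0 and 1.
    truth: set of ground-truth label indices.
    """
    max_dist = max(max(row) for row in tree_dist)
    loss = 0.0
    for i, p in enumerate(probs):
        y = 1.0 if i in truth else 0.0
        bce = -(y * math.log(p) + (1.0 - y) * math.log(1.0 - p))
        loss += (1.0 + beta * severity(i, truth, tree_dist, max_dist)) * bce
    return loss

# Labels 0 and 1 are siblings; labels 2 and 3 are siblings in another branch.
TREE_DIST = [[0, 2, 4, 4], [2, 0, 4, 4], [4, 4, 0, 2], [4, 4, 2, 0]]
near_fp = severity_weighted_bce([0.9, 0.8, 0.1, 0.1], {0}, TREE_DIST)  # false positive on sibling
far_fp = severity_weighted_bce([0.9, 0.1, 0.8, 0.1], {0}, TREE_DIST)   # false positive far away
```

With `beta=0` the two cases have the same loss (plain BCE); with `beta=1` the distant false positive costs more than the sibling one.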

Hierarchy-Aware Neural Collapse and Frame-Based Losses

By embedding the hierarchy in a similarity matrix $S$ and constructing a hierarchy-aware frame (HAFrame) with vectors $\phi_1, \dots, \phi_K$, the auxiliary cosine-similarity loss aligns the feature $h$ of a class-$y$ example with the desired geometry:

$L_{\cos}(h, y) = \frac{1}{K} \sum_{k=1}^{K} \left| \cos(h, \phi_k) - S_{y,k} \right|$

which is combined with cross-entropy under a mixing factor $\alpha$ (Liang et al., 2023).

3. Practical Implementation Strategies

Hierarchical loss integration varies by application:

Softmax-based classifiers: Hierarchical objectives can be implemented with a sparse “aggregation matrix” that sums leaf-level class probabilities into superclass probabilities; this adds minimal computational overhead (~1–2%) (Urbani et al., 2024).
Multilevel classifier architecture: Models may include one classifier per hierarchy level with shared backbone; only the finest-level head is used at test time (Garg et al., 2022).
Multi-label scenarios: Compute and cache tree-distances between all classes; adjust per-class loss contributions via penalty scaling (Vaswani et al., 2022).
Vision-language and embedding models: Horizontal (intra-level) smoothing and vertical (path) KL-divergence terms keep predictions coherent across the taxonomy, even under parameter-efficient adaptation (e.g., LoRA) (Li et al., 2025).
Memory and computational constraints: In sequence models, locally computable losses on each hierarchical level permit dramatic memory savings by decoupling cross-level backpropagation (Mujika et al., 2019).
Batch construction for contrastive/rank losses: For batch-based rank loss, ensure that every pairwise tree-distance value (rank) is represented in each batch to stabilize and speed up convergence (Nolasco et al., 2021).
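The batch-construction point can be checked mechanically. The helper below is a hypothetical utility (the names `missing_ranks` and `REQUIRED` are assumptions) that reports which tree-distance values are absent from the pairs available in a batch, so that a sampler can reject or repair the batch.

```python
from itertools import combinations

def missing_ranks(batch_labels, tree_dist, required_ranks):
    """Return the required tree-distance values not realized by any pair in the batch."""
    present = {tree_dist[a][b] for a, b in combinations(batch_labels, 2)}
    return set(required_ranks) - present

# Labels 0 and 1 are siblings; labels 2 and 3 are siblings in another branch.
TREE_DIST = [[0, 2, 4, 4], [2, 0, 4, 4], [4, 4, 0, 2], [4, 4, 2, 0]]
REQUIRED = {0, 2, 4}  # same class, sibling classes, different branches

complete = missing_ranks([0, 0, 1, 2], TREE_DIST, REQUIRED)
incomplete = missing_ranks([0, 1, 1], TREE_DIST, REQUIRED)
```

The first batch covers every rank, whereas the second lacks any cross-branch pair, so `incomplete` contains the distance 4.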

4. Empirical Effects and Evaluation Metrics

The principal empirical effect of hierarchy-aware losses is a reduction in “mistake severity,” typically quantified as mean LCA height, mean tree distance, or other surrogates for semantic error cost. Hierarchical losses can maintain or even improve standard top-1 performance while shifting the error distribution toward milder, semantically plausible confusions:

Dataset	Flat CE Top-1	Hier-Aware Top-1	Flat Severity	Hier-Aware Severity	Reference
CIFAR-100	22.27%	22.27–22.31%	2.35	2.23–2.24	(Garg et al., 2022)
iNat2019 (6/img/cls)	37.2%	38.4%	0.438	0.391	(Urbani et al., 2024)
FGVC-Aircraft	80.49%	81.0%	2.15	2.02	(Liang et al., 2023)

Additionally, metrics evaluating the structure of the learned embedding (mean normalized rank, NDCG, silhouette at multiple tree depths) improve when hierarchy-aware losses are used, for both seen and unseen classes (Tian et al., 2025, Nolasco et al., 2021). In hierarchical multi-label classification (HMC), AUPRC, hierarchical precision, and robustness to label tails and corruptions improve under severity-weighted penalties (Vaswani et al., 2022). In sequence modeling, hierarchy-aware local losses recover the learning performance of full cross-level backpropagation at a fraction of the memory cost (Mujika et al., 2019).
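Mean LCA height, the severity measure named at the start of this section, can be computed as sketched below. The taxonomy is a hypothetical toy example with all leaves at the same depth, and the LCA height is taken as the number of edges from the true label up to the lowest common ancestor, averaged over misclassified examples only; other conventions exist.

```python
# Toy taxonomy as a child -> parent map (hypothetical labels); the root has no entry.
PARENT = {
    "bulldog": "dog", "poodle": "dog", "tabby": "cat",
    "dog": "animal", "cat": "animal",
    "sedan": "car", "car": "vehicle",
    "animal": "entity", "vehicle": "entity",
}

def path_to_root(node):
    """Return [node, parent, grandparent, ..., root]."""
    path = [node]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path

def lca_height(true_label, pred_label):
    """Edges climbed from the true label to reach an ancestor shared with the prediction."""
    pred_ancestors = set(path_to_root(pred_label))
    for steps, node in enumerate(path_to_root(true_label)):
        if node in pred_ancestors:
            return steps
    raise ValueError("labels do not share a root")

def mistake_severity(y_true, y_pred):
    """Mean LCA height over misclassified examples (0.0 if there are no mistakes)."""
    heights = [lca_height(t, p) for t, p in zip(y_true, y_pred) if t != p]
    return sum(heights) / len(heights) if heights else 0.0

severity_score = mistake_severity(
    ["bulldog", "bulldog", "tabby"],
    ["poodle", "sedan", "tabby"],
)
```

In this example the two mistakes have LCA heights 1 (bulldog vs. poodle) and 3 (bulldog vs. sedan), giving a mean severity of 2.0; the correct prediction is excluded from the average.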

5. Theoretical Guarantees and Optimization Properties

Some hierarchy-aware losses are proven proper scoring rules, such as the superclass-aware cross-entropy (Urbani et al., 2024). This implies that the expected loss is minimized by the Bayes-optimal conditional posteriors at every level of the hierarchy. Other constructions (e.g., class-based curriculum loss) are shown to be the tightest upper bound on the hierarchical 0-1 error relative to the base loss (Goyal et al., 2020). Neural collapse phenomena are induced by hierarchy-aware frame construction, ensuring collapse to frames with pairwise similarities encoding the tree (Liang et al., 2023). Marginless rank-based losses introduce no additional hyperparameters and are agnostic to the hierarchy’s depth or incompleteness (Nolasco et al., 2021).

However, purely “raw” hierarchical losses (e.g., ultrametric win) can be degenerate under SGD: minimizing CE drives down the hierarchical loss almost as efficiently as optimizing it directly, unless the optimization is modified to tie together parameters of semantically close leaves (Wu et al., 2017).
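The proper-scoring-rule claim for the superclass-aware cross-entropy of Section 2 can be motivated by a short derivation. The sketch assumes, beyond what is stated above, that $a(y)$ includes the leaf $y$ itself, that the nodes at each depth $\ell$ partition the leaves, and that the weights depend only on depth, $w_j = w_{\ell(j)} \geq 0$. Write $p$ for the true class posterior, $F_j = \sum_{k \in v_j} f_k$ and $P_j = \sum_{k \in v_j} p_k$. Since $j \in a(y)$ exactly when $y \in v_j$, exchanging the order of summation gives

$\mathbb{E}_{y \sim p}\, L_T[f,y] = -\sum_{y} p_y \sum_{j \in a(y)} w_j \log F_j = -\sum_{j \in V} w_j P_j \log F_j = \sum_{\ell} w_\ell \, H\left(P^{\ell}, F^{\ell}\right)$

where $H(P^{\ell}, F^{\ell}) = -\sum_{j : \ell(j) = \ell} P_j \log F_j$ is the cross-entropy between the true and predicted marginals at depth $\ell$. By Gibbs' inequality each term is minimized when $F^{\ell} = P^{\ell}$, and all of these conditions hold simultaneously for $f = p$; the minimizer is unique when the leaf-level weight is strictly positive.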

6. Comparative Analyses, Limitations, and Future Investigations

Hierarchy-aware losses outperform flat counterparts in settings with scarce data, fine-grained or skewed taxonomies, or pronounced labeling cost asymmetries. However, in regimes with abundant and balanced samples, the advantage diminishes; powerful deep networks tend to recover taxonomies implicitly (Urbani et al., 2024, Wu et al., 2017). Some methods (e.g., hierarchical log-loss) can trade off fine accuracy for coarse accuracy unless jointly optimized with standard objectives (Wu et al., 2017). Hierarchy-aware losses based solely on the label topology do not inherently address open-set or incomplete labeling, although rank-based and marginless constructions generalize more gracefully (Nolasco et al., 2021).

Open directions identified include:

Efficient sampling and mining of hierarchical triplets/quadruplets (Nolasco et al., 2021, Tian et al., 2025).
Incorporating hierarchical curriculum scheduling to accelerate convergence (Goyal et al., 2020).
Hybrid architectures mixing parameter tying, multi-task heads, and global distance-based losses.
Adapting hierarchy-aware loss constructions to hyperbolic or non-Euclidean embeddings.

7. Notable Variants and Application Domains

Hierarchy-aware losses have been deployed in numerous domains:

Image and audio retrieval: Hierarchical proxy-based and hybrid triplet/per-level losses organize embedding spaces for retrieval and transfer (Yang et al., 2021, Tian et al., 2025).
Vision-language models: Simultaneous vertical and horizontal taxonomy alignment using KL and cross-entropy variants (Li et al., 2025).
Multi-label classification: Severity-adjusted BCE and modular class weights for label trees (Vaswani et al., 2022).
Hierarchical sequence learning: Locally computable losses in HRNNs for scalable long-sequence modeling (Mujika et al., 2019).

The field continues to explore broader forms of taxonomic structure, loss–model co-design, and task- and domain-adaptive hierarchy-aware loss functions.