Hierarchical Cross-Entropy Loss

Updated 13 March 2026
  • Hierarchical cross-entropy loss is a loss function that leverages tree-structured taxonomies to weight classification errors by semantic distance.
  • It employs techniques like path factorization and per-level aggregation to integrate label granularity and address class imbalance.
  • Empirical studies show enhanced robustness, reduced semantic error penalties, and improved sample efficiency in multi-level tasks.

A hierarchical cross-entropy loss is any loss function designed to train classifiers when labels are organized in a tree (or DAG), expressing semantic relationships between classes at varying granularity. In contrast to standard (“flat”) cross-entropy, which penalizes all misclassifications equally, hierarchical variants leverage the label taxonomy to reflect the relative severity of errors, exploit labels at different granularity, correct for class imbalance, and regularize predictions according to tree structure. Contemporary formulations range from path-based negative log-likelihood (factorized or per-level) to cost-matrix modulated extensions, optimal-transport risks, and proper-scoring multi-level aggregations, each with distinct trade-offs in sample efficiency, semantic error-shaping, and computational overhead.

1. Taxonomy Modeling and General Principles

Hierarchical cross-entropy losses assume a rooted tree or DAG (e.g., G=(V, E)), with internal nodes representing classes at coarser scales and leaves at finer scales. Each sample x can carry a label at any node, and the tree defines a semantic distance metric over labels.

Two organizational paradigms dominate:

  • Path factorization: Probabilities are decomposed as products of conditional probabilities along unique paths from the root to each node, with sibling-group softmaxes at each split.
  • Per-level aggregation: Each level of the tree acts as a multi-class (softmax) or multi-label (sigmoid) prediction; probabilities for parents are obtained by set-unions (sums or inclusion–exclusion) over children (Villar et al., 2023, Shkodrani et al., 2021).

Hierarchical cross-entropy losses use these structures to:

  • Exploit labels at arbitrary depths (not merely leaves)
  • Respect semantic similarity (nearby errors penalized less harshly)
  • Encode class imbalance via per-node weights
  • Regularize outputs at both fine and coarse levels
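As a concrete illustration, the two paradigms above operate on different views of the same tree. The following sketch (pure Python, with a hypothetical three-class taxonomy) stores the tree as parent pointers and derives sibling groups (one softmax group each, for path factorization) and per-depth label sets (for per-level aggregation):

```python
from collections import defaultdict

# Hypothetical taxonomy: root -> {animal, vehicle}; animal -> {cat, dog}; vehicle -> {car}
parent = {"animal": "root", "vehicle": "root",
          "cat": "animal", "dog": "animal", "car": "vehicle"}

# Sibling groups: one softmax per group in path-factorized losses
siblings = defaultdict(list)
for node, par in parent.items():
    siblings[par].append(node)

# Depth of each node (root = 0), used for per-level aggregation
def depth(node):
    return 0 if node == "root" else 1 + depth(parent[node])

levels = defaultdict(list)
for node in parent:
    levels[depth(node)].append(node)

print(dict(siblings))  # {'root': ['animal', 'vehicle'], 'animal': ['cat', 'dog'], 'vehicle': ['car']}
print(dict(levels))    # {1: ['animal', 'vehicle'], 2: ['cat', 'dog', 'car']}
```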

2. Formal Definitions and Core Variants

Several rigorous hierarchical cross-entropy constructions are established:

For a sample with path c^{(H)} \to c^{(H-1)} \to \cdots \to c^{(0)} from root to leaf, define

\mathcal{L}_{\rm WHXE} = -\sum_{\ell=0}^{H-1} W(c^{(\ell)})\,\lambda(c^{(\ell)})\,\log p(c^{(\ell)} \mid c^{(\ell+1)}).

  • W(c): class-imbalance weight, e.g., N_\text{all}/(N_\text{labels}\, N(c))
  • \lambda(c): depth-dependent emphasis, such as \exp(-\alpha\, h(c))

The forward pass computes a softmax over each group of siblings; the loss sums the negative log-likelihood at each true-path node with reweighting.
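A minimal sketch of this weighted path loss (pure Python; the class counts, path, and conditional probabilities are hypothetical, and the depth convention for h(c) is one plausible reading of the formula):

```python
import math

def whxe_loss(path, cond_probs, counts, alpha=0.5):
    """Weighted hierarchical cross-entropy along one root-to-leaf path.

    path:       nodes from just below the root down to the leaf
    cond_probs: {node: p(node | parent)} from per-sibling-group softmaxes
    counts:     {node: N(c)} training-label counts for the imbalance weight
    alpha:      rate in the depth emphasis lambda(c) = exp(-alpha * h(c))
    """
    n_all = sum(counts.values())
    n_labels = len(counts)
    loss = 0.0
    for h, c in enumerate(path):            # h: depth below the root
        W = n_all / (n_labels * counts[c])  # class-imbalance weight W(c)
        lam = math.exp(-alpha * h)          # depth-dependent emphasis
        loss -= W * lam * math.log(cond_probs[c])
    return loss

# Toy example: path root -> animal -> cat
probs = {"animal": 0.7, "cat": 0.6}
counts = {"animal": 50, "vehicle": 50, "cat": 30, "dog": 20, "car": 50}
print(round(whxe_loss(["animal", "cat"], probs, counts), 4))
```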

For a tree with L levels, aggregate per-level probabilities:

  • Softmax: \hat{p}(C^{l+1}_i) = \sum_{k} \hat{p}(C^l_k), summing over the children of node i
  • Sigmoid: inclusion–exclusion expansion for non-exclusive children

Total loss:

\mathcal{L}_\text{hier} = \sum_{l=1}^L w_l\,\mathcal{L}_{\mathrm{CE}}(\hat{y}^l, y^l)

with user-set weights w_l to emphasize fine or coarse levels.
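A minimal two-level sketch of this aggregation (pure Python; the taxonomy and leaf probabilities are hypothetical), summing softmax leaf probabilities into parents and taking a weighted sum of per-level cross-entropies:

```python
import math

# Hypothetical two-level taxonomy: leaves and their parents
parent = {"cat": "animal", "dog": "animal", "car": "vehicle"}

def hierarchical_ce(leaf_probs, true_leaf, level_weights=(0.5, 0.5)):
    """L_hier = w_fine * CE(leaves) + w_coarse * CE(parents)."""
    # Fine level: standard cross-entropy on the leaf distribution
    fine_ce = -math.log(leaf_probs[true_leaf])
    # Coarse level: a parent's probability is the sum over its children
    parent_probs = {}
    for leaf, p in leaf_probs.items():
        parent_probs[parent[leaf]] = parent_probs.get(parent[leaf], 0.0) + p
    coarse_ce = -math.log(parent_probs[parent[true_leaf]])
    w_fine, w_coarse = level_weights
    return w_fine * fine_ce + w_coarse * coarse_ce

leaf_probs = {"cat": 0.5, "dog": 0.3, "car": 0.2}  # softmax output over leaves
print(round(hierarchical_ce(leaf_probs, "cat"), 4))
```

Predicting "dog" mass instead of "car" mass lowers the coarse term here, which is exactly the semantic error-shaping the per-level weights control.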

For a sample with leaf label y, the superclass-aggregation loss of Urbani et al. (2024) is

L_T(f(x), y) = -\sum_{j \in \mathrm{anc}(y)} w_j \log \biggl( \sum_{k \in v_j} f_k(x) \biggr)

where v_j is a subtree (superclass) and the weights w_j sum to 1/2 along any ancestral path. This loss is minimized if and only if predictions recover the true conditional distribution (a proper scoring rule), simultaneously encouraging fine and coarse accuracy.

If A is a cost matrix encoding class–class semantic distances,

L_{CE+B} = (1-\alpha)\,L_{CE} + \alpha\, y^T A \hat{y}

or

L_{CE+LB} = (1-\alpha)\,L_{CE} - \alpha\, y^T A \log(1-\hat{y})

with entries a_{ij} derived from distances in the hierarchy tree and \alpha a trade-off parameter. This penalizes confusions in proportion to semantic distance (e.g., intra- vs. inter-superclass).
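A sketch of the bilinear variant L_{CE+B} (pure Python; the 3-class cost matrix and predicted distribution are hypothetical, with a_{ij} standing in for tree distances):

```python
import math

def ce_plus_bilinear(y_onehot, y_hat, A, alpha=0.3):
    """L_{CE+B} = (1 - alpha) * CE + alpha * y^T A y_hat."""
    ce = -sum(t * math.log(p) for t, p in zip(y_onehot, y_hat) if t > 0)
    # y^T A y_hat: expected semantic cost of the predicted distribution
    bilinear = sum(t * a_ij * p
                   for t, row in zip(y_onehot, A)
                   for a_ij, p in zip(row, y_hat))
    return (1 - alpha) * ce + alpha * bilinear

# Hypothetical costs: cat/dog share a superclass (cost 1), car is distant (cost 2)
A = [[0, 1, 2],
     [1, 0, 2],
     [2, 2, 0]]
y = [1, 0, 0]             # true class: cat
y_hat = [0.6, 0.3, 0.1]   # predicted distribution over (cat, dog, car)
print(round(ce_plus_bilinear(y, y_hat, A), 4))
```

Shifting predicted mass from "dog" to "car" raises only the bilinear term, illustrating how the cost matrix shapes which confusions are expensive.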

3. Algorithmic Implementation and Engineering

Implementation depends on structure and reweighting approach:

  • Sibling softmax: Construct separate softmaxes for each group of siblings in the tree (Villar et al., 2023).
  • Level aggregation: Use sparse mapping matrices to propagate probabilities upward from leaves to all ancestors (Urbani et al., 2024, Shkodrani et al., 2021).
  • Batch computation: For each mini-batch, compute losses at all levels; apply class-imbalance and depth weights if present.
  • Gradient flow: Hierarchical losses are differentiable through all conditional and aggregation steps, ensuring compatibility with standard backpropagation.

Per-sample or mini-batch loss can be written as

\mathcal{L}_\text{batch} = \frac{1}{B} \sum_{j=1}^B \Bigl[ -\sum_{\ell=0}^{H-1} W(c^{(\ell)}_j)\,\lambda(c^{(\ell)}_j)\, \log p_j(c^{(\ell)}_j \mid c^{(\ell+1)}_j) \Bigr]

or the multi-level sum as appropriate.

Typical additional complexity is O(K \log K + B \log K) per batch (hierarchical aggregation), a modest increase over flat CE (Urbani et al., 2024).
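The level-aggregation step above can be sketched with an explicit leaf-to-ancestor indicator matrix (pure Python; a small dense matrix stands in for the sparse mapping, and the taxonomy is hypothetical):

```python
# Hypothetical taxonomy: leaves (cat, dog, car); internal nodes (animal, vehicle)
leaves = ["cat", "dog", "car"]
internal = ["animal", "vehicle"]
# M[i][j] = 1 if leaf j lies in the subtree of internal node i
M = [[1, 1, 0],   # animal covers cat, dog
     [0, 0, 1]]   # vehicle covers car

def propagate(leaf_probs):
    """Internal-node probabilities as sums of descendant-leaf probabilities."""
    return {node: sum(m * p for m, p in zip(row, leaf_probs))
            for node, row in zip(internal, M)}

print(propagate([0.5, 0.25, 0.25]))  # {'animal': 0.75, 'vehicle': 0.25}
```

Because the mapping is a fixed linear operator, gradients flow through it unchanged, which is why these losses remain compatible with standard backpropagation.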

4. Empirical Performance and Use Cases

Hierarchical cross-entropy losses have demonstrated advantages in diverse settings:

  • Astrophysical transient classification: WHXE achieves macro- and micro-F1 comparable to fine-tuned flat baselines, but retains 100% of examples (vs. 30–60% loss in flat-CE pipelines lacking fine labels), improves coarse-level F1, and unifies model engineering (Villar et al., 2023).
  • Fine-grained recognition: Hierarchical CE improves top-1 error and reduces “coarse” semantic mistakes on ImageNet, COCO, CIFAR-100, and iNaturalist; gains are most pronounced in scarce-data regimes (Urbani et al., 2024, Resheff et al., 2017, Chen et al., 2019).
  • Class-imbalanced domains: Hierarchical frameworks, particularly when extended to hybrid Dice/focal losses, offer improved sample efficiency and recall for rare classes (Yeung et al., 2021).
  • Object detection: Multi-level hierarchical CE or focal losses propagate accuracy gains across both leaf- and high-level targets (Shkodrani et al., 2021).

A summary of empirical findings:

| Dataset | Flat CE Error | Hier. CE Error | Coarse Error | Notable Features |
|-------------------|--------------|----------------|--------------|-------------------------------|
| CIFAR-100 | 37.4% | 36.9% | 24.0% | 4–5% more errors kept in super-class (Resheff et al., 2017) |
| ZTF Transients | Comparable | Comparable | +0.04 macro-F1 parent | Full label usage (Villar et al., 2023) |
| iNaturalist’19 | 46.6% (CE) | 44.5% (HXE) | N/A | Hier. error consistently lower (Shkodrani et al., 2021) |
| COCO (Det.) | 37.3 mAP | 37.44 mAP | +0.37 mAP @ top | Detector-level improvements (Shkodrani et al., 2021) |

5. Class Imbalance, Depth Weighting, and Semantic Distances

Hierarchical cross-entropy losses often include mechanisms to adjust for two key factors:

  • Class imbalance: Rare classes are up-weighted, for example by W(c) = N_\text{all} / (N_\text{labels}\, N(c)) (Villar et al., 2023), or via inverse-frequency weighting at each softmax/sigmoid head (Tian et al., 22 Jan 2025).
  • Depth weighting: Coarse errors can be up-weighted (via \lambda(c) = \exp(-\alpha\, h(c))) or controlled per level via user-set w_\ell (Villar et al., 2023, Shkodrani et al., 2021).
  • Semantic distances: Cost matrices A are set by, e.g., hierarchy depth, tree-induced error, or ultrametric structure; loss terms penalize errors proportionally (Resheff et al., 2017, Ge et al., 2021, Wu et al., 2017).

Variants such as optimal-transport DOT risks (Ge et al., 2021) and the ultrametric loss (Wu et al., 2017) explicitly leverage these distances to control how confusions across distant branches are penalized.
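For instance, a tree-induced cost matrix A can be built from lowest-common-ancestor heights (pure Python; the taxonomy is hypothetical, and height-of-LCA is one of several distance choices mentioned above):

```python
# Hypothetical taxonomy stored as parent pointers
parent = {"animal": "root", "vehicle": "root",
          "cat": "animal", "dog": "animal", "car": "vehicle"}

def ancestors(node):
    """Path from a node up to the root, inclusive."""
    path = [node]
    while node in parent:
        node = parent[node]
        path.append(node)
    return path

def lca_height(u, v, tree_height=2):
    """Semantic distance between leaves: height of their lowest common ancestor."""
    if u == v:
        return 0
    anc_u = ancestors(u)
    lca = next(a for a in ancestors(v) if a in anc_u)
    # Height of the LCA = total tree height minus its depth below the root
    return tree_height - (len(ancestors(lca)) - 1)

leaves = ["cat", "dog", "car"]
A = [[lca_height(u, v) for v in leaves] for u in leaves]
print(A)  # cat-dog confusion is cheaper than cat-car
```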

6. Comparison to Flat Cross-Entropy and Limitations

Standard cross-entropy is indifferent to structure among incorrect classes; it only rewards probability on the one-hot target. In contrast, hierarchical losses:

  • Utilize all available labeling granularity (internal nodes and leaves)
  • Propagate gradient signals at all tree levels
  • Penalize errors more heavily when they traverse high-level splits (semantically dissimilar classes)
  • Improve sample efficiency, especially under partial labeling and in low-data regimes

Limitations identified in recent studies:

  • For abundant data and rich models, the advantage of hierarchical cross-entropy can diminish; flat CE often drives down hierarchical loss almost as efficiently (Wu et al., 2017).
  • Over-weighting coarse levels (or rare classes) can harm fine-level accuracy if hyperparameters are not carefully set (Resheff et al., 2017, Villar et al., 2023).
  • Requires explicit, trusted hierarchy; performance depends on tree quality and weighting choices (Urbani et al., 2024).
  • Cost-matrix approaches can become memory-intensive for large class sets (O(k^2) for k leaves) (Resheff et al., 2017).

7. Extensions, Applications, and Proper Scoring

Recent developments extend hierarchical cross-entropy into new problem domains:

  • Multi-task architectures: Hybrid losses combining per-level CE, binary hierarchically-encoded sigmoid losses, and feature-space triplet objectives improve classification, embedding regularity, and generalization to unseen classes (Tian et al., 22 Jan 2025).
  • Detection and segmentation: Hierarchical focal loss enables set-theoretic parent aggregation for multi-label detection architectures (Shkodrani et al., 2021), and hybrid (Dice + CE) frameworks for segmentation (Yeung et al., 2021).
  • Proper-scoring rules: Ensuring the hierarchical loss is minimized only by the true conditional posterior guarantees statistical consistency at every tree level and eliminates hand-tuned trade-offs between granularities (Urbani et al., 2024).

Potential applications include fine-grained taxonomy recognition, risk-aware prediction (penalizing severe confusions), few-shot/imputation scenarios, and domains where partial supervision or imbalanced data is prevalent, such as time-domain astrophysics, biodiversity inventories, and multi-scale medical diagnostics.


References:

  • "Hierarchical Cross-entropy Loss for Classification of Astrophysical Transients" (Villar et al., 2023)
  • "United We Learn Better: Harvesting Learning Improvements From Class Hierarchies Across Tasks" (Shkodrani et al., 2021)
  • "Harnessing Superclasses for Learning from Hierarchical Databases" (Urbani et al., 2024)
  • "Every Untrue Label is Untrue in its Own Way: Controlling Error Type with the Log Bilinear Loss" (Resheff et al., 2017)
  • "Embedding Semantic Hierarchy in Discrete Optimal Transport for Risk Minimization" (Ge et al., 2021)
  • "Hybrid Losses for Hierarchical Embedding Learning" (Tian et al., 22 Jan 2025)
  • "Learning with Hierarchical Complement Objective" (Chen et al., 2019)
  • "A hierarchical loss and its problems when classifying non-hierarchically" (Wu et al., 2017)
  • "Unified Focal loss: Generalising Dice and cross entropy-based losses to handle class imbalanced medical image segmentation" (Yeung et al., 2021)
