Hierarchical Inclusion Loss in ML
- Hierarchical inclusion loss is a family of objective functions that encode tree-structured relations by aggregating prediction probabilities across multiple hierarchical levels.
- It enforces semantic consistency by weighting ancestor nodes, ensuring that improvements at finer levels enhance performance at coarser levels without adversarial tradeoffs.
- It has demonstrated empirical benefits in large-scale, imbalanced datasets, notably improving coarse-level accuracy and performance in domains like multi-label prediction and medical imaging.
Hierarchical inclusion loss encompasses a family of objective functions designed to encode and exploit tree-structured or taxonomic relations among output classes in machine learning models. By construction, these losses enforce both semantic consistency and explicit aggregation of prediction probabilities or penalties across all hierarchical levels, in contrast to flat classification objectives that ignore inter-class dependencies. The paradigm is central to large-scale classification, multi-label prediction, metric learning, and contrastive representation learning, with established theoretical properties and empirically verified performance benefits across domains.
1. Formal Definition and Essential Properties
Hierarchical inclusion loss refers to any loss function constructed such that model outputs are simultaneously scored at multiple levels of a known class hierarchy—typically a rooted tree $T$. Each internal node $a$ of $T$ (a superclass) encompasses a subset of the $K$ leaves (the fine-grained classes). For a given sample $x$ with ground-truth class $y$, the loss enforces high prediction mass over each ancestor node of $y$, weighted by per-level coefficients.
In the canonical instantiation for softmax outputs, as introduced by Bucher et al. (Urbani et al., 25 Nov 2024), the hierarchical inclusion loss is:
\[
\mathcal{L}_{\mathrm{incl}}(x, y) \;=\; -\sum_{a \in \mathcal{A}(y)} \lambda_a \,\log\!\Big(\sum_{k \in \mathrm{leaves}(a)} p_k(x)\Big),
\]
where $\mathcal{A}(y)$ is the ancestor set of node $y$, $\mathrm{leaves}(a)$ is the set of leaves under $a$, and $p_k(x)$ is the predicted softmax probability for leaf $k$. The weights $\lambda_a$ are chosen so that their sum over any root-to-leaf path is constant, ensuring the loss is a proper scoring rule minimized only by the true posterior distribution.
The key structural property is inclusion consistency: improvement at a fine level (leaf) necessarily improves the aggregate score at all ancestors (superclasses); there is no adversarial tradeoff between levels.
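A minimal numpy sketch of the formulation above on a toy two-level hierarchy may make this concrete; the data structures (ancestor lists, leaf-cover sets, per-node weights) and the example values are illustrative assumptions, not the reference implementation of (Urbani et al., 25 Nov 2024).

```python
import numpy as np

def hierarchical_inclusion_loss(leaf_probs, y, ancestors, leaves_under, weights):
    """Score a single sample at every level of the hierarchy.

    leaf_probs   : (K,) softmax probabilities over the K leaf classes
    y            : index of the ground-truth leaf class
    ancestors    : dict mapping each leaf to the list of its ancestor node ids
                   (including the leaf itself, excluding the root)
    leaves_under : dict mapping each node id to the set of leaf indices it covers
    weights      : dict mapping each node id to its per-level coefficient lambda_a
    """
    loss = 0.0
    for a in ancestors[y]:
        # probability mass the model assigns to the whole superclass a
        mass = leaf_probs[list(leaves_under[a])].sum()
        loss -= weights[a] * np.log(mass + 1e-12)
    return loss

# Toy hierarchy: root -> {animal: {0 cat, 1 dog}, vehicle: {2 car, 3 bus}}
ancestors = {0: ["animal", 0], 1: ["animal", 1], 2: ["vehicle", 2], 3: ["vehicle", 3]}
leaves_under = {"animal": {0, 1}, "vehicle": {2, 3}, 0: {0}, 1: {1}, 2: {2}, 3: {3}}
weights = {"animal": 0.5, "vehicle": 0.5, 0: 0.5, 1: 0.5, 2: 0.5, 3: 0.5}  # constant path sum = 1

p = np.array([0.6, 0.25, 0.1, 0.05])
print(hierarchical_inclusion_loss(p, y=0, ancestors=ancestors,
                                  leaves_under=leaves_under, weights=weights))
```

Because the superclass mass is the sum of its leaves' probabilities, pushing probability toward the correct leaf can only increase the term for every ancestor, which is the inclusion-consistency property stated above.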
2. Relation to Other Hierarchical Objective Constructions
Hierarchical inclusion loss differs fundamentally from multi-level cross-entropy (which applies separate classification losses per hierarchy level) and from conditional decomposition losses such as the HXE variant. Its strict properness, proved in (Urbani et al., 25 Nov 2024), guarantees optimality at all granularities without the need for heuristic weighting, in contrast to methods that require explicit tuning between fine and coarse errors.
Wu et al. (Wu et al., 2017) introduced an "ultrametric" hierarchical loss designed to penalize mistakes by tree-distance, but their experiments reveal that, when optimized via standard SGD, hierarchical losses often provide no improvement over flat cross-entropy unless coarse-level accuracy (rather than fine-level) is prioritized or data is highly imbalanced.
Goyal & Ghosh (Goyal et al., 2020) formalized loss-based inclusion constraints, requiring that mistakes on finer (deeper) classes are penalized at least as much as mistakes on coarser classes; the resulting surrogates provably upper-bound the 0-1 loss as tightly as possible, provided the hierarchy constraint is satisfied.
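A compact restatement of this constraint may help; the notation below (levels $l$, ground-truth nodes $y^{(l)}$) is chosen here for illustration and is not the paper's exact formulation.

```latex
% Loss-based inclusion constraint (illustrative notation):
% hierarchy levels l = 1 (coarsest) ... L (finest), ground-truth node y^{(l)}
% at level l, model output \hat{p}, surrogate loss \ell:
\[
  \ell\big(\hat{p},\, y^{(l+1)}\big) \;\ge\; \ell\big(\hat{p},\, y^{(l)}\big)
  \qquad \text{for } l = 1, \dots, L-1,
\]
% i.e. a mistake at a finer level is penalized at least as much as the
% corresponding mistake at any coarser level, so a surrogate obeying the
% constraint upper-bounds the 0-1 loss at every granularity.
```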
3. Component Formulations and Backpropagation
Hierarchical inclusion loss incorporates summation masks or indicator matrices to aggregate predictions from children to parents. For softmax-based architectures, aggregation is additive over mutually exclusive children, $p_a = \sum_{c \in \mathcal{C}(a)} p_c$, while for multi-label/sigmoid outputs the parent probability is assembled from the child probabilities via inclusion–exclusion combinatorics, as developed in (Shkodrani et al., 2021). The total loss is a weighted sum across levels,
\[
\mathcal{L} \;=\; \sum_{\ell=1}^{L} w_\ell\, \mathcal{L}_\ell,
\]
where $\mathcal{L}_\ell$ may be a cross-entropy or focal/binary loss per node at level $\ell$.
Gradient computation proceeds by differentiating with respect to the child probabilities and propagating error signals through the aggregation matrix of each parent, with computational cost that scales linearly with the number of hierarchy edges.
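A minimal PyTorch sketch of additive parent aggregation through a 0/1 indicator matrix and a weighted per-level loss; the toy hierarchy, the matrix `M`, and the level weights are assumptions for illustration, and standard autodiff carries the backward pass through the aggregation exactly as described above.

```python
import torch
import torch.nn.functional as F

# Toy 2-level hierarchy: 4 leaves, 2 parents ({0,1} -> parent 0, {2,3} -> parent 1).
# M[p, k] = 1 iff leaf k is a child of parent p (stored sparsely in practice).
M = torch.tensor([[1., 1., 0., 0.],
                  [0., 0., 1., 1.]])

logits = torch.randn(8, 4, requires_grad=True)   # batch of 8 samples, 4 leaf classes
y_leaf = torch.randint(0, 4, (8,))               # fine-grained labels
y_parent = (y_leaf >= 2).long()                  # ancestry vector at the parent level

p_leaf = F.softmax(logits, dim=1)                # leaf-level predictions
p_parent = p_leaf @ M.T                          # additive aggregation (exclusive children)

w_fine, w_coarse = 1.0, 0.5                      # per-level weights (assumed values)
loss = (w_fine * F.nll_loss(torch.log(p_leaf + 1e-12), y_leaf)
        + w_coarse * F.nll_loss(torch.log(p_parent + 1e-12), y_parent))

loss.backward()                                  # gradients flow through M to the leaves
print(loss.item(), logits.grad.shape)
```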
4. Hierarchical Penalty and Constraint Enforcement
Several variants introduce explicit penalties for hierarchy violations, especially in multi-label or clinical settings. For instance, the HBCE loss (Asadi et al., 5 Feb 2025) in medical imaging augments the binary cross-entropy with an additional cost for parent–child inconsistencies: an indicator term, triggered when the parent is predicted negative while a child is predicted positive, contributes a penalty that is either fixed or data-driven, thereby enforcing the hierarchical logical relations in the output.
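A hedged sketch of a parent–child consistency penalty in this spirit; the function name, the 0.5 threshold, and the fixed `gamma` are assumptions for illustration, not the exact formulation of (Asadi et al., 5 Feb 2025).

```python
import torch

def hierarchy_violation_penalty(probs, parent_child_pairs, gamma=1.0, thresh=0.5):
    """Extra cost whenever a child is predicted positive while its parent is negative.

    probs              : (B, N) sigmoid outputs over all N hierarchy nodes
    parent_child_pairs : list of (parent_idx, child_idx) edges
    gamma              : penalty strength (fixed here; could be made data-driven)
    """
    penalty = probs.new_zeros(())
    for p_idx, c_idx in parent_child_pairs:
        # hard indicator of the inconsistent event: child on, parent off
        # (a differentiable relaxation, e.g. probs[:, c_idx] * (1 - probs[:, p_idx]),
        #  would be used if the penalty must contribute gradients)
        violation = (probs[:, c_idx] > thresh) & (probs[:, p_idx] <= thresh)
        penalty = penalty + gamma * violation.float().mean()
    return penalty

# Nodes: 0 = parent, 1 and 2 = its children; batch of two samples
probs = torch.tensor([[0.2, 0.9, 0.1],    # violation: child 1 positive, parent negative
                      [0.8, 0.7, 0.3]])   # consistent
bce = torch.nn.functional.binary_cross_entropy(probs, torch.ones_like(probs))
total = bce + hierarchy_violation_penalty(probs, [(0, 1), (0, 2)])
print(total.item())
```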
Contrastive variants (TaxCL (Kokilepersaud et al., 10 Jun 2024); (Zhang et al., 2022)) structure the InfoNCE denominator to assign greater weight to negatives that share higher taxonomy levels with the anchor, increasing the discriminative pressure along taxonomic splits and preserving the hierarchy in representation space.
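One way such a weighting can be realized is sketched below; the weighting scheme, `alpha`, and temperature are illustrative assumptions, not the exact objectives of the cited papers.

```python
import torch

def taxonomy_weighted_infonce(z, leaf_labels, coarse_labels, tau=0.1, alpha=2.0):
    """Supervised contrastive loss whose denominator up-weights negatives that
    share the anchor's superclass (illustrative, taxonomy-aware variant).

    z             : (B, D) L2-normalized embeddings
    leaf_labels   : (B,) fine-grained labels; samples sharing one are positives
    coarse_labels : (B,) superclass labels
    """
    B = z.size(0)
    sim = z @ z.T / tau                                     # pairwise similarities
    eye = torch.eye(B, dtype=torch.bool, device=z.device)

    pos = (leaf_labels[:, None] == leaf_labels[None, :]) & ~eye
    same_coarse = coarse_labels[:, None] == coarse_labels[None, :]
    neg = ~pos & ~eye

    # negatives inside the anchor's superclass count alpha times in the denominator
    w = torch.ones(B, B, device=z.device)
    w[neg & same_coarse] = alpha

    exp_sim = torch.exp(sim) * w * (~eye).float()
    log_prob = sim - torch.log(exp_sim.sum(dim=1, keepdim=True) + 1e-12)
    loss = -(log_prob * pos.float()).sum(dim=1) / pos.sum(dim=1).clamp(min=1)
    return loss.mean()

# Toy batch: 6 embeddings, 4 leaf classes grouped into 2 superclasses
z = torch.nn.functional.normalize(torch.randn(6, 16), dim=1)
leaf = torch.tensor([0, 0, 1, 2, 2, 3])
coarse = torch.tensor([0, 0, 0, 1, 1, 1])                  # leaves {0,1} -> 0, {2,3} -> 1
print(taxonomy_weighted_infonce(z, leaf, coarse).item())
```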
5. Curriculum and Adaptivity over Class Granularity
Hierarchical curriculum loss (Goyal et al., 2020) integrates the inclusion constraint with dynamic class selection, promoting focus on coarser classes early in training. A binary mask selects the classes engaged in the current epoch, and the objective is optimized only over the selected classes; classes with lower current loss are prioritized, implementing an implicit curriculum that aligns with hierarchy depth (see the sketch below).
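A schematic numpy sketch of this per-epoch class masking; the ranking by running per-class loss and the growing keep fraction are assumed details, not the exact schedule of (Goyal et al., 2020).

```python
import numpy as np

def select_classes(per_class_loss, keep_frac):
    """Binary mask over classes for the current epoch: keep the easiest
    (lowest running loss) classes first, gradually admitting harder,
    finer-grained ones as keep_frac grows over training."""
    k = max(1, int(keep_frac * len(per_class_loss)))
    keep = np.argsort(per_class_loss)[:k]          # classes with lowest running loss
    mask = np.zeros(len(per_class_loss), dtype=bool)
    mask[keep] = True
    return mask

# Example: 6 classes, running losses carried over from the previous epoch
running_loss = np.array([0.3, 1.2, 0.4, 2.0, 0.9, 0.5])
for epoch, frac in enumerate([0.3, 0.6, 1.0]):     # curriculum: widen the mask over time
    mask = select_classes(running_loss, frac)
    print(epoch, mask.astype(int))
```

Only samples whose class is currently selected would contribute to the training loss in that epoch.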
6. Implementation Protocols and Computational Efficiency
Hierarchical inclusion loss is implemented by:
- Precomputing aggregation masks for parents at each hierarchy level (sparse matrices).
- Computing forward predictions at the leaf level (softmax or sigmoid).
- Propagating upward via aggregation operators (sum or inclusion–exclusion).
- Evaluating per-level losses using ground-truth ancestry vectors.
- Applying backpropagation through aggregation steps; standard autodiff suffices.
The additional runtime cost versus flat cross-entropy is marginal in practice—typically a few percent increase in wall-clock time due to sparse matrix operations—provided the hierarchy depth and maximum fanout remain moderate (Urbani et al., 25 Nov 2024).
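The steps above can be wired together in a few lines; the sketch below, assuming a simple parent-pointer encoding of the tree and scipy sparse matrices as the storage format, covers only the mask precomputation and upward aggregation.

```python
import numpy as np
from scipy import sparse

def build_aggregation_mask(parent, n_leaves):
    """Precompute a sparse 0/1 matrix M with M[i, k] = 1 iff leaf k lies below
    the i-th internal node (splitting M per level follows the same pattern).

    parent   : dict mapping every node id to its parent id (root maps to None)
    n_leaves : number of leaf classes, assumed to occupy node ids 0..n_leaves-1
    """
    cover = {}                                   # node id -> set of leaf indices
    for leaf in range(n_leaves):
        node = parent[leaf]
        while node is not None:                  # walk each leaf up to the root
            cover.setdefault(node, set()).add(leaf)
            node = parent[node]

    nodes = sorted(cover)                        # internal node ids, in a fixed order
    rows, cols = [], []
    for r, node in enumerate(nodes):
        for leaf in cover[node]:
            rows.append(r)
            cols.append(leaf)
    M = sparse.csr_matrix((np.ones(len(rows)), (rows, cols)),
                          shape=(len(nodes), n_leaves))
    return nodes, M

# Toy tree: leaves 0..3, parents 4 and 5, root 6
parent = {0: 4, 1: 4, 2: 5, 3: 5, 4: 6, 5: 6, 6: None}
nodes, M = build_aggregation_mask(parent, n_leaves=4)
leaf_probs = np.array([0.6, 0.25, 0.1, 0.05])
print(nodes)                  # [4, 5, 6]
print(M @ leaf_probs)         # aggregated mass at nodes 4, 5 and the root
```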
7. Empirical Impact, Limitations, and Evaluation
Across benchmarks (ImageNet, TinyImageNet, iNaturalist, CheXpert, DeepFashion, COCO), hierarchical inclusion losses consistently reduce coarse-level error, improve mean accuracy at superclass granularity, and yield competitive or slightly better fine-level accuracy compared to flat objectives. Gains are most pronounced in low-data regimes or when hierarchical errors are disproportionately costly in the downstream application.
However, hierarchical inclusion losses may provide limited improvement, or even harm fine-class accuracy, when the hierarchy is ill-specified, data is abundant per class, or the optimization scheme does not fully exploit the tree structure (as highlighted in (Wu et al., 2017)). For evaluation, hierarchy-aware metrics (e.g., hierarchical win, lowest-common-ancestor height) remain recommended for reporting and benchmarking in hierarchical classification tasks.
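For concreteness, a small helper, assuming the same parent-pointer tree encoding as the earlier sketches, that computes a lowest-common-ancestor height between predicted and true leaves for reporting; the exact metric definitions in the cited works may differ.

```python
def ancestors_of(parent, node):
    """Path from node up to the root (inclusive), using a parent-pointer tree."""
    path = [node]
    while parent[node] is not None:
        node = parent[node]
        path.append(node)
    return path

def lca_height(parent, pred_leaf, true_leaf):
    """Steps from the true leaf up to the lowest common ancestor of the predicted
    and true leaves; 0 means a correct prediction, larger means a coarser mistake."""
    true_path = ancestors_of(parent, true_leaf)
    pred_anc = set(ancestors_of(parent, pred_leaf))
    for height, node in enumerate(true_path):
        if node in pred_anc:
            return height
    return len(true_path)   # disconnected nodes (cannot happen in a rooted tree)

# Same toy tree as before: leaves 0..3, parents 4 and 5, root 6
parent = {0: 4, 1: 4, 2: 5, 3: 5, 4: 6, 5: 6, 6: None}
print(lca_height(parent, pred_leaf=1, true_leaf=0))   # 1: siblings under node 4
print(lca_height(parent, pred_leaf=3, true_leaf=0))   # 2: only joined at the root
print(lca_height(parent, pred_leaf=0, true_leaf=0))   # 0: exact match
```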
| Paper | Domain | Loss Type | Main Benefit |
|---|---|---|---|
| (Urbani et al., 25 Nov 2024) | Classification | Inclusion CE | Coarse error reduction; proper scoring rule; efficiency |
| (Shkodrani et al., 2021) | Detection/Classification | Multi-level CE/FL | Cross-task applicability; multi-label support |
| (Asadi et al., 5 Feb 2025) | Medical | HBCE | Logical consistency; interpretability |
| (Zhang et al., 2022) | Contrastive | Hier. Contrastive | Embedding structure, clustering performance |
| (Goyal et al., 2020) | Image classification | Curriculum | Tightest upper bound on 0-1 loss; curriculum over granularity |
Hierarchical inclusion loss represents a theoretically robust and practically flexible approach to leveraging semantic and structural relationships in output spaces, with formal consistency guarantees and empirically demonstrated advantages in hierarchically organized domains.