Human-Aware Loss Functions (HALOs)
- Human-Aware Loss Functions are a class of loss functions that integrate human-derived constraints and perceptual cues into training, enhancing model generalization and interpretability.
- They incorporate semantic, perceptual, and saliency-based regularizations to align model predictions with human context, particularly improving performance in data-scarce regimes.
- HALOs are applied in diverse domains including human activity recognition, image restoration, and saliency modeling, enabling context-aware decision-making and state-of-the-art results.
Human-Aware Loss Functions (HALOs) are a class of loss functions that incorporate explicit knowledge about human behaviors, perceptual abilities, or context-dependent constraints directly into the training objectives of neural networks. Established across multiple domains—including human activity recognition, perceptual image restoration, and human-annotated saliency modeling—HALOs are designed to align learning with high-level symbolic rules, human perception, or human annotation without invoking external reasoning modules at inference time. This differentiable guidance enables deep models to generalize better in data-scarce regimes, improve interpretability, and achieve state-of-the-art performance in context-critical tasks.
1. Taxonomy and Definitions of Human-Aware Losses
HALOs encompass a family of loss functions constructed to inject human-derived constraints or knowledge into neural network optimization. Three principal paradigms emerge in contemporary literature:
- Context-Aware Semantic Losses: Encourage predictions that satisfy ontological rules relating human activities and context (Arrotta et al., 2023).
- Human-Perceptual Losses: Penalize outputs according to perceptual similarity metrics modeled after the human visual system, such as SSIM and its variants (Zhao et al., 2015).
- Human Saliency Alignment Losses: Regularize models to produce attention or saliency maps concordant with explicit human annotation, e.g., through crowd-sourced "explain your decision" saliency (Boyd et al., 2021).
A canonical HALO is integrated additively with a standard task loss (such as cross-entropy), with a hyperparameter modulating the strength of the human-aware term.
2. Formalization and Variants of HALOs
Several specific HALO formulations have been proposed, each instantiating the principle of human-grounded regularization via different mathematical constructs:
Semantic Loss Functions for Context-Consistency (Arrotta et al., 2023)
Let $p(x)$ denote the output probability vector over classes for input $x$, and $C$ the set of activities consistent with the high-level context $c$ under an ontology $O$. The following semantic losses guide the network to favor contextually legal activities:
- AllConsistentActs (All):
Encourages a high total probability mass over the context-consistent (allowed) classes.
- MinusProb-Prob (–PP):
- Zero-One (01):
- MinusProb-One (–P1) and Zero-Prob (0P): Variants with different penalization of confidence or misalignment.
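Semantic losses are integrated as:
$$\mathcal{L} = \mathcal{L}_{\mathrm{CE}}(y) + \alpha\,\mathcal{L}_{\mathrm{sem}}$$
where $\mathcal{L}_{\mathrm{CE}}$ is the standard cross-entropy, $y$ the ground-truth label, $\mathcal{L}_{\mathrm{sem}}$ one of the semantic losses listed above, and $\alpha$ a regularization weight.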
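The additive combination can be sketched in plain Python. The exact form of each semantic loss is not given in this text, so the sketch assumes one plausible instance of the AllConsistentActs idea: the negative log of the probability mass placed on context-consistent classes. The function names (`all_consistent_loss`, `halo_loss`), the example logits, and the consistent set are all hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, label):
    """Standard cross-entropy for a single example with an integer label."""
    return -math.log(probs[label])

def all_consistent_loss(probs, consistent, eps=1e-12):
    """ASSUMED form of a semantic loss: -log of the probability mass
    assigned to the context-consistent classes (not taken from the paper)."""
    mass = sum(probs[a] for a in consistent)
    return -math.log(mass + eps)

def halo_loss(logits, label, consistent, alpha=1.0):
    """Task loss plus alpha times the human-aware (semantic) term."""
    probs = softmax(logits)
    return cross_entropy(probs, label) + alpha * all_consistent_loss(probs, consistent)

# Hypothetical example: 4 activity classes, only classes 0 and 1 are
# consistent with the current context.
logits = [2.0, 0.5, 1.5, -1.0]
consistent = {0, 1}
print("alpha=0:", halo_loss(logits, 0, consistent, alpha=0.0))
print("alpha=7:", halo_loss(logits, 0, consistent, alpha=7.0))
```

With `alpha=0` the objective reduces to plain cross-entropy; larger values penalize probability mass that leaks onto classes the context rules out.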
Human Saliency Alignment Losses (Boyd et al., 2021)
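In image classification, HALOs may take the form:
$$\mathcal{L} = \mathcal{L}_{\mathrm{CE}} + \lambda\,\mathcal{L}_{\mathrm{sal}}$$
$\mathcal{L}_{\mathrm{sal}} = \lVert S_{\mathrm{human}} - S_{\mathrm{model}} \rVert_2^2$, where $S_{\mathrm{human}}$ is the human-provided saliency map, and $S_{\mathrm{model}}$ is extracted from the model (e.g., CAM). $\lambda$ (or equivalently, a trade-off parameter $\alpha$) determines the weighting.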
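A saliency-alignment term of this kind can be sketched as follows. The sketch assumes a mean-squared-error comparison between min-max-normalized maps; the normalization step, the helper names (`normalize`, `saliency_mse`, `saliency_halo`), and the toy 2x2 maps are illustrative assumptions rather than details from Boyd et al.

```python
def normalize(sal_map):
    """Min-max scale a 2D map (list of rows) to [0, 1].
    A constant map is mapped to all zeros."""
    flat = [v for row in sal_map for v in row]
    lo, hi = min(flat), max(flat)
    if hi == lo:
        return [[0.0 for _ in row] for row in sal_map]
    return [[(v - lo) / (hi - lo) for v in row] for row in sal_map]

def saliency_mse(human_map, model_map):
    """Mean squared error between normalized human and model saliency maps."""
    h, m = normalize(human_map), normalize(model_map)
    n = sum(len(row) for row in h)
    sq = sum((hv - mv) ** 2 for hr, mr in zip(h, m) for hv, mv in zip(hr, mr))
    return sq / n

def saliency_halo(ce_loss, human_map, model_map, lam=1.0):
    """Classification loss plus lam times the saliency-alignment term."""
    return ce_loss + lam * saliency_mse(human_map, model_map)

# Hypothetical maps: annotators marked the right-hand column as salient.
human = [[0.0, 1.0], [0.0, 1.0]]
cam_aligned = [[0.1, 0.9], [0.2, 0.8]]     # model attends to the same region
cam_misaligned = [[0.9, 0.1], [0.8, 0.2]]  # model attends to the opposite region

print("aligned:   ", saliency_halo(0.3, human, cam_aligned))
print("misaligned:", saliency_halo(0.3, human, cam_misaligned))
```

For the same classification loss, the model whose activation map disagrees with the human annotation receives the larger total loss.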
Perceptual Losses for Image Restoration (Zhao et al., 2015)
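For low-level vision tasks, the "Mix" loss combines multi-scale SSIM and pixelwise $\ell_1$:
$$\mathcal{L}^{\mathrm{Mix}} = \alpha\,\mathcal{L}^{\mathrm{MS\text{-}SSIM}} + (1-\alpha)\,G_{\sigma_G^M}\cdot\mathcal{L}^{\ell_1}$$
Here, $\mathcal{L}^{\mathrm{MS\text{-}SSIM}}$ is the multi-scale structural similarity loss, $G_{\sigma_G^M}$ is a Gaussian kernel at the coarsest scale, and $\alpha$ balances structural and absolute fidelity.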
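The structure of a Mix-style loss can be illustrated with a deliberately simplified sketch. It substitutes a single-scale SSIM computed from global statistics for MS-SSIM and omits the Gaussian weighting of the absolute-error term, so it is not the loss of Zhao et al.; it only shows how a structural term and an absolute-error term are blended. The function names are hypothetical, and the default `alpha=0.84` is an assumed setting that is not stated in this text.

```python
def mean(values):
    return sum(values) / len(values)

def ssim_global(x, y, data_range=1.0):
    """Single-scale SSIM from global mean, variance, and covariance.
    Simplification: real SSIM uses local (windowed) statistics."""
    c1 = (0.01 * data_range) ** 2
    c2 = (0.03 * data_range) ** 2
    mx, my = mean(x), mean(y)
    vx = mean([(a - mx) ** 2 for a in x])
    vy = mean([(b - my) ** 2 for b in y])
    cov = mean([(a - mx) * (b - my) for a, b in zip(x, y)])
    num = (2 * mx * my + c1) * (2 * cov + c2)
    den = (mx * mx + my * my + c1) * (vx + vy + c2)
    return num / den

def l1_loss(x, y):
    """Mean absolute error between two flattened images."""
    return mean([abs(a - b) for a, b in zip(x, y)])

def mix_loss(x, y, alpha=0.84):
    """alpha * (1 - SSIM) + (1 - alpha) * L1, without Gaussian weighting."""
    return alpha * (1.0 - ssim_global(x, y)) + (1.0 - alpha) * l1_loss(x, y)

# Hypothetical flattened patches with intensities in [0, 1].
target = [0.1, 0.4, 0.7, 0.9, 0.3, 0.6]
restored = [0.2, 0.3, 0.8, 0.7, 0.4, 0.5]
print("mix loss:", mix_loss(restored, target))
print("identical:", mix_loss(target, target))
```

Setting `alpha=1.0` recovers the purely structural term and `alpha=0.0` the purely pixelwise term, which makes the role of the balancing weight easy to probe.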
3. Construction and Differentiability from Symbolic or Human-Centric Knowledge
HALOs depend on mapping symbolic predicates or human data sources into differentiable loss terms:
- Ontology-Based Context Constraints (Arrotta et al., 2023): Activity classes carry symbolic context restrictions. At each training step, a reasoner computes which activities are context-consistent, and the semantic loss penalizes violations. The penalty is constructed to be differentiable almost everywhere, so it can be used with backpropagation.
- Saliency Alignment (Boyd et al., 2021): Human saliency maps are aggregated and processed into smooth, real-valued matrices, and compared to model CAMs using an $\ell_2$ loss, which is differentiable with respect to model parameters.
- Perceptual Similarity (Zhao et al., 2015): SSIM and MS-SSIM depend on local statistics (mean, variance, covariance) of image patches and are differentiable via the chain rule with respect to pixel outputs, enabling end-to-end gradient-based optimization.
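To make the differentiability point concrete, the sketch below checks an analytic gradient against central finite differences for one assumed semantic loss, the negative log of the probability mass on context-consistent classes. The set of consistent classes is treated as a fixed mask supplied by the reasoner, so only the softmax outputs carry gradient. The loss form and all names are illustrative assumptions.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def semantic_loss(logits, consistent):
    """ASSUMED loss: -log(sum of softmax probabilities over consistent classes).
    Requires a non-empty consistent set so the mass is strictly positive."""
    probs = softmax(logits)
    return -math.log(sum(probs[a] for a in consistent))

def analytic_grad(logits, consistent):
    """d loss / d logit_i = p_i - [i in consistent] * p_i / mass."""
    probs = softmax(logits)
    mass = sum(probs[a] for a in consistent)
    return [p - (p / mass if i in consistent else 0.0) for i, p in enumerate(probs)]

def numeric_grad(logits, consistent, h=1e-6):
    """Central finite differences, one logit at a time."""
    grads = []
    for i in range(len(logits)):
        up, down = list(logits), list(logits)
        up[i] += h
        down[i] -= h
        grads.append((semantic_loss(up, consistent) - semantic_loss(down, consistent)) / (2 * h))
    return grads

logits = [1.2, -0.3, 0.8, 0.1]
consistent = {1, 3}  # hypothetical reasoner output, held fixed
for a, n in zip(analytic_grad(logits, consistent), numeric_grad(logits, consistent)):
    print(f"analytic={a:+.6f}  numeric={n:+.6f}")
```

The gradient is negative for consistent classes and positive for inconsistent ones, so a gradient step shifts probability mass toward the activities the context allows.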
4. Network Architectures, Datasets, and Optimization Protocols
HALOs have been deployed in diverse neural architectures and evaluated on both synthetic and real-world datasets:
- Context-Aware HAR (Arrotta et al., 2023):
- Input: Smartphone and smartwatch inertial streams, high-level context.
- Architecture: Parallel 1D convolutions for inertial signals; context one-hot processed by dense layer; feature concatenation followed by dropout and dense layers with softmax.
- Datasets: DOMINO (scripted, 25 users, 14 activities, contexts) and ExtraSensory (in-the-wild).
- Training: Adam optimizer, batch size 32, cross-validation per user.
- Image Classification with Saliency Supervision (Boyd et al., 2021):
- Models: DenseNet-121, ResNet-50, Inception-v3, Xception.
- Input saliency maps: Collected via crowd-sourcing and preprocessed.
- Training: SGD with momentum, cross-entropy and human saliency loss, batch size 32, early stopping.
- Perceptual Image Restoration (Zhao et al., 2015):
- Fully convolutional network, e.g., for denoising + demosaicking, super-resolution.
- Loss computed on central pixels or patches using Mix (MS-SSIM + $\ell_1$).
- Datasets: Images for denoising, super-resolution, JPEG artifact removal.
5. Empirical Results, Efficiency, and Comparative Analysis
Models trained with HALOs are reported to improve over their baselines across tasks and settings, particularly under data scarcity or domain shift.
Quantitative Results
| Domain | Baseline | HALO/Best | Δ Metric | Dataset | Loss Variant |
|---|---|---|---|---|---|
| HAR (DOMINO, 100%) | F1=0.90 | F1=0.93 | +0.026 | 25 users, 14 activities | Semantic (–P1, α=7) |
| HAR (ExtraSensory, 10%) | F1=0.52 | F1=0.59 | +0.067 | 31 users, 7 activities | AllConsistentActs, α=30 |
| Face Detection (ResNet) | AUC=0.55 | AUC=0.67 | +0.12 | 600k synthetic faces | CYBORG (α=0.5) |
| Image Restoration (Mix) | n/a | PSNR/SSIM/MS-SSIM ↑ | Highest across all | Denoise, SISR, Deblock | Mix (MS-SSIM+$\ell_1$) |
- In (Arrotta et al., 2023), semantic loss