
CDW-CE for Ordinal Classification

Updated 16 December 2025
  • CDW-CE is a loss function for ordinal classification that penalizes errors based on the numerical distance between classes.
  • It incorporates a hyperparameter α and an optional margin to control penalty severity and enhance inter-class separation.
  • Empirical evaluations show that CDW-CE improves quantitative metrics and clinical interpretability compared to standard cross-entropy.

Class Distance Weighted Cross-Entropy (CDW-CE) is a loss function for training deep neural networks to solve ordinal classification problems where class labels possess a natural ordering. Unlike standard categorical cross-entropy (CE), which penalizes all misclassifications equally regardless of their semantic proximity, CDW-CE explicitly modulates the penalty assigned to an incorrect prediction based on the ordinal distance between the true and predicted classes. This approach is particularly advantageous in domains such as medical imaging, where misclassifying a sample to an adjacent class is less egregious than confusing widely separated classes, as in disease severity grading (Polat et al., 2022, Polat et al., 2 Dec 2024).

1. Mathematical Formulation

Given $N$ ordered classes $(0, 1, \ldots, N-1)$, a sample with true label $c \in \{0, \ldots, N-1\}$, and predicted class probabilities $\hat{y} = (\hat{y}_0, \ldots, \hat{y}_{N-1})$ obtained via softmax, the standard cross-entropy loss is

$$\mathrm{CE} = -\log(\hat{y}_c).$$

Class Distance Weighted Cross-Entropy generalizes this by applying a polynomially growing penalty to mispredictions as a function of their class distance from the truth:
$$\mathrm{CDW\text{-}CE} = -\sum_{i=0}^{N-1} |i-c|^\alpha \log(1-\hat{y}_i),$$
where the sum omits $i = c$ in practice to avoid $\log(0)$.

  • $|i-c|$ is the absolute ordinal distance between class $i$ and the true class $c$.
  • $\alpha \geq 0$ is a hyperparameter controlling the strength of distance penalization.
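For intuition (illustrative numbers, not drawn from the cited papers): with $N = 4$, true class $c = 0$, and $\alpha = 2$, a prediction that places probability $0.8$ on the adjacent class $1$ contributes $-|1-0|^2 \log(1-0.8) \approx 1.61$ to the loss, whereas placing the same $0.8$ on the distant class $3$ contributes $-|3-0|^2 \log(1-0.8) \approx 14.5$, nine times as much; standard CE, which depends only on $\hat{y}_0$, would not distinguish the two errors.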

An extension of CDW-CE incorporates a margin $m > 0$ to enhance inter-class separation:
$$\mathrm{CDW\text{-}CE_{margin}} = -\sum_{i=0}^{N-1} |i-c|^\alpha \log\bigl(1 - \min(\hat{y}_i + m,\, 1)\bigr),$$
which encourages more compact class clusters in the embedding space (Polat et al., 2 Dec 2024).

The gradient

$$\frac{\partial L}{\partial \hat{y}_i} = \frac{|i-c|^\alpha}{1-\hat{y}_i}, \qquad i \neq c,$$

shows that the penalty grows both with distance and predicted probability for incorrect classes.
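This expression can be sanity-checked numerically; the following minimal sketch (illustrative values only) compares it against PyTorch autograd applied to the probability vector directly:

import torch

# Check dL/dy_i = |i - c|^alpha / (1 - y_i) on a toy probability vector.
alpha, c = 2.0, 1
y = torch.tensor([0.1, 0.6, 0.2, 0.1], requires_grad=True)  # N = 4 class probabilities
dist = (torch.arange(4, dtype=torch.float32) - c).abs().pow(alpha)
loss = -(dist * torch.log(1.0 - y)).sum()    # the i = c term carries zero weight
loss.backward()
manual = dist / (1.0 - y.detach())           # analytic gradient from the formula above
print(torch.allclose(y.grad, manual))        # True; the i = c entry is 0 in both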

2. Implementation and Training Procedure

CDW-CE is implemented as a drop-in replacement for cross-entropy in any softmax-based classifier. During training, all non-true class probabilities are gathered per sample, and their log-loss is weighted by the corresponding distance power.

A sample implementation in PyTorch:

import torch
import torch.nn.functional as F

def cdw_ce_loss(logits, targets, alpha=5, margin=0.0, eps=1e-9):
    # Class probabilities via softmax over the class dimension.
    probs = F.softmax(logits, dim=1)
    B, N = probs.shape
    # Distance weights |i - c|^alpha for every (sample, class) pair.
    class_indices = torch.arange(N, device=probs.device).unsqueeze(0)
    targets_i = targets.view(B, 1)
    distances = (class_indices - targets_i).abs().float().pow(alpha)
    # Optional margin: shift probabilities upward, clamped to 1.
    if margin > 0:
        probs = torch.clamp(probs + margin, max=1.0)
    # Exclude the true class (its distance weight is already zero).
    mask = torch.ones_like(probs)
    mask[torch.arange(B, device=probs.device), targets] = 0.0
    # Distance-weighted log-loss over non-true classes; eps guards log(0).
    loss = -(distances * torch.log(1.0 - probs + eps) * mask).sum(dim=1).mean()
    return loss
Integration into standard deep network training loops is direct and compatible with all typical regularization and data augmentation procedures (Polat et al., 2022, Polat et al., 2 Dec 2024).
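A minimal usage sketch (here `model`, `optimizer`, and `loader` are placeholders for the user's own components), with the loss substituted for `F.cross_entropy`:

for images, labels in loader:
    logits = model(images)                        # raw scores of shape (B, N)
    loss = cdw_ce_loss(logits, labels, alpha=5)   # in place of F.cross_entropy(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()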

3. Empirical Performance and Comparative Analysis

CDW-CE has been systematically evaluated on the LIMUC dataset of ulcerative colitis severity, containing 11,276 images labeled across four Mayo Endoscopic Score levels. Models trained with CDW-CE were compared against CE, squared error (MSE), and several state-of-the-art ordinal regression losses (CORN, CO2, HO2) using CNN backbones (ResNet-18, Inception-v3, MobileNet-v3-Large).

Backbone (QWK)         CE             MSE            CORN           CO2            HO2            CDW-CE
ResNet-18              0.8296±0.014   0.8540±0.007   0.8366±0.007   0.8394±0.009   0.8446±0.007   0.8568±0.010
Inception-v3           0.8360±0.011   0.8517±0.007   0.8431±0.009   0.8482±0.009   0.8458±0.010   0.8678±0.006
MobileNet-v3-Large     0.8302±0.015   0.8467±0.005   0.8412±0.010   0.8354±0.009   0.8378±0.007   0.8588±0.006
  • In binary remission classification (MES 0–1 vs. 2–3), CDW-CE also outperformed CE in Cohen’s Kappa and F1.
  • t-SNE visualization and Silhouette scores of embeddings indicate improved class separability (Silhouette: CE 0.121, CDW-CE 0.222), suggesting the loss effectively encourages more discriminative feature learning (Polat et al., 2 Dec 2024).

Confusion matrix analyses reveal that CDW-CE reduces errors between distant classes and localizes most remaining errors to adjacent classes.
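Metrics of this kind can be computed with standard tooling; a minimal sketch (here `y_true` and `y_pred` are placeholders for held-out labels and predicted classes):

from sklearn.metrics import cohen_kappa_score, confusion_matrix, mean_absolute_error

# Quadratic Weighted Kappa penalizes disagreement by squared class distance.
qwk = cohen_kappa_score(y_true, y_pred, weights="quadratic")
# MAE and the confusion matrix show how far off the remaining errors are.
mae = mean_absolute_error(y_true, y_pred)
cm = confusion_matrix(y_true, y_pred)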

4. Qualitative Evaluation: Latent Space and Interpretability

Class Activation Maps (CAM), visualized using the method of Zhou et al. (2016), consistently show that CDW-CE-trained models attend to broader, clinically significant tissue regions, whereas cross-entropy-trained models produce tighter and sometimes less clinically interpretable activations.

In structured expert evaluations using paired CAM overlays, domain experts noted that CDW-CE-trained attention maps often highlighted subtle pathological features consistent with clinical reasoning.
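A minimal sketch of the CAM computation described by Zhou et al. (2016), assuming a torchvision `resnet18`-style backbone whose final convolutional features feed a global-average-pooled linear classifier (layer names follow torchvision; the snippet is illustrative, not the authors' exact pipeline):

import torch
import torch.nn.functional as F
from torchvision.models import resnet18

model = resnet18(num_classes=4).eval()   # e.g. four Mayo Endoscopic Score levels

def class_activation_map(model, image, target_class):
    # Forward pass up to the last convolutional block: (B, 512, H, W) features.
    x = model.conv1(image)
    x = model.maxpool(model.relu(model.bn1(x)))
    x = model.layer3(model.layer2(model.layer1(x)))
    feats = model.layer4(x)
    # CAM: feature maps weighted by the classifier weights of the target class.
    w = model.fc.weight[target_class]                  # shape (512,)
    cam = torch.einsum("k,bkhw->bhw", w, feats)
    # Min-max normalize and upsample to the input resolution for overlay.
    cam = cam - cam.amin(dim=(1, 2), keepdim=True)
    cam = cam / (cam.amax(dim=(1, 2), keepdim=True) + 1e-9)
    return F.interpolate(cam.unsqueeze(1), size=image.shape[-2:], mode="bilinear", align_corners=False)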

5. Relation to Alternative Ordinal Losses

Unlike squared Earth Mover's Distance (EMD$^2$) losses (Hou et al., 2016), which penalize misclassifications using an explicit ground distance matrix (possibly learned during training), CDW-CE directly incorporates the absolute class distance raised to the power $\alpha$ as the penalty term. Both frameworks generalize CE by considering the full class probability vector rather than focusing solely on the true-class probability. However, CDW-CE is computationally less demanding, requiring only simple polynomial distance weighting, and does not require a precomputed or dynamically learned ground distance matrix.
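For comparison, a minimal sketch of the EMD$^2$ loss in the closed form commonly used for ordered classes with unit ground distance, the squared difference of cumulative distributions (shown for illustration only):

import torch
import torch.nn.functional as F

def emd2_loss(logits, targets):
    # Squared EMD for ordered classes: squared L2 distance between the
    # cumulative distributions of the prediction and the one-hot target.
    probs = F.softmax(logits, dim=1)
    onehot = F.one_hot(targets, num_classes=probs.shape[1]).float()
    return (probs.cumsum(dim=1) - onehot.cumsum(dim=1)).pow(2).sum(dim=1).mean()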

Hybrid approaches, such as EMD$^2$-regularized cross-entropy, further smooth the penalty landscape but entail more hyperparameter tuning and complexity.

6. Hyperparameter Tuning, Advantages, and Limitations

Key hyperparameters:

  • $\alpha$: Typically chosen in $\{1, 3, 5, 7\}$ using validation metrics such as QWK or MAE; governs the severity of penalties for distant-class errors.
  • $m$: Optional additive margin, recommended for improved cluster compactness; requires a sweep over a fine grid in $[0, 0.1]$.

Best practices:

  • Begin with $\alpha = 1$ (linear penalty), test for improvement, then sweep higher values (a minimal sweep sketch follows this list).
  • Monitor confusion matrices to detect over-penalization (errors skipping adjacent classes).
  • Apply a small log-smoothing factor (e.g., $\epsilon = 10^{-9}$) for numerical stability.
  • Pair loss tuning with standard regularization to guard against instability at high $\alpha$.
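A minimal validation-sweep sketch under these recommendations (here `train_model`, `evaluate`, `train_loader`, and `val_loader` are hypothetical helpers standing in for the user's own training and evaluation code):

from sklearn.metrics import cohen_kappa_score

# Hypothetical sweep: select alpha by Quadratic Weighted Kappa on a validation split.
best_alpha, best_qwk = None, -1.0
for alpha in (1, 3, 5, 7):
    model = train_model(train_loader, loss_fn=lambda lo, t, a=alpha: cdw_ce_loss(lo, t, alpha=a))
    y_true, y_pred = evaluate(model, val_loader)
    qwk = cohen_kappa_score(y_true, y_pred, weights="quadratic")
    if qwk > best_qwk:
        best_alpha, best_qwk = alpha, qwk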

Advantages:

  • Respects the ordinal structure by penalizing errors proportional to their real severity.
  • Plug-and-play: it does not alter the model architecture or label representation.
  • Enhances both quantitative performance and the clinical interpretability of learned models, as validated by domain experts (Polat et al., 2022, Polat et al., 2 Dec 2024).

Limitations:

  • Choice of $\alpha$ and margin $m$ is task-dependent and must be tuned.
  • Excessively large $\alpha$ may harm performance by discouraging correct adjacent-class predictions.
  • May introduce moderate additional training instability, remediable via learning rate adjustments and regularization.

7. Applications and Generalization

CDW-CE has been principally validated in endoscopic severity scoring for ulcerative colitis but generalizes readily to other ordinal digital pathology and medical imaging tasks, such as diabetic retinopathy grading or cancer staging with more than four class levels (Polat et al., 2022, Polat et al., 2 Dec 2024). The methodology is not specific to gastrointestinal datasets and is applicable to any scenario where ordinal class proximity is clinically or semantically meaningful.

For such tasks, standard recommendations include repeating the $\alpha$ optimization, as the effective class-distance spectrum broadens with more ordinal levels. The method is compatible with any softmax network architecture and can be transparently adopted in future datasets with ordinal label structures.


CDW-CE loss formalizes ordinal misclassification cost in a minimally invasive yet mathematically principled way, offering direct improvement over categorical cross-entropy for tasks where ordinal error structure is intrinsic. Its efficacy is substantiated by both quantitative metrics and enhancements in clinical-model interpretability (Polat et al., 2022, Polat et al., 2 Dec 2024).
