Similarity-Distance-Magnitude (SDM) Activation
- The SDM activation function is a neural network calibration method that fuses similarity, distance, and magnitude to improve uncertainty awareness and robust classification.
- It computes similarity via exemplar depth, evaluates distance using quantile-based metrics, and adjusts logit magnitude for sharper decision boundaries.
- Empirical studies show that SDM enhances OOD detection and selective classification in applications like language modeling, image recognition, and verification.
The Similarity-Distance-Magnitude (SDM) activation function is an approach to neural network output calibration and robustness that replaces or augments the classical softmax layer with a formulation explicitly incorporating three signals: similarity to known training exemplars, distance to the training manifold, and the standard decision-boundary magnitude. By combining these signals within the activation and calibration pipeline, SDM activations provide improved awareness of epistemic uncertainty, robustness to distribution shifts, and interpretable probabilities grounded directly in exemplar neighborhoods. They have been applied across deep learning tasks, including language modeling, classification, and verification, with reported improvements in selective classification, out-of-distribution (OOD) rejection, and interpretability-by-exemplar (Schmaltz, 27 Feb 2025, Schmaltz, 16 Sep 2025, Schmaltz, 30 Oct 2025).
1. Mathematical Formulation and Components
Let $x$ denote an input to a frozen or fixed underlying model (e.g., an LLM or CNN), on top of which a small, trainable 1D CNN-based adaptor produces the feature representation $h'$. The SDM activation explicitly encodes:
- Similarity ($q$): The depth, or run length, of consecutive nearest neighbors in feature space whose predicted class and ground-truth label both match the predicted class of the input (i.e., the “depth-match”). To compute it, all training set representations $\{{h'}_n^{\rm tr}\}$ are sorted by distance to $h'$, and $q$ is the maximal $k$ such that the first $k$ neighbors are both predicted and labeled as the current predicted class.
- Distance ($d$): The quantile location of the input’s nearest-neighbor distance relative to the class-conditional empirical cumulative distribution function (eCDF) on a calibration split. For class $c$, $\mathrm{eCDF}^{c}_{\rm ca}$ is the empirical CDF of nearest-neighbor distances for calibration points of class $c$. For a test input with predicted class $\hat{y}$ and nearest-neighbor distance $d_{\rm nearest}$:
$$d = 1 - \mathrm{eCDF}^{\hat{y}}_{\rm ca}(d_{\rm nearest})$$
This approaches zero for far-out-of-distribution points.
- Magnitude ($z'$): The raw logit or decision-boundary margin produced by the final linear layer atop $h'$, analogous to the standard class logit.
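The similarity and distance signals described above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function names `similarity_depth` and `distance_quantile` are hypothetical, Euclidean distance and a brute-force sort stand in for whatever metric and index are actually used, and the distance quantile is assumed to be one minus the calibration eCDF evaluated at the nearest-neighbor distance.

```python
import math
from bisect import bisect_right


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors (assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def similarity_depth(h, predicted_class, train_feats, train_labels, train_preds):
    """Count consecutive nearest training neighbors whose label AND prediction
    both equal predicted_class; stop at the first neighbor that breaks the run."""
    order = sorted(range(len(train_feats)), key=lambda n: euclidean(h, train_feats[n]))
    q = 0
    for n in order:
        if train_labels[n] == predicted_class and train_preds[n] == predicted_class:
            q += 1
        else:
            break
    return q


def distance_quantile(d_nearest, calib_distances):
    """Assumed form: 1 - eCDF(d_nearest), with the eCDF built from the
    nearest-neighbor distances of one class's calibration points."""
    ordered = sorted(calib_distances)
    ecdf = bisect_right(ordered, d_nearest) / len(ordered)
    return 1.0 - ecdf


# Toy data: four training points with labels and model predictions.
train_feats = [(0.0, 0.0), (0.1, 0.0), (0.2, 0.0), (1.0, 1.0)]
train_labels = [0, 0, 1, 1]
train_preds = [0, 0, 0, 1]

print(similarity_depth((0.0, 0.0), 0, train_feats, train_labels, train_preds))  # 2
print(distance_quantile(0.25, [0.1, 0.2, 0.3, 0.4]))  # 0.5
print(distance_quantile(5.0, [0.1, 0.2, 0.3, 0.4]))   # 0.0
```

In the toy data the third-nearest neighbor is predicted as class 0 but labeled 1, so the run stops at depth 2. A distance far beyond every calibration distance yields a quantile of 0, matching the behavior described for far-out-of-distribution points.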
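The SDM activation for each class $i$, given similarity $q$, distance $d$, and logits $z'$ over $C$ classes, is:
$$\mathrm{sdm}(z')_i = \frac{(2+q)^{d \cdot z'_i}}{\sum_{c=1}^{C} (2+q)^{d \cdot z'_c}}$$
Softmax is recovered at $q = e - 2$, $d = 1$, which corresponds to a temperature of 1. When $d = 0$, every exponent vanishes and the distribution is uniform; when $q = 0$, the base drops to its minimum of 2 and the distribution is flatter than the corresponding softmax. Both cases signal high epistemic uncertainty (Schmaltz, 27 Feb 2025, Schmaltz, 16 Sep 2025).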
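The limiting cases can be checked numerically. The sketch below assumes the activation takes the form $(2+q)^{d \cdot z'_i}$ normalized over classes; `sdm_activation` and `softmax` are illustrative names, and the computation runs in log space for numerical stability.

```python
import math


def sdm_activation(logits, q, d):
    """Assumed form: (2+q)^(d*z_i) / sum_c (2+q)^(d*z_c)."""
    ln_base = math.log(2.0 + q)
    # (2+q)^(d*z) == exp(d * z * ln(2+q)); subtract the max before exponentiating.
    scaled = [d * z * ln_base for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def softmax(logits):
    """Standard softmax, used here only as a reference point."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


z = [2.0, 0.5, -1.0]
print(sdm_activation(z, q=math.e - 2.0, d=1.0))  # matches softmax(z) up to rounding
print(sdm_activation(z, q=5, d=0.0))             # uniform over the three classes
print(sdm_activation(z, q=10, d=1.0))            # sharper than softmax(z)
```

With $d = 0$ every scaled logit is zero, so the output is uniform regardless of $q$. A large $q$ with $d$ near 1 raises the base above $e$ and sharpens the distribution.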
2. Theoretical Motivation and Distinction from Softmax
Traditional softmax activations only encode the relative magnitude of class logits (i.e., distance to the decision boundary, “aleatoric” uncertainty) and do not distinguish between well-populated regions of the feature space and OOD or adversarial regions. SDM addresses this by:
- Similarity-awareness ($q$): Quantifies the local density of correctly classified and predicted training points, serving as a local effective sample-size indicator. Low $q$ signals lack of reliable exemplar support.
- Distance-awareness ($d$): Measures the quantile position of the input’s distance to the training manifold in the calibration set, identifying departures from the known distribution.
- Magnitude ($z'$): Maintains the standard margin confidence.
Multiplicatively combining these (by exponentiating with base $2+q$ and scaling the logits by $d$) yields a sharply peaked distribution for well-supported in-distribution points, and a flatter or uniform distribution for OOD or low-support points. Epistemic uncertainties are directly encoded, providing robust confidence estimates that do not collapse in distribution-shifted or OOD settings (Schmaltz, 27 Feb 2025, Schmaltz, 16 Sep 2025, Schmaltz, 30 Oct 2025).
3. Training and Calibration Methodologies
SDM networks are trained, calibrated, and applied in three steps:
- Adaptor Training: Training fits the 1D-CNN-based adaptor, the linear head, and the calibration structures by minimizing the negative log-likelihood in the SDM base, i.e., with the logarithm taken in base $2+q$:
$$\mathcal{L} = -\frac{1}{N}\sum_{n=1}^{N} \log_{(2+q_n)} \mathrm{sdm}(z'_n)_{y_n}$$
Here $y_n$ is the label of training example $n$. After each epoch, $q$ and $d$ are recomputed and frozen for the next epoch. Early stopping is triggered when the balanced calibration loss plateaus.
- Calibration/Selective Classification: A held-out calibration set is used to build class-conditional eCDFs for distance statistics and SDM outputs. Thresholds for each class and for the HR region are found such that the class-conditional accuracy among accepted points exceeds a chosen target. Dvoretzky–Kiefer–Wolfowitz (DKW) bounds are applied to quantify finite-sample uncertainty, and conservative region-specific thresholds are computed to achieve the target accuracy while maximizing the acceptance rate (Schmaltz, 27 Feb 2025, Schmaltz, 16 Sep 2025).
- Inference: For a test input, the model computes $h'$, obtains $q$, $d$, and $z'$, and calculates the SDM probabilities. A selective classifier then admits the prediction only if every calibration-derived, class-conditional threshold is exceeded, including the threshold on the SDM output; otherwise, the input is rejected (abstained) (Schmaltz, 16 Sep 2025, Schmaltz, 27 Feb 2025).
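The training objective can be illustrated with a small sketch. It assumes the loss is the mean negative logarithm, taken in base $2+q$, of the SDM probability assigned to the true class, with $q$ and $d$ treated as fixed per-example constants within an epoch. The names `sdm_log_prob` and `sdm_nll` are hypothetical, and no gradient computation is shown.

```python
import math


def sdm_log_prob(logits, target, q, d):
    """log_{2+q} of the SDM probability of `target`, computed stably.
    Uses log_b(p) = ln(p) / ln(b) with b = 2 + q."""
    ln_base = math.log(2.0 + q)
    scaled = [d * z * ln_base for z in logits]
    m = max(scaled)
    # Natural-log normalizer via the log-sum-exp trick.
    log_norm = m + math.log(sum(math.exp(s - m) for s in scaled))
    return (scaled[target] - log_norm) / ln_base


def sdm_nll(batch_logits, targets, qs, ds):
    """Mean negative log-likelihood over a batch, each example in its own base."""
    total = sum(
        sdm_log_prob(z, y, q, d)
        for z, y, q, d in zip(batch_logits, targets, qs, ds)
    )
    return -total / len(targets)


# With d = 0 the two-class output is uniform, so the loss in base 2 (q = 0) is 1.
print(sdm_nll([[3.0, -1.0]], [0], [0], [0.0]))  # 1.0

# With q = e - 2 and d = 1 the loss reduces to ordinary cross-entropy.
print(sdm_nll([[2.0, 0.5]], [0], [math.e - 2.0], [1.0]))
```

Because the base changes with $q$, a well-supported example (large $q$) and a poorly supported one contribute on different logarithmic scales, which is one way to read the phrase "in the SDM base".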
4. Empirical Performance and Applications
SDM activations have been empirically validated on language modeling, closed-box binary classification, fact-checking, and LLM sequence verification (Schmaltz, 27 Feb 2025, Schmaltz, 16 Sep 2025, Schmaltz, 30 Oct 2025). Salient findings include:
- OOD Robustness: On out-of-distribution datasets (e.g., IMDb sentiment OOD, fact-check benchmarks), SDM is reported as the only method that reaches the target accuracy on admitted points while rejecting the remainder, outperforming softmax, temperature scaling, and conformal calibrators.
- Coverage vs. Accuracy Trade-off: SDM achieves high accuracy on selective predictions while maintaining reasonable acceptance rates. For example, SDM achieves [1.00, 1.00] class-conditional accuracy on ≈1% of OOD-admitted IMDb examples, where other methods either admit too many OOD points or achieve low coverage.
- Fine-tuning SDM LMs: Fine-tuning decoder-only Transformers with the SDM next-token loss and contrastive sampling yields increased statistical efficiency (lower abstention rates) and maintains ≥95% accuracy in high-probability acceptance regions.
- Data Quality Discovery: SDM’s strict acceptance criteria highlight label errors in calibration/test splits, surfacing genuine annotation mistakes (Schmaltz, 27 Feb 2025).
- Black-box LLM Calibration: Applying SDM activations atop LLM output logits enables reliable selective generation and region-specific coverage guarantees, without requiring retraining the core backbone (Schmaltz, 16 Sep 2025, Schmaltz, 27 Feb 2025).
5. Interpretability and Exemplar-based Explanations
Accepted points under SDM admit direct interpretability via their depth-matching exemplars: for each test input, the first $q$ training points forming the depth-match can be retrieved and visualized, all sharing the predicted class as both label and prediction. This enables “interpretability-by-exemplar” and partitioning of the acceptance region by similarity, distance, and calibration statistics. The mechanism offers richer, instance-grounded explanations than standard softmax-based confidence, which lacks any link to concrete training exemplars (Schmaltz, 16 Sep 2025, Schmaltz, 30 Oct 2025).
6. Practical Integration and Implementation
SDM activation is integrated as a lightweight, final-layer module atop a frozen or fine-tuned backbone. Key practical guidelines include:
- Index and cache all training embeddings for efficient nearest-neighbor search (with FAISS or Annoy for large datasets).
- Attach a 1D-CNN adaptor with up to 1000 filters, followed by a linear layer; batch sizes of about 50 with the Adam optimizer are typical.
- For high-confidence selective classification, use class-conditional thresholds found via calibration-set eCDFs and set the acceptance risk parameter as desired.
- In LLMs and Transformers, the SDM module operates on penultimate representations; in CNNs, it operates on global pooled embeddings.
- The dominant computational cost is nearest-neighbor indexing; all other operations are negligible relative to ML inference time (Schmaltz, 16 Sep 2025, Schmaltz, 30 Oct 2025).
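The calibration-side guidelines can be sketched as follows. This is an illustration under stated assumptions, not the procedure from the cited papers: `dkw_epsilon` implements the standard Dvoretzky–Kiefer–Wolfowitz band half-width, `conservative_distance_quantile` assumes the band is subtracted from one minus the eCDF, and `select_threshold` is a hypothetical brute-force search for one class's acceptance threshold.

```python
import math
from bisect import bisect_right


def dkw_epsilon(n, delta=0.05):
    """DKW half-width: sup|F_n - F| <= eps with probability at least 1 - delta."""
    return math.sqrt(math.log(2.0 / delta) / (2.0 * n))


def conservative_distance_quantile(d_nearest, calib_distances, delta=0.05):
    """Assumed usage: lower the quantile 1 - eCDF(d_nearest) by the DKW
    half-width and clip at zero, giving a finite-sample lower bound."""
    ordered = sorted(calib_distances)
    n = len(ordered)
    ecdf = bisect_right(ordered, d_nearest) / n
    return max(0.0, (1.0 - ecdf) - dkw_epsilon(n, delta))


def select_threshold(scores, correct, target):
    """Smallest threshold whose accepted set (score >= threshold) reaches the
    target accuracy, which maximizes acceptance; None if no threshold does."""
    for threshold in sorted(set(scores)):
        accepted = [c for s, c in zip(scores, correct) if s >= threshold]
        if sum(accepted) / len(accepted) >= target:
            return threshold
    return None


# Calibration scores for one class, with 1 marking a correct prediction.
scores = [0.55, 0.6, 0.7, 0.8, 0.9, 0.95]
correct = [0, 1, 0, 1, 1, 1]

print(select_threshold(scores, correct, target=0.75))  # 0.6
print(select_threshold(scores, correct, target=1.0))   # 0.8
print(round(dkw_epsilon(1000), 4))                     # 0.0429
```

Raising the target accuracy pushes the threshold up and shrinks the accepted set, which is the coverage versus accuracy trade-off. The DKW half-width shrinks as the square root of the calibration set size, so small calibration splits yield noticeably more conservative quantiles.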
7. Comparison to Related Metric and Activation Functions
SDM extends and generalizes the notion of RBF or prototype-based similarity in neural classification (Amirian et al., 2022). In contrast to the quadratic RBF kernel, which encodes similarity via Mahalanobis distance to class centers, SDM combines similarity depth, distance quantile, and margin in a single multiplicative activation. Unlike contrastive, triplet, or margin-based metric learning, which require explicit mining and margin balancing, SDM builds all three signals into the activation/output level, avoids complex sampling schemes, and supports end-to-end calibration and selective classification. Unlike Gaussian or multiquadric RBFs, SDM’s formulation directly supports instance-level calibration and is reported to achieve robust OOD detection without vanishing gradients or exponential attenuation (Amirian et al., 2022, Schmaltz, 27 Feb 2025, Schmaltz, 16 Sep 2025).
Key References:
- "Similarity-Distance-Magnitude Universal Verification" (Schmaltz, 27 Feb 2025)
- "Similarity-Distance-Magnitude Activations" (Schmaltz, 16 Sep 2025)
- "Similarity-Distance-Magnitude LLMs" (Schmaltz, 30 Oct 2025)
- "Radial Basis Function Networks for Convolutional Neural Networks to Learn Similarity Distance Metric and Improve Interpretability" (Amirian et al., 2022)