Threshold Classifiers in Machine Learning

Updated 14 May 2026

Threshold classifiers are predictive models that convert continuous score outputs into discrete class labels via decision thresholds.
They employ methods like grid search, MILP, and quantile-based tuning to optimize performance metrics such as accuracy, F1 score, and cost minimization.
Applied in areas like fraud detection, deep learning, and fairness-aware systems, they adapt thresholds to address class imbalance and adversarial challenges.

A threshold classifier is a predictive model that assigns discrete class labels by comparing a continuous score or probability estimate to one or more decision thresholds. The canonical structure is: generate a real-valued output from a learned function, such as the posterior probability of a class, and assign a label by thresholding this value. Thresholding is a foundational concept in binary and multiclass classification, cost-sensitive learning, constrained optimization, score calibration, outlier and adversarial detection, fairness adjustment, and model interpretability. It arises both in classical statistical decision theory and in virtually all modern machine learning workflows.

1. Mathematical Formulations and General Principles

Let $f(x)$ denote the score produced by a trained classifier for an instance $x$. A threshold classifier with threshold $\theta$ predicts

$\hat{y}(x;\theta) = \begin{cases} +1 & \text{if } f(x) \geq \theta \\ -1 & \text{otherwise} \end{cases}$

in the binary case. For multiclass models (e.g., softmax outputs $p(x) = (p_1, \dots, p_K)$), the decision generalizes to selecting the class $j$ for which $p_j - p_k \geq \tau_j - \tau_k$ for all $k \neq j$, under a parameterized threshold vector $\tau$ on the simplex (Marchetti et al., 16 May 2025).
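The two decision rules above can be sketched in a few lines of Python. This is a minimal illustration, not code from the cited work: the function names `predict_binary` and `predict_multiclass` are chosen here, and ties in the multiclass case are broken toward the lowest class index as an assumption.

```python
from typing import Sequence


def predict_binary(score: float, theta: float) -> int:
    """Return +1 when the score reaches the threshold, else -1."""
    return 1 if score >= theta else -1


def predict_multiclass(p: Sequence[float], tau: Sequence[float]) -> int:
    """Return the class j with p_j - p_k >= tau_j - tau_k for all k != j.

    The condition is equivalent to maximizing p_j - tau_j. Ties are broken
    toward the lowest index (an assumption of this sketch).
    """
    shifted = [pj - tj for pj, tj in zip(p, tau)]
    return max(range(len(p)), key=lambda j: shifted[j])


if __name__ == "__main__":
    print(predict_binary(0.7, 0.5))                           # score above threshold
    print(predict_multiclass([0.5, 0.3, 0.2], [1/3, 1/3, 1/3]))  # plain argmax
    print(predict_multiclass([0.5, 0.3, 0.2], [0.6, 0.3, 0.1]))  # shifted boundaries
```

With a uniform $\tau$ the multiclass rule reduces to the ordinary argmax; a non-uniform $\tau$ moves the decision boundaries in favor of classes with smaller $\tau_j$.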

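Given true class priors $\pi_+, \pi_-$ and costs $c_{\mathrm{FN}}, c_{\mathrm{FP}}$ for the two error types (false negatives and false positives), the expected cost at threshold $\theta$ is

$C(\theta) = \pi_+ \, c_{\mathrm{FN}} \, P(f(x) < \theta \mid y = +1) + \pi_- \, c_{\mathrm{FP}} \, P(f(x) \geq \theta \mid y = -1)$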

(Tian et al., 2018, Omar et al., 2019).
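As a rough illustration of the expected-cost criterion, the sketch below estimates the cost empirically on labeled scores and picks the candidate threshold with the lowest estimate. The names `expected_cost`, `min_cost_threshold`, `c_fp`, and `c_fn` are assumptions made for this example, labels are assumed to be +1 or -1, and the empirical class frequencies stand in for the priors.

```python
def expected_cost(scores, labels, theta, c_fp, c_fn):
    """Empirical expected cost per example at threshold theta.

    A positive scored below theta counts as a false negative; a negative
    scored at or above theta counts as a false positive. Dividing by the
    sample size weights each error rate by its empirical class prior.
    """
    n = len(scores)
    fn = sum(1 for s, y in zip(scores, labels) if y == 1 and s < theta)
    fp = sum(1 for s, y in zip(scores, labels) if y == -1 and s >= theta)
    return (c_fn * fn + c_fp * fp) / n


def min_cost_threshold(scores, labels, c_fp, c_fn):
    """Scan the unique scores (plus +inf, meaning 'predict no positives')."""
    candidates = sorted(set(scores)) + [float("inf")]
    return min(candidates,
               key=lambda t: expected_cost(scores, labels, t, c_fp, c_fn))


if __name__ == "__main__":
    scores = [0.1, 0.4, 0.35, 0.8]
    labels = [-1, -1, 1, 1]
    # Expensive false negatives pull the threshold down,
    # expensive false positives push it up.
    print(min_cost_threshold(scores, labels, c_fp=1, c_fn=5))
    print(min_cost_threshold(scores, labels, c_fp=5, c_fn=1))
```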

For score-calibrated binary classifiers, the optimal threshold to maximize a target metric (e.g., $F_1$) or minimize cost is often computed by analytic or empirical optimization over the observed score distribution (Lipton et al., 2014, Hong et al., 2016).

2. Algorithmic Threshold Selection and Optimization Methods

There is a spectrum of threshold selection methodologies, depending on the objective, data regime, constraints, and problem structure:

Empirical Grid Search: Scan over candidate thresholds on validation data (often the sorted unique scores) to maximize a threshold-dependent metric such as accuracy or $F_1$ (Lipton et al., 2014).
Order-Statistic-based Methods: Choose the threshold as an order statistic to optimize expected cost, with formal guarantees (e.g., the THORS algorithm) (Tian et al., 2018).
Mixed Integer Linear Programming (MILP): Formulate threshold selection as a small MILP to optimize arbitrary linear metrics and constraints (volume, sensitivity, FPR, etc.), both globally and on subgroups (Koseoglu et al., 2024).
Quantile-based Thresholding: Parameterize the threshold as the $q$-quantile of the scores, converting rate-constrained problems into unconstrained surrogates that are differentiable, compatible with SGD, and theoretically guaranteed (Mackey et al., 2018).
A Posteriori Tuning: In deep networks, thresholds can be tuned post hoc on the simplex (for multiclass) to maximize macro-F1, accuracy, or other scores, implemented as an efficient search over the simplex (grid or Monte Carlo) (Marchetti et al., 16 May 2025).
Group-specific/Fairness-aware Thresholds: Learn per-group thresholds to minimize performance disparities (e.g., balanced error rate, demographic parity subject to group accuracy constraints) (Jung et al., 6 Feb 2025).

Method	Scope	Guarantees/Properties
Grid/empirical search	Any classifier, any metric	No formal optimality, but widely adopted
Order statistics (THORS)	Binary, cost-sensitive	Statistical cost bounds, efficient runtime
MILP (OTLP)	Model-agnostic, multiclass, subspace	Provably optimal wrt. the MILP, flexible constraints
Quantile (quantile-SGD)	Any differentiable model	Uniform convergence rate, no dual variables
A posteriori simplex grid	Deep multiclass nets	Exhaustive over simplex $\Delta^{K-1}$, tractable for small $K$
Group-adaptive (FairOPT)	Any model, per-group	Empirically robust fairness improvements
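The simplest entry in the table, empirical grid search, can be sketched as follows. This is a minimal example assuming binary labels coded as +1 and -1 and the rule "predict positive when the score is at or above the threshold"; the function names `f1_at` and `best_f1_threshold` are chosen here for illustration.

```python
def f1_at(scores, labels, theta):
    """F1 of the rule 'predict +1 when score >= theta'."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= theta and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= theta and y != 1)
    fn = sum(1 for s, y in zip(scores, labels) if s < theta and y == 1)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def best_f1_threshold(scores, labels):
    """Try every unique validation score as a threshold; keep the best F1."""
    best_theta, best_f1 = None, -1.0
    for theta in sorted(set(scores)):
        f1 = f1_at(scores, labels, theta)
        if f1 > best_f1:  # strict '>' keeps the lowest threshold among ties
            best_theta, best_f1 = theta, f1
    return best_theta, best_f1


if __name__ == "__main__":
    scores = [0.9, 0.8, 0.6, 0.4, 0.2]
    labels = [1, 1, -1, 1, -1]
    print(best_f1_threshold(scores, labels))
```

Only the unique scores need to be scanned, because the predictions (and hence the metric) cannot change between two adjacent score values. The selected threshold should be chosen on validation data, not on the data used to report the final metric.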

3. Role in Multiclass, Imbalanced, and Score-constrained Tasks

Multiclass threshold classifiers are formalized by generalizing the argmax rule on the $(K-1)$-simplex. The region boundaries of each class are shifted by a parameter $\tau$; the argmax rule is the special case $\tau = (1/K, \dots, 1/K)$ (Marchetti et al., 16 May 2025). Adjusting $\tau$ allows direct control over class trade-offs, particularly on unbalanced data: tuning $\tau$ yields consistent performance gains, especially for minority-class metrics.
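A post hoc search over the simplex might look like the sketch below. It assumes the rule "predict the class maximizing $p_j - \tau_j$", a regular grid with spacing `1/steps`, and macro-F1 as the target; these choices and all function names are illustrative assumptions, not the procedure of the cited paper.

```python
from itertools import product


def predict(p, tau):
    """Class maximizing p_j - tau_j (lowest index wins ties)."""
    return max(range(len(p)), key=lambda j: p[j] - tau[j])


def macro_f1(y_true, y_pred, n_classes):
    """Unweighted mean of per-class F1 scores."""
    total = 0.0
    for c in range(n_classes):
        tp = sum(1 for t, q in zip(y_true, y_pred) if t == c and q == c)
        fp = sum(1 for t, q in zip(y_true, y_pred) if t != c and q == c)
        fn = sum(1 for t, q in zip(y_true, y_pred) if t == c and q != c)
        denom = 2 * tp + fp + fn
        total += 2 * tp / denom if denom else 0.0
    return total / n_classes


def simplex_grid(n_classes, steps):
    """Yield every tau whose entries are multiples of 1/steps and sum to 1."""
    for combo in product(range(steps + 1), repeat=n_classes - 1):
        if sum(combo) <= steps:
            last = steps - sum(combo)
            yield tuple(k / steps for k in combo + (last,))


def tune_tau(probs, y_true, steps=10):
    """Start from the argmax rule (uniform tau) and keep any strict improvement."""
    n_classes = len(probs[0])
    best_tau = tuple(1.0 / n_classes for _ in range(n_classes))
    best = macro_f1(y_true, [predict(p, best_tau) for p in probs], n_classes)
    for tau in simplex_grid(n_classes, steps):
        score = macro_f1(y_true, [predict(p, tau) for p in probs], n_classes)
        if score > best:
            best_tau, best = tau, score
    return best_tau, best


if __name__ == "__main__":
    probs = [(0.9, 0.1), (0.8, 0.2), (0.7, 0.3), (0.6, 0.4), (0.55, 0.45)]
    y_true = [0, 0, 0, 1, 1]
    print(tune_tau(probs, y_true))
```

In the toy data, the argmax rule never predicts the minority class; shifting $\tau$ recovers it. The grid size grows combinatorially with the number of classes, which is why exhaustive search is only practical for a small number of classes.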

Imbalanced data domains, such as rare-event detection, benefit from adapting the threshold away from the default (e.g., $\theta = 0.5$) towards the minority class proportion. For a linear classifier trained under class imbalance, the optimal threshold shifts with the degree of imbalance, restoring balanced error rates (Hong et al., 2016). This principle extends to decision trees via adaptive Rényi entropy, and to non-linear models via cost-sensitive threshold selection or reweighting.

Constraint-driven regimes (e.g., enforcing a maximum false-positive rate, or fixed positive prediction rate for recall-at-K) convert constraints into quantile conditions for threshold selection, sidestepping dual variable optimization. The differentiable quantile surrogate framework enables scalable stochastic-gradient methods and maintains exact constraint satisfaction (Mackey et al., 2018).
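The link between rate constraints and quantiles can be illustrated with plain order statistics. The sketch below covers only the empirical quantile step, not the differentiable surrogate of the cited work; the function names are chosen here, and scores are assumed to be distinct (ties at the threshold would push the realized rate above the target).

```python
import math


def rate_threshold(scores, positive_rate):
    """Threshold such that floor(positive_rate * n) scores are >= it.

    Returns +inf when the target rate rounds down to zero positives.
    """
    k = math.floor(positive_rate * len(scores))
    if k == 0:
        return math.inf
    return sorted(scores, reverse=True)[k - 1]  # k-th largest score


def fpr_threshold(negative_scores, max_fpr):
    """Threshold whose empirical false-positive rate is at most max_fpr.

    A false-positive-rate cap is a positive-rate cap on the negatives alone.
    """
    return rate_threshold(negative_scores, max_fpr)


if __name__ == "__main__":
    scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
    theta = rate_threshold(scores, 0.3)
    print(theta, sum(s >= theta for s in scores))

    negatives = [0.05, 0.1, 0.2, 0.3, 0.6]
    print(fpr_threshold(negatives, 0.2))
```

Because the threshold is read directly off the sorted scores, the constraint holds on the sample by construction and no Lagrange multiplier has to be tuned.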

4. Statistical Theory and Optimality Results

The theoretical optimality of threshold rules is supported by multiple lines of analysis:

Minimization of Expected Cost: The optimal Bayes classifier compares the likelihood ratio to a threshold set by prior, cost, and metric (Omar et al., 2019, Tian et al., 2018). This extends to cost-sensitive learning (THORS) and order-statistics-based selection (Tian et al., 2018).
Maximizing Non-decomposable Metrics: The $F_1$-maximizing rule assigns positives to all scores $s$ exceeding $F^*/2$ (under calibration), where $F^*$ is the optimal achievable $F_1$ for the classifier (Lipton et al., 2014). This principle carries to macro or micro $F_1$ in multilabel settings.
Performative Settings: When behavior is endogenous to the classifier, as in outcome performativity, the Bayes-optimal classifier remains a threshold rule or its negative (assign the positive class to low-scored individuals) (Penn, 8 Apr 2025). Under certain priors and signal models, the negative threshold may even yield higher overall accuracy due to induced changes in prevalence and separability.
Game-theoretic/Adversarial Contexts: In adversarial classification, Bayesian-Nash equilibrium classifiers may be mixed over parametric threshold rules, constructed through the properties of the corresponding min-max solution—enabling scalable learning even for exponential hypothesis classes (Loiseau et al., 2021).
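For the expected-cost case in the list above, the threshold has a closed form that is easy to check numerically. Predicting positive costs $(1-p)\,c_{\mathrm{FP}}$ in expectation and predicting negative costs $p\,c_{\mathrm{FN}}$, where $p$ is the posterior probability of the positive class, so the positive label is preferred when $p \geq c_{\mathrm{FP}} / (c_{\mathrm{FP}} + c_{\mathrm{FN}})$. The sketch below assumes a calibrated posterior and zero cost for correct decisions; the cost symbols and function names are chosen here for illustration.

```python
def posterior_threshold(c_fp, c_fn):
    """Posterior cut-off minimizing expected cost: c_fp / (c_fp + c_fn)."""
    return c_fp / (c_fp + c_fn)


def likelihood_ratio_threshold(pi_pos, c_fp, c_fn):
    """Equivalent cut-off on the likelihood ratio p(x | +) / p(x | -).

    Follows from the posterior rule via Bayes' theorem:
    LR >= (pi_neg * c_fp) / (pi_pos * c_fn).
    """
    pi_neg = 1.0 - pi_pos
    return (pi_neg * c_fp) / (pi_pos * c_fn)


def decide(posterior, c_fp, c_fn):
    """Return +1 when predicting positive has the lower expected cost."""
    return 1 if posterior >= posterior_threshold(c_fp, c_fn) else -1


if __name__ == "__main__":
    print(posterior_threshold(1, 1))   # symmetric costs
    print(posterior_threshold(1, 4))   # missed positives are 4x as costly
    print(decide(0.3, c_fp=1, c_fn=4))
    print(decide(0.3, c_fp=1, c_fn=1))
```

With symmetric costs the cut-off is the familiar 0.5; making false negatives more expensive lowers it, so more instances are labeled positive.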

5. Practical Applications and Empirical Findings

Threshold classifiers are widely used across domains:

Deep multiclass networks: A posteriori threshold tuning on the softmax simplex systematically boosts both accuracy and macro-F1, especially in unbalanced data. MultiSOL (score-oriented loss) further refines training for direct metric calibration (Marchetti et al., 16 May 2025).
Imbalanced fraud detection: MILP-based threshold selection (OTLP) increases fraud F1 from a baseline of 0.83 to 0.86 (XGBoost), while supporting additional business constraints (Koseoglu et al., 2024).
Cost-sensitive learning: THORS reduces misclassification cost by up to 80% over empirical or naive thresholding and runs faster than those baselines (Tian et al., 2018).
Zero-shot prompt-based text classifiers: Priors on the marginal class probabilities are matched via unsupervised or zero-resource reweighting, and thresholding on the maximum posterior eliminates label-word bias and narrows the gap with oracle-tuned thresholds (Liusie et al., 2023).
Fairness-aware text detection: Group-wise threshold optimization reduces balanced error rate gaps between groups on average with negligible accuracy trade-off, reshaping the Pareto frontier for fair classification (Jung et al., 6 Feb 2025).
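Group-wise thresholding, as in the fairness example above, can be sketched by tuning one threshold per group. The code below simply minimizes the balanced error rate within each group; it is a simplified stand-in and not the FairOPT algorithm. It assumes labels coded as +1 and -1 and that every group contains both classes; all names are illustrative.

```python
def balanced_error_rate(scores, labels, theta):
    """Mean of the false-negative and false-positive rates at threshold theta."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y != 1]
    fnr = sum(s < theta for s in pos) / len(pos)
    fpr = sum(s >= theta for s in neg) / len(neg)
    return 0.5 * (fnr + fpr)


def per_group_thresholds(scores, labels, groups):
    """Pick, for each group, the unique score minimizing that group's BER."""
    thresholds = {}
    for g in sorted(set(groups)):
        idx = [i for i, gi in enumerate(groups) if gi == g]
        gs = [scores[i] for i in idx]
        gl = [labels[i] for i in idx]
        thresholds[g] = min(sorted(set(gs)),
                            key=lambda t: balanced_error_rate(gs, gl, t))
    return thresholds


if __name__ == "__main__":
    scores = [0.9, 0.7, 0.4, 0.2, 0.6, 0.5, 0.3, 0.1]
    labels = [1, 1, -1, -1, 1, 1, -1, -1]
    groups = ["a"] * 4 + ["b"] * 4
    print(per_group_thresholds(scores, labels, groups))
```

In this toy data the scores of group "b" are systematically lower, so a single shared threshold tuned on group "a" would miss every positive in group "b", while separate thresholds separate both groups cleanly. With small groups, per-group tuning of this kind can overfit.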

6. Limitations, Common Issues, and Model Selection

Several factors complicate thresholding in practice:

Discretization noise: Artificial thresholding of a continuous outcome introduces ambiguity near the threshold, significantly affecting class-specific metrics (precision, recall, F1), though balanced metrics (AUC, MCC) are robust. Feature importance interpretations are stable for the top-ranked features but not for lower ranks. Best practice: estimate and exclude the noisy band via incremental removal (Rajbahadur et al., 2022).
Sampling error and overfitting: Empirical maximization of non-decomposable metrics (e.g., $F_1$) can "over-select" positives for rare classes, leading to pathological behaviors, such as predicting almost all examples as positive for rare uninformative labels (Lipton et al., 2014).
Complexity and scalability: Some approaches (e.g., MILP) face intractability for very large candidate sets; group-adaptive thresholding may overfit if subgroups are small; sample average approximation is required for scalability in adversarial scenarios (Koseoglu et al., 2024, Jung et al., 6 Feb 2025, Loiseau et al., 2021).

7. Extensions and Advanced Topics

Threshold classifiers admit extensions into ensemble and kernel regimes (BTC/KBTC), hybridization with abstain systems to control confusion set sizes (Toksöz, 2017, Srinivasan, 2017), multiclass simplex-geometry, and robust learning under adversarial uncertainty. Parameter selection frameworks based on sufficient identification conditions allow for efficient operating point estimation without costly cross-validation (Toksöz, 2017).

Further generalizations concern multiclass quantile programming, group-fairness under intersectional constraints, and integration with modern uncertainty calibration or OOD detection schemes, where confidence-calibrated thresholding is a central post-training adjustment (Lee et al., 2017).

Key references:

"Multiclass threshold-based classification" (Marchetti et al., 16 May 2025)
"Constrained Classification and Ranking via Quantiles" (Mackey et al., 2018)
"Optimal classification with endogenous behavior" (Penn, 8 Apr 2025)
"OTLP: Output Thresholding Using Mixed Integer Linear Programming" (Koseoglu et al., 2024)
"Thresholding Classifiers to Maximize F1 Score" (Lipton et al., 2014)
"THORS: An Efficient Approach for Making Classifiers Cost-sensitive" (Tian et al., 2018)
"Dealing with Class Imbalance using Thresholding" (Hong et al., 2016)
"Group-Adaptive Threshold Optimization for Robust AI-Generated Text Detection" (Jung et al., 6 Feb 2025)
"Scalable Optimal Classifiers for Adversarial Settings under Uncertainty" (Loiseau et al., 2021)