Decoy-Free FDR Estimator
- Decoy-free FDR estimation is a statistical method that avoids synthetic nulls by using empirical data, model-agnostic calibration, and mixture models to estimate error rates.
- It leverages techniques such as empirical conditioning, local FDR estimation, and Rao–Blackwellization to provide conservative and interpretable uncertainty quantification.
- The approach is applicable in proteomics, genomics, and high-dimensional data analysis, offering a robust alternative to conventional decoy-dependent methods.
A decoy-free false discovery rate (FDR) estimator is a statistical approach for estimating or controlling the FDR in multiple hypothesis testing without relying on artificially constructed decoy data or simulated nulls. Instead, estimation and inference rest on the structure of the observed data, built-in statistical guarantees, model-agnostic calibration, or competition-based constructs, often leveraging conditioning, mixture models, or calibration methods that avoid or minimize explicit use of decoy sets. This class of techniques can provide FDR control, accurate uncertainty quantification, and interpretability in contexts where traditional target–decoy, knockoff, or permutation-based nulls are impractical, inefficient, or insufficiently rigorous.
1. Key Principles of Decoy-Free FDR Estimation
Decoy-free FDR methods replace the use of explicit decoy or knockoff variables—common in, for example, target–decoy competition (TDC) or model-X knockoff frameworks—with procedures exploiting either (a) fully empirical null estimation, (b) local adjustment to capture individual hypothesis uncertainty, or (c) calibration techniques that map model-derived confidence scores to well-calibrated posterior probabilities of correctness.
The conceptual foundation is that, given suitable conditioning or calibration, FDR or local FDR (LFDR) can be estimated or bounded in a way that is provably conservative and avoids the instability, overhead, or potential model mis-specification that may result from generating explicit decoy distributions. Decoy-free estimators retain non-negativity and conservatism by design, often by using conditional expectations, Rao–Blackwellization, or competitive density estimation.
Across implementations, a unifying theme is that the estimator either (i) directly estimates the probability that a discovery is a false positive using observed and/or calibrated data-derived features, or (ii) rigorously constructs bounds on the realized false discovery proportion without augmenting the data with synthetic nulls.
2. Methodological Strategies
Local False Discovery Rate via Empirical Conditioning
A prominent decoy-free approach is based on estimation of the local FDR (LFDR), i.e., the posterior probability that a specific hypothesis is null given its test statistic or p-value. The core relation is

$$\mathrm{LFDR}_i = \Pr(H_i = 0 \mid T_i = t_i),$$

where $H_i \in \{0,1\}$ indicates whether the $i$th null hypothesis is true ($H_i = 0$) or false, and $t_i$ is its observed statistic. In the fully decoy-free construction (Bickel, 2011):
- The nonlocal FDR (NFDR) over a region is estimated without parametric modeling by maximum-likelihood or confidence-median correction.
- The LFDR for an individual hypothesis is estimated by connecting its position in the sorted set of p-values to the NFDR, using a threshold tied to the p-value's rank. For instance, the estimator

$$\widehat{\mathrm{LFDR}}\bigl(p_{(i)}\bigr) = \widehat{\mathrm{NFDR}}\bigl(\{p : p \le p_{(i)}\}\bigr)$$

is provably asymptotically conservative for independent, continuous p-values. A monotonicity adjustment ensures the LFDRs respect p-value ordering, eliminating the need for decoys (Bickel, 2011).
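The rank-based estimate and monotonicity adjustment can be sketched in a few lines. This is an illustrative sketch, not Bickel's exact construction: the plug-in NFDR $\min(1, m\,p_{(i)}/i)$ and the cumulative-maximum monotonicity adjustment are simplifying assumptions.

```python
import numpy as np

def lfdr_estimates(pvalues):
    """Rank-based LFDR sketch (illustrative, not Bickel's exact method):
    tie each sorted p-value's LFDR to a plug-in NFDR over the region it
    thresholds, then enforce monotonicity in the p-value ordering."""
    p = np.sort(np.asarray(pvalues, dtype=float))
    m = len(p)
    ranks = np.arange(1, m + 1)
    # Plug-in NFDR over {p <= p_(i)}: expected null count / discoveries.
    nfdr = np.minimum(1.0, m * p / ranks)
    # Monotonicity adjustment: estimates must not decrease as p grows.
    lfdr = np.maximum.accumulate(nfdr)
    return p, lfdr

sorted_p, lfdr = lfdr_estimates([0.5, 0.001, 0.8, 0.002])
print(lfdr)  # non-decreasing and bounded by 1
```

The cumulative maximum is one simple way to enforce the p-value ordering constraint; any isotonic adjustment would serve the same purpose.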
Model-Agnostic FDR Estimation via Calibration
In "Winnow," a deep learning-based framework for de novo peptide sequencing FDR estimation, a neural network calibrator is trained to map raw model scores and peptide-spectrum match (PSM)–specific features to calibrated probabilities. If $s$ is a calibrated confidence score for a PSM, the key calibration property,

$$\Pr(\text{PSM correct} \mid s) = s,$$

enables the non-parametric FDR estimator

$$\widehat{\mathrm{FDR}}(\tau) = \frac{\sum_i (1 - s_i)\,\mathbf{1}[s_i \ge \tau]}{\sum_i \mathbf{1}[s_i \ge \tau]},$$

where $s_i$ are calibrated confidences, $\mathbf{1}[\cdot]$ is the indicator function, and the sums run over predictions above confidence threshold $\tau$ (Mabona et al., 29 Sep 2025). This formulation eliminates the need for a decoy peptide score distribution and directly links model calibration quality to FDR estimation accuracy.
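Under correct calibration, this estimator reduces to averaging $1 - s_i$ over the accepted set. A minimal sketch (the function name and example values are illustrative):

```python
import numpy as np

def fdr_estimate(scores, threshold):
    """Decoy-free FDR estimate from calibrated confidences: each accepted
    prediction contributes an expected error of (1 - s); the estimate is
    the mean expected error over the accepted set."""
    scores = np.asarray(scores, dtype=float)
    accepted = scores >= threshold
    if not accepted.any():
        return 0.0  # nothing accepted, no false discoveries
    return float((1.0 - scores[accepted]).mean())

# Accept the three predictions with confidence >= 0.8; their expected
# errors are 0.05, 0.10, and 0.15, averaging to 0.10.
print(round(fdr_estimate([0.95, 0.9, 0.85, 0.6, 0.4], 0.8), 6))  # 0.1
```

Because the estimate depends only on the calibrated scores themselves, no decoy score distribution is needed at any point.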
Rao–Blackwellized Generic FDR Estimation
A general decoy-free estimator for arbitrary selection procedures (Lasso, graphical Lasso, stepwise regression) decomposes the FDR variable-wise and constructs, for each selected variable $j$,

$$\widehat{\mathrm{FDR}}_j = \mathbb{E}\left[\, g(p_j) \mid S_j \,\right],$$

where $S_j$ is a sufficient statistic under the null $H_j$, and $g$ is a normalized function of the p-value (e.g., $g(p) = \mathbf{1}[p \le \alpha]/\alpha$ for a threshold $\alpha$) (Luo et al., 13 Aug 2024). This estimator is conservative in finite samples and does not rely on generating synthetic null variables.
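The conservatism of this construction rests on the normalized p-value function having unit expectation under the null. A quick Monte Carlo check of that property (the choice $g(p)=\mathbf{1}[p\le\alpha]/\alpha$, the seed, and the sample size are illustrative):

```python
import numpy as np

def g(p, alpha=0.05):
    """Normalized p-value function 1[p <= alpha] / alpha."""
    return (p <= alpha) / alpha

# Under the null, p-values are Uniform(0, 1), so E[g(P)] = alpha/alpha = 1.
# This unit expectation is what makes averaging g over selected variables
# a conservative bound on the false discovery proportion.
rng = np.random.default_rng(0)
p_null = rng.uniform(size=200_000)
avg = g(p_null).mean()
assert abs(avg - 1.0) < 0.05  # holds up to Monte Carlo noise
```

Conditioning on the sufficient statistic $S_j$ (the Rao–Blackwellization step) then reduces the variance of $g(p_j)$ without disturbing this expectation.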
3. Theoretical Guarantees and Bias Properties
Most decoy-free FDR estimators are designed to have nonnegative (conservative) bias, ensuring that the estimated FDR does not underestimate the true error rate under standard assumptions (independence, correct calibration or sufficient statistic choice). For example:
- The local FDR estimator via empirical NFDR (Bickel, 2011) is asymptotically conservative under independence.
- Calibration-based FDR (Mabona et al., 29 Sep 2025) assigns each prediction an expected error of $1-s$; under correct calibration, averaging these errors yields unbiased FDR estimates.
- The Rao–Blackwellized hFDR (Luo et al., 13 Aug 2024) is conservative in finite samples by construction, and an accompanying bootstrap procedure enables finite-sample uncertainty assessment.
Simulation studies in these frameworks demonstrate declining root mean squared error (RMSE) and strong agreement between estimated and true FDR as the sample size grows, with conservatism quantified as the proportion of estimates exceeding the true FDR (Bickel, 2011).
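The agreement between estimated and realized FDR under correct calibration can be reproduced with a toy simulation. The Beta score distribution, threshold, seed, and sample size below are arbitrary assumptions made for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulate well-calibrated confidences: given score s, the prediction
# is correct with probability exactly s.
s = rng.beta(5, 2, size=50_000)
correct = rng.uniform(size=s.size) < s

tau = 0.7
accepted = s >= tau
estimated_fdr = (1.0 - s[accepted]).mean()   # decoy-free estimate
true_fdp = 1.0 - correct[accepted].mean()    # realized error proportion
print(round(estimated_fdr, 3), round(true_fdp, 3))
```

With tens of thousands of simulated predictions the two quantities agree to within sampling noise; miscalibrating the scores (e.g., inflating them) makes the estimate anti-conservative, mirroring the calibration caveat discussed in Section 6.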
4. Practical Applications Across Disciplines
Decoy-free FDR estimation methods have been implemented and validated in a variety of high-dimensional applications:
- In proteomics, Winnow provides FDR estimation for de novo peptide sequencing by learning experiment- and spectrum-specific calibration, supporting both zero-shot and dataset-specific scenarios. This approach closely tracks FDR values obtained from gold-standard database search (Mabona et al., 29 Sep 2025).
- In genomics or variable selection, decoy-free estimators give interpretable per-discovery confidence without constructing knockoff or permutation null variables. Applications include Lasso variable selection, regression models, and networks in systems biology (Luo et al., 13 Aug 2024).
- Simple automatic LFDR methodologies (Bickel, 2011) allow conservative error quantification even with very small numbers of hypotheses, such as small-batch proteomics or clinical biomarker panels.
These methods excel where explicit decoy generation is difficult, computationally expensive, or leads to unstable FDR estimates.
5. Comparison with Decoy-Dependent and Alternative Methods
Traditional FDR control approaches often invoke explicit decoy sets or permutation-based nulls (e.g., TDC or knockoff-filter frameworks), which can provide theoretical guarantees under exchangeability and independence. However, these can suffer from practical drawbacks:
- Variability and estimation granularity are limited by the number of decoy constructions (Emery et al., 2019).
- Generating valid knockoffs or decoys is sometimes computationally infeasible (especially in non-Euclidean or structured settings).
- Model misspecification or deviations from the decoy generation mechanism may yield anti-conservative error rates.
Decoy-free approaches bypass or minimize these issues by working directly with calibrated confidences, empirical nulls (via density estimation or mixture modeling), or by leveraging the statistical structure through sufficient statistics and conditioning. The resulting estimators are often more stable and broadly applicable, while achieving similar or improved power and FDR control (Luo et al., 13 Aug 2024, Bickel, 2011, Mabona et al., 29 Sep 2025).
6. Calibration, Model Robustness, and Limitations
Calibration quality is critical for decoy-free estimators based on probabilistic scores; miscalibration readily leads to FDR misestimation (Mabona et al., 29 Sep 2025). As such, robust neural network calibration incorporating spectrum- and PSM-specific features is central to these methods’ success in complex scenarios (e.g., mass spectrometry with transfer learning or domain shift).
Assumptions about independence, correctness of sufficient statistics, or availability of reliable per-discovery confidence remain necessary for the theoretical guarantees of decoy-free estimators. In small-sample regimes or with highly dependent data, conservatism is maintained, but statistical efficiency may decline.
Overall, decoy-free FDR estimation forms a robust and flexible statistical paradigm that, given appropriate calibration and model structure, enables reliable multiple-testing correction in domains where traditional decoy-based or null-distribution-based approaches may be suboptimal, computationally intensive, or insufficiently rigorous.
7. Outlook and Integration with Modern Data Analysis
Recent developments point towards hybrid frameworks, combining the data efficiency and stability of decoy-free approaches with the interpretability and adaptability of competition-based FDR (e.g., target–decoy calibration or model-X knockoffs with analytic p-values) (Etourneau et al., 2022, Chang et al., 22 May 2025). Ongoing work seeks to further unify empirical Bayes, local FDR, and calibration-based lines of research, including adaptive tuning of null proportion estimators (Gao, 2023), mixture modeling for spectral identification (Peng et al., 2020), and privacy-preserving FDR procedures in high-dimensional regimes (Cai et al., 2023).
As deep learning and probabilistic scoring become pervasive in scientific data analysis, decoy-free FDR estimation—anchored in proper score calibration and principled, model-agnostic error quantification—will become increasingly central to reliable discovery and reproducible science.