
Selection-Adjusted Empirical Bayes Confidence Procedures

Updated 19 September 2025
  • The paper introduces selection-adjusted methodologies that correct for double-use bias in empirical Bayes LFDR estimation using MDL, L1O, and L½O techniques.
  • It demonstrates through simulations and real protein abundance studies that estimator choice critically impacts bias control and inference accuracy.
  • It recommends an optimally weighted hybrid (MDL-BBE) estimator to balance bias reduction and robustness under varying sparsity and signal intensity conditions.

Selection-adjusted empirical Bayes confidence procedures comprise a set of methodologies for constructing confidence intervals or regions in settings where the parameter(s) of interest are selected from many candidates based on observed data, and where information is borrowed across units via empirical Bayes ideas. These procedures explicitly adjust for the bias and over-optimism that typically arise from such selection, especially in high-dimensional or multipopulation studies, aiming to ensure that uncertainty quantification remains valid after the selection step. Methods in this area span from corrected maximum likelihood estimation of local false discovery rates (LFDR) for small and medium-scale data, to interval construction that balances selective coverage, efficiency, and width, often using theoretical frameworks such as the minimum description length (MDL) principle.

1. Fundamental Corrections for LFDR Estimation

A core challenge addressed by selection-adjusted empirical Bayes confidence procedures is the "double-use" bias inherent in conventional empirical Bayes maximum likelihood estimators (MLEs) of the LFDR, particularly when the number of features (e.g., genes, proteins) is small. The local false discovery rate at feature $i$, denoted $\psi_i$, is the posterior probability that the null hypothesis is true given the observed data. In the traditional "leave-zero-out" (L0O) MLE approach, all features, including the $i$th, are used both to estimate the prior parameters (such as $\pi_0$ and $\theta_{\text{alt}}$) and to compute the likelihood at $t_i$:

$$\langle\hat{\theta}^{\text{L0O}}, \hat{\pi}_0^{\text{L0O}}\rangle = \underset{\theta,\,\pi_0}{\arg\max}\ \prod_{j=1}^N \left[\pi_0\, g_{\theta_0}(t_j) + (1-\pi_0)\, g_\theta(t_j)\right]$$

$$\hat{\psi}_i^{\text{L0O}} = \frac{\hat{\pi}_0^{\text{L0O}}\, g_{\theta_0}(t_i)}{\hat{\pi}_0^{\text{L0O}}\, g_{\theta_0}(t_i) + \left(1-\hat{\pi}_0^{\text{L0O}}\right) g_{\hat{\theta}^{\text{L0O}}}(t_i)}$$

This approach creates a substantial negative bias, especially in small-$N$ settings, due to the use of $t_i$ both in parameter estimation and in inference for $t_i$ itself.
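
To make the fit concrete, here is a minimal sketch of the L0O estimator under an illustrative two-group normal mixture, with null $g_{\theta_0} = N(0,1)$ and alternative $g_\theta = N(\theta,1)$; the densities, starting values, and bounds are assumptions for illustration, not specifications from the paper.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def mixture_nll(params, t):
    """Negative log-likelihood of the two-group mixture over all of t."""
    theta, pi0 = params
    lik = pi0 * norm.pdf(t, 0.0, 1.0) + (1.0 - pi0) * norm.pdf(t, theta, 1.0)
    return -np.sum(np.log(lik + 1e-300))  # small floor guards against log(0)

def fit_l0o(t):
    """L0O MLE: every feature, including t_i, enters the parameter fit."""
    res = minimize(mixture_nll, x0=[2.0, 0.5], args=(t,),
                   bounds=[(-10.0, 10.0), (1e-3, 1.0 - 1e-3)])
    return res.x  # (theta_hat, pi0_hat)

def lfdr(t_i, theta_hat, pi0_hat):
    """Estimated local false discovery rate at a single statistic t_i."""
    null = pi0_hat * norm.pdf(t_i, 0.0, 1.0)
    alt = (1.0 - pi0_hat) * norm.pdf(t_i, theta_hat, 1.0)
    return null / (null + alt)

t = np.random.default_rng(0).normal(0.0, 1.0, size=20)  # toy all-null data
theta_hat, pi0_hat = fit_l0o(t)
psi_l0o = np.array([lfdr(ti, theta_hat, pi0_hat) for ti in t])
```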

To address this, the paper introduces several adjusted estimators:

  • MDL (Minimum Description Length) Estimator: Excludes $t_i$ entirely from parameter estimation:

$$\langle\hat{\theta}_i^{\text{MDL}}, \hat{\pi}_{0i}^{\text{MDL}}\rangle = \underset{\theta,\,\pi_0}{\arg\max}\ \prod_{j\neq i} \left[\pi_0\, g_{\theta_0}(t_j) + (1-\pi_0)\, g_\theta(t_j)\right]$$

then computes

$$\hat{\psi}_i^{\text{MDL}} = \frac{\hat{\pi}_{0i}^{\text{MDL}}\, g_{\theta_0}(t_i)}{\hat{\pi}_{0i}^{\text{MDL}}\, g_{\theta_0}(t_i) + \left(1-\hat{\pi}_{0i}^{\text{MDL}}\right) g_{\hat{\theta}_i^{\text{MDL}}}(t_i)}$$

  • Leave-one-out (L1O): Removes $t_i$ only when estimating the alternative-hypothesis parameters.
  • Leave-half-out (L½O): Downweights $t_i$'s contribution to estimation by a factor $\nu = 1/2$, allowing partial information retention.

These corrections ensure that, for each feature, the LFDR estimate is less biased with respect to selection, i.e., is "selection-adjusted"; a minimal sketch of the corrected fits follows.
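
Under the same illustrative mixture, the corrections can be sketched with a weighted likelihood that scales $t_i$'s log-likelihood term by $\nu$: $\nu = 0$ recovers MDL (full exclusion), $\nu = 1/2$ gives L½O, and $\nu = 1$ falls back to L0O. L1O, which drops $t_i$ only from the alternative-parameter fit, would require a two-stage fit and is omitted here. The sketch reuses `lfdr` and the toy data `t` from the code above.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def weighted_nll(params, t, i, nu):
    """Mixture negative log-likelihood with t_i's term downweighted by nu."""
    theta, pi0 = params
    lik = pi0 * norm.pdf(t, 0.0, 1.0) + (1.0 - pi0) * norm.pdf(t, theta, 1.0)
    w = np.ones_like(t)
    w[i] = nu  # nu = 0 excludes the feature under inference entirely
    return -np.sum(w * np.log(lik + 1e-300))

def lfdr_corrected(t, i, nu=0.0):
    """nu=0: MDL; nu=0.5: leave-half-out; nu=1: the uncorrected L0O."""
    res = minimize(weighted_nll, x0=[2.0, 0.5], args=(t, i, nu),
                   bounds=[(-10.0, 10.0), (1e-3, 1.0 - 1e-3)])
    theta_i, pi0_i = res.x
    return lfdr(t[i], theta_i, pi0_i)

psi_mdl = np.array([lfdr_corrected(t, i, nu=0.0) for i in range(len(t))])
psi_lho = np.array([lfdr_corrected(t, i, nu=0.5) for i in range(len(t))])
```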

2. Bias Reduction and Small-Scale Applicability

By omitting $t_i$ or reducing its influence in parameter estimation, the MDL and related corrected MLE approaches substantially reduce the double-use bias. This improvement is most critical in small- and medium-scale studies (e.g., conventional gene expression, proteomics, metabolomics), where conventional empirical Bayes procedures can be badly biased. Simulations in the paper demonstrate:

  • Corrected MLEs (especially MDL and L1O) provide low bias when the fraction of affected features is moderate to large.
  • When only a solitary feature is affected, completely omitting $t_i$ (as in MDL or L1O) can result in high positive bias, which is mitigated by L½O.
  • All corrected MLEs suffer strong negative bias when over 90% of features are unaffected, underscoring the necessity of bias correction in sparse signal settings and the challenge of unknown sparsity in practice.
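
An illustrative Monte Carlo along these lines (not the paper's simulation design; the feature count, effect size $\theta = 3$, and replication number are arbitrary) can reproduce the qualitative pattern, reusing `fit_l0o`, `lfdr`, and `lfdr_corrected` from the sketches above:

```python
import numpy as np

rng = np.random.default_rng(1)
N, reps, theta_true = 20, 200, 3.0
for n_alt in (2, 10, 18):            # number of truly affected features
    pi0_true = 1.0 - n_alt / N
    err_l0o, err_mdl = [], []
    for _ in range(reps):
        t = rng.normal(0.0, 1.0, N)
        t[:n_alt] += theta_true      # shift the affected features
        i = N - 1                    # inspect a truly null feature
        psi_true = lfdr(t[i], theta_true, pi0_true)
        th, p0 = fit_l0o(t)
        err_l0o.append(lfdr(t[i], th, p0) - psi_true)
        err_mdl.append(lfdr_corrected(t, i, nu=0.0) - psi_true)
    print(f"n_alt={n_alt:2d}: mean error L0O={np.mean(err_l0o):+.3f}, "
          f"MDL={np.mean(err_mdl):+.3f}")
```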

3. Empirical Comparison in Protein Abundance Studies

The bias-reduction properties are directly observed in real data. Analysis of a 20-protein abundance dataset from breast cancer studies reveals that the unadjusted L0O estimator and the adjusted MDL and L1O estimators yield very different decisions about which proteins are "affected." Notably:

  • Uncorrected methods, due to double-use, can indicate spuriously low LFDRs for extreme values (leading to possible false discovery).
  • Conservative estimators (such as the binomial-based estimator or its r-value approximation) have positive bias, particularly in nearly-null settings.
  • As shown in volcano and LFDR-vs-p-value plots, the inference about which proteins are differentially abundant is highly sensitive to the choice of estimator.

Thus, the estimator used directly impacts downstream scientific decision making, reinforcing the importance of well-calibrated, selection-adjusted estimators.

4. Simulation Outcomes and the MDL-BBE Combination

Through extensive simulations varying feature count, effect detectability, and number of non-null features, the following findings emerge:

  • All corrected MLEs (including MDL) show notable negative bias when all features are null; conservative estimators are less biased in this case.
  • When several features are affected, conservative estimators (BBE, RV) are positively biased, but MDL and L1O yield low bias.
  • No single estimator uniformly dominates all scenarios.

To address these trade-offs, the paper introduces an optimally weighted estimator (MDL-BBE) that linearly combines the MDL estimator (low bias in dense settings) with the binomial-based estimator (robust in extremely sparse settings). This hedging achieves lower worst-case bias across the range of possible proportions of affected features.
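
A minimal sketch of the combination itself, treating the conservative BBE estimates as a given input array (their construction is not reproduced here):

```python
import numpy as np

def mdl_bbe(psi_mdl, psi_bbe, w):
    """Convex combination of LFDR estimates: w=1 is pure MDL, w=0 pure BBE."""
    return w * np.asarray(psi_mdl) + (1.0 - w) * np.asarray(psi_bbe)
```

Because the combination is linear, its bias in any scenario is the same convex combination of the component biases, which is what makes a minimax choice of the weight tractable.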

5. Practical Recommendations for Real-World Data Analysis

Because the true proportion of affected features is unknown in real data, and since no method is uniformly best across all scenarios, the paper recommends an optimally weighted combination of the best corrected MLE (typically MDL) and a conservative estimator (such as BBE or RV). The optimal linear weights are chosen by the minimax considerations described in the text (Section 3.2.2).

This hybrid approach ensures protection against the risks of negative bias (false discoveries) and excessive conservativeness (failure to discover real effects) under uncertainty about actual sparsity/density of signals.
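
As a hypothetical sketch of that weight selection (the paper's exact criterion in Section 3.2.2 may differ), one can grid-search the weight that minimizes worst-case absolute bias across candidate sparsity scenarios; the per-scenario bias inputs below are made-up numbers standing in for simulation output:

```python
import numpy as np

def minimax_weight(bias_mdl, bias_bbe, grid=np.linspace(0.0, 1.0, 101)):
    """Grid-search the combination weight minimizing worst-case |bias|."""
    bias_mdl, bias_bbe = np.asarray(bias_mdl), np.asarray(bias_bbe)
    worst = [np.max(np.abs(w * bias_mdl + (1.0 - w) * bias_bbe))
             for w in grid]
    return grid[int(np.argmin(worst))]

# Hypothetical per-scenario biases (dense, moderate, nearly-null settings):
w_star = minimax_weight(bias_mdl=[0.02, -0.05, -0.30],
                        bias_bbe=[0.20, 0.12, 0.05])
```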

6. Minimum Description Length Principle and Theoretical Foundations

The theoretical motivation for the best-performing correction (the MDL estimator) relies on the minimum description length principle. By minimizing the codelength, i.e., the negative log-probability ($-\log$) of the observed data, excluding $t_i$ for feature $i$, the MDL estimator provides a natural information-theoretic adjustment for selection effects. This framework corresponds to using out-of-sample log-likelihood for parameter estimation, thereby avoiding overfitting to selected observations.
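
Concretely, the MDL fit of Section 1 minimizes the codelength of the remaining data,

$$\mathcal{L}_i(\theta, \pi_0) = -\sum_{j \neq i} \log\left[\pi_0\, g_{\theta_0}(t_j) + (1 - \pi_0)\, g_\theta(t_j)\right],$$

so that $\langle\hat{\theta}_i^{\text{MDL}}, \hat{\pi}_{0i}^{\text{MDL}}\rangle = \arg\min_{\theta,\,\pi_0} \mathcal{L}_i(\theta, \pi_0)$, which coincides with the leave-one-out maximum-likelihood fit displayed above.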

Appendices in the paper further connect the MDL estimator to minimax coding theory and establish its asymptotic and universality properties, supporting its use as a rigorously justified selection-adjusted empirical Bayes estimator.


In summary, selection-adjusted empirical Bayes confidence procedures grounded in MDL-corrected MLEs and optimal estimator combination provide low-bias, theoretically motivated, and practically robust solutions for inference in small-to-moderate scale studies where selection-induced bias is non-negligible. The key innovation is the principled avoidance of double-counting information, direct control of bias under varying sparsity, and rigorous justification via coding-theoretic arguments, culminating in improved reliability of interval estimation and hypothesis testing in real-world biological applications (Padilla et al., 2010).
