- The paper introduces a theoretical framework based on maximizing expected gain (MEG) and a generalized centroid estimator, the γ-centroid estimator, for designing bioinformatics predictors that align with specific accuracy measures.
- The γ-centroid estimator maximizes the expected value of true negatives plus γ times true positives; under certain conditions on the predictive space, this is equivalent to maximizing the sum of the marginalized probabilities that exceed a threshold determined by γ.
- This framework facilitates efficient computation using dynamic programming for problems like sequence alignment and RNA structure prediction and applies to different types of estimation problems.
The paper introduces a theoretical framework for designing estimators in bioinformatics, particularly for problems involving high-dimensional binary spaces, and presents a generalized centroid estimator called the γ-centroid estimator. The motivation is the observation that many existing estimators, such as maximum likelihood (ML) estimators and centroid estimators, often do not align well with the accuracy measures commonly used in bioinformatics. The paper addresses this discrepancy with a unified approach based on the principle of maximum expected accuracy (MEA).
The paper formalizes various estimation problems in bioinformatics, such as pairwise sequence alignment and RNA secondary structure prediction, as instances of a general estimation problem on a binary space. It introduces the concept of a maximum expected gain (MEG) estimator, which maximizes an estimator-specific gain function with respect to a given probability distribution. The gain function represents the reward for making a particular prediction, and by designing the gain function appropriately, the MEG estimator can be tailored to specific accuracy measures.
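As a rough illustration of the MEG idea, the following Python sketch enumerates a tiny binary predictive space and picks the prediction with the highest expected gain. The posterior values, the function names (`meg_estimate`, `exact_match_gain`, `agreement_gain`), and the two gain functions are assumptions chosen for illustration and are not taken from the paper; brute-force enumeration is only feasible for toy spaces.

```python
from itertools import product


def meg_estimate(posterior, gain, candidates):
    """Return the candidate y that maximizes sum_theta gain(theta, y) * p(theta | D)."""
    def expected_gain(y):
        return sum(p * gain(theta, y) for theta, p in posterior.items())
    return max(candidates, key=expected_gain)


def exact_match_gain(theta, y):
    """Reward 1 only when the prediction equals the true vector exactly."""
    return 1.0 if theta == y else 0.0


def agreement_gain(theta, y):
    """Reward the number of positions where prediction and truth agree (TP + TN)."""
    return float(sum(t == v for t, v in zip(theta, y)))


# Hypothetical posterior over 3-bit vectors (illustrative numbers only).
posterior = {
    (1, 1, 0): 0.40,
    (0, 1, 1): 0.35,
    (0, 0, 0): 0.25,
}
candidates = list(product([0, 1], repeat=3))

most_probable = meg_estimate(posterior, exact_match_gain, candidates)
best_agreement = meg_estimate(posterior, agreement_gain, candidates)
print(most_probable, best_agreement)
```

With the exact-match gain, the MEG estimate is the most probable vector. With the agreement-counting gain, it is a vector that has zero posterior probability itself but agrees with the posterior in more positions on average, which shows how the choice of gain function changes the prediction.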
The γ-centroid estimator is introduced as a generalization of the centroid estimator. It maximizes the expected value of TN + γTP, where TN is the number of true negatives, TP is the number of true positives, and γ is a parameter that adjusts the balance between the gain from true negatives and the gain from true positives. The paper demonstrates that the MEG estimator for a gain function that is a linear combination of TP, TN, FP (false positives), and FN (false negatives) is equivalent to a γ-centroid estimator with a specific value of γ. Specifically, $\gamma = \frac{\alpha_1 + \alpha_4}{\alpha_2 + \alpha_3}$, where $\alpha_1$, $\alpha_2$, $\alpha_3$, and $\alpha_4$ are positive constants weighting TP, TN, FP, and FN, respectively, with FP and FN entering the gain as penalties.
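The value of γ can be checked with a short derivation. This sketch assumes the gain is written with FP and FN as penalties (negative signs) and uses $p_i = p(\theta_i = 1 \mid D)$; the symbol $\Delta_i$ is introduced here for illustration only.

```latex
\begin{align*}
G(\theta, y) &= \alpha_1\,\mathrm{TP} + \alpha_2\,\mathrm{TN}
               - \alpha_3\,\mathrm{FP} - \alpha_4\,\mathrm{FN}
  && \text{(assumed sign convention)} \\
\text{expected gain at } i \text{ if } y_i = 1 &:\quad
  \alpha_1 p_i - \alpha_3 (1 - p_i) \\
\text{expected gain at } i \text{ if } y_i = 0 &:\quad
  \alpha_2 (1 - p_i) - \alpha_4 p_i \\
\Delta_i &= (\alpha_1 + \alpha_4)\, p_i - (\alpha_2 + \alpha_3)(1 - p_i) \\
         &= (\alpha_2 + \alpha_3)\bigl[\gamma\, p_i - (1 - p_i)\bigr],
  \qquad \gamma = \frac{\alpha_1 + \alpha_4}{\alpha_2 + \alpha_3}
\end{align*}
```

Because $\alpha_2 + \alpha_3 > 0$, the advantage $\Delta_i$ of predicting 1 over 0 at position $i$ is a positive multiple of the corresponding advantage under the gain TN + γTP, so both gains select the same prediction.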
The paper provides a theorem stating that, under certain conditions on the predictive space, the γ-centroid estimator is equivalent to the estimator that maximizes the sum of the marginalized probabilities $p_i$ that are greater than $1/(\gamma+1)$ in the prediction.
Here $p_i = p(\theta_i = 1 \mid D)$ is the marginalized probability that the $i$-th dimension of the predictive space is 1, given the data $D$.
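The origin of the threshold $1/(\gamma+1)$ can be seen by expanding the expected gain of the γ-centroid estimator. The following is a sketch based only on the definition of the gain as TN + γTP and on $p_i = p(\theta_i = 1 \mid D)$; it does not reproduce the conditions of the paper's theorem.

```latex
\begin{align*}
E_{\theta \mid D}\bigl[G(\theta, y)\bigr]
  &= \sum_{i=1}^{n} \bigl[(1 - p_i)\, I(y_i = 0) + \gamma\, p_i\, I(y_i = 1)\bigr] \\
  &= \sum_{i=1}^{n} (1 - p_i)
     + \sum_{i=1}^{n} \bigl[(\gamma + 1)\, p_i - 1\bigr]\, I(y_i = 1)
\end{align*}
```

The first sum does not depend on $y$, so only the second sum matters for the maximization. Setting $y_i = 1$ increases the objective exactly when $(\gamma + 1)\, p_i - 1 > 0$, that is, when $p_i > 1/(\gamma + 1)$.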
For cases where γ∈[0,1], the paper presents a corollary showing that the γ-centroid estimator contains its consensus estimator, which is an estimator that independently predicts each dimension based on whether its marginalized probability is greater than 1/(γ+1).
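The consensus estimator described above can be sketched in a few lines of Python. The function name `consensus_estimate` and the probability values are hypothetical. The example treats the marginals as the pairing probabilities of one position with three candidate partners, under the assumption that these probabilities sum to at most 1.

```python
def consensus_estimate(marginals, gamma):
    """Predict each dimension independently: 1 if p_i > 1 / (gamma + 1), else 0."""
    threshold = 1.0 / (gamma + 1.0)
    return [1 if p > threshold else 0 for p in marginals]


# Hypothetical probabilities that one position pairs with each of three partners.
partner_probs = [0.55, 0.30, 0.10]

# gamma = 1 gives threshold 0.5, so at most one partner can exceed it.
low_gamma = consensus_estimate(partner_probs, gamma=1.0)

# gamma = 3 gives threshold 0.25, so two partners exceed it and the independent
# predictions conflict with a one-partner-per-position constraint.
high_gamma = consensus_estimate(partner_probs, gamma=3.0)
print(low_gamma, high_gamma)
```

This gives some intuition for the restriction to γ∈[0,1]: the threshold is then at least 1/2, and among non-negative probabilities that sum to at most 1, no more than one can exceed 1/2. For larger γ the independent predictions may violate the constraints of the predictive space, so a constrained optimization is needed instead.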
The paper also discusses how to efficiently compute the γ-centroid estimator for specific problems such as pairwise sequence alignment and RNA secondary structure prediction, using dynamic programming algorithms.
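For RNA secondary structure prediction, one way to sketch such a dynamic program is a Nussinov-style recursion that maximizes the sum of $(\gamma+1)\,p_{ij} - 1$ over nested base pairs. The code below is a minimal illustration under stated assumptions, not the paper's implementation: `bpp` is assumed to be a precomputed base-pairing probability matrix, the function name `gamma_centroid_structure` and the toy values are hypothetical, and no minimum hairpin loop length is enforced.

```python
def gamma_centroid_structure(bpp, gamma):
    """Return (score, pairs) maximizing sum of (gamma + 1) * bpp[i][j] - 1
    over pseudoknot-free structures, using an O(n^3) Nussinov-style recursion."""
    n = len(bpp)
    if n == 0:
        return 0.0, []
    # best[i][j]: best score on positions i..j; entries with j <= i stay 0.
    best = [[0.0] * n for _ in range(n)]
    choice = [[None] * n for _ in range(n)]
    for span in range(1, n):
        for i in range(n - span):
            j = i + span
            # Splits cover "i unpaired" (k == i) and "j unpaired" (k == j - 1).
            score, how = best[i + 1][j], ("split", i)
            for k in range(i + 1, j):
                split = best[i][k] + best[k + 1][j]
                if split > score:
                    score, how = split, ("split", k)
            # Pair i with j only if it strictly improves the score.
            paired = best[i + 1][j - 1] + (gamma + 1.0) * bpp[i][j] - 1.0
            if paired > score:
                score, how = paired, ("pair",)
            best[i][j], choice[i][j] = score, how
    # Trace back the chosen base pairs.
    pairs, stack = [], [(0, n - 1)]
    while stack:
        i, j = stack.pop()
        if i >= j:
            continue
        how = choice[i][j]
        if how[0] == "pair":
            pairs.append((i, j))
            stack.append((i + 1, j - 1))
        else:
            k = how[1]
            stack.append((i, k))
            stack.append((k + 1, j))
    return best[0][n - 1], sorted(pairs)


# Hypothetical 6-position base-pairing probability matrix (upper triangle used).
toy_bpp = [[0.0] * 6 for _ in range(6)]
toy_bpp[0][5] = 0.9
toy_bpp[1][4] = 0.6
toy_bpp[2][3] = 0.3

print(gamma_centroid_structure(toy_bpp, gamma=1.0))
print(gamma_centroid_structure(toy_bpp, gamma=4.0))
```

In this toy example, γ = 1 keeps only the two pairs whose probability exceeds 1/2, while γ = 4 lowers the threshold to 1/5 and also admits the weaker pair (2, 3), illustrating how larger γ favors predicting more positives.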
In addition to standard estimation problems where the probability distribution is defined on the predictive space, the paper introduces a new category of estimation problems where the probability distribution is defined on a parameter space that differs from the predictive space. Two types of estimators for such problems are discussed: estimators for representative prediction and estimators based on marginalized distributions.
For representative prediction, the paper introduces the concept of a homogeneous generalized gain function, which integrates the gain for each parameter. It shows that a representative prediction problem with any homogeneous generalized gain function can be solved similarly to a standard estimation problem with an averaged probability distribution.
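The reduction to an averaged distribution follows from linearity of expectation. The sketch below assumes, for illustration only, that there are $K$ posterior distributions $p^{(k)}(\theta \mid D)$ defined on a common space and that the generalized gain sums the same gain $G$ over them; the paper's precise definitions may differ, and the symbols $K$, $p^{(k)}$, and $\bar{p}$ are introduced here.

```latex
\begin{align*}
\sum_{k=1}^{K} \int G(\theta, y)\, p^{(k)}(\theta \mid D)\, d\theta
  &= \int G(\theta, y) \sum_{k=1}^{K} p^{(k)}(\theta \mid D)\, d\theta \\
  &= K \int G(\theta, y)\, \bar{p}(\theta \mid D)\, d\theta,
  \qquad \bar{p}(\theta \mid D) = \frac{1}{K} \sum_{k=1}^{K} p^{(k)}(\theta \mid D)
\end{align*}
```

Since the factor $K$ does not affect the maximizer, the prediction that maximizes the summed expected gain is the MEG estimate under the averaged distribution $\bar{p}$.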
For estimation problems involving marginalized distributions, the paper discusses the challenge of calculating the marginalized distribution in actual problems and introduces the concept of an approximated γ-type estimator to reduce computational costs. This estimator uses a factorized probability distribution and a γ-type pointwise gain function to reduce inconsistencies caused by the factorization.
The paper concludes by discussing the properties of the γ-centroid estimator, including its relationship to existing estimators and its ability to balance sensitivity and positive predictive value. It also addresses the computational efficiency of the γ-centroid estimator and its applicability to various problems in bioinformatics.
Key definitions from the paper:
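- Bayesian ML estimator: $\hat{y}^{(\mathrm{ML})} = \operatorname*{argmax}_{y \in Y} p(y \mid D)$
- Gain function: $G : Y \times Y \to \mathbb{R}^{+}$
- MEG estimator: $\hat{y}^{(\mathrm{MEG})} = \operatorname*{argmax}_{y \in Y} \int G(\theta, y)\, p(\theta \mid D)\, d\theta$
- Pointwise gain function: $G(\theta, y) = \sum_{i=1}^{n} F_i(\theta, y_i)$
- Consensus estimator: $\hat{y}_i^{(c)} = \operatorname*{argmax}_{y_i \in \{0,1\}} E_{\theta \mid D}\left[F_i(\theta, y_i)\right]$
- γ-centroid estimator: MEG estimator with $F_i(\theta, y_i) = I(\theta_i = 0)\, I(y_i = 0) + \gamma\, I(\theta_i = 1)\, I(y_i = 1)$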