Generalized Centroid Estimators in Bioinformatics

Published 19 May 2013 in q-bio.QM and cs.LG | (1305.4339v1)

Abstract: In a number of estimation problems in bioinformatics, accuracy measures of the target problem are usually given, and it is important to design estimators that are suitable to those accuracy measures. However, there is often a discrepancy between an employed estimator and a given accuracy measure of the problem. In this study, we introduce a general class of efficient estimators for estimation problems on high-dimensional binary spaces, which represent many fundamental problems in bioinformatics. Theoretical analysis reveals that the proposed estimators generally fit with commonly-used accuracy measures (e.g. sensitivity, PPV, MCC and F-score) as well as it can be computed efficiently in many cases, and cover a wide range of problems in bioinformatics from the viewpoint of the principle of maximum expected accuracy (MEA). It is also shown that some important algorithms in bioinformatics can be interpreted in a unified manner. Not only the concept presented in this paper gives a useful framework to design MEA-based estimators but also it is highly extendable and sheds new light on many problems in bioinformatics.

Authors (4)

Citations (15)

Summary

The paper introduces a theoretical framework based on maximizing expected gain (MEG) and a generalized centroid estimator called the gamma-centroid estimator to design bioinformatics predictors that align with specific accuracy measures.
The gamma-centroid estimator maximizes the expected value of true negatives plus gamma times true positives and, under certain conditions on the predictive space, is equivalent to maximizing the sum of the marginalized probabilities that exceed the threshold 1/(gamma+1).
This framework facilitates efficient computation using dynamic programming for problems like sequence alignment and RNA structure prediction and applies to different types of estimation problems.

The paper introduces a theoretical framework for designing estimators in bioinformatics, particularly for problems involving high-dimensional binary spaces, and presents a generalized centroid estimator called the $\gamma$-centroid estimator. The motivation stems from the observation that many existing estimators, such as maximum likelihood (ML) estimators and centroid estimators, often do not align well with commonly used accuracy measures in bioinformatics. The paper aims to address this discrepancy by providing a unified approach based on the principle of maximum expected accuracy (MEA).

The paper formalizes various estimation problems in bioinformatics, such as pairwise sequence alignment and RNA secondary structure prediction, as instances of a general estimation problem on a binary space. It introduces the concept of a maximum expected gain (MEG) estimator, which maximizes the expected value of an estimator-specific gain function under a given probability distribution. The gain function represents the reward for making a particular prediction, and by designing the gain function appropriately, the MEG estimator can be tailored to specific accuracy measures.
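As a minimal sketch of the MEG idea, the code below enumerates a tiny binary space and picks the prediction with the highest expected gain. The function names, the toy posterior, and the two gain functions are illustrative assumptions, not taken from the paper; on a discrete space the integral in the MEG definition becomes a sum.

```python
def meg_estimate(posterior, candidates, gain):
    """Return the candidate y with the largest expected gain.

    posterior:  dict mapping binary tuples theta to probabilities (toy input)
    candidates: iterable of binary tuples forming the predictive space
    gain:       function gain(theta, y) -> float
    """
    def expected_gain(y):
        return sum(prob * gain(theta, y) for theta, prob in posterior.items())
    return max(candidates, key=expected_gain)


def delta_gain(theta, y):
    # Rewards only an exact match; the MEG estimator is then the posterior mode.
    return 1.0 if theta == y else 0.0


def matching_gain(theta, y):
    # Counts agreeing positions (TP + TN), i.e. the centroid-type gain.
    return float(sum(t == v for t, v in zip(theta, y)))


# Hypothetical posterior over 2-bit vectors.
posterior = {(1, 1): 0.4, (1, 0): 0.3, (0, 0): 0.3}
candidates = [(0, 0), (0, 1), (1, 0), (1, 1)]

ml = meg_estimate(posterior, candidates, delta_gain)
centroid = meg_estimate(posterior, candidates, matching_gain)
print(ml, centroid)
```

In this toy case the two gain functions select different predictions, which is the kind of discrepancy between estimator and accuracy measure that the framework is meant to expose.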

The $\gamma$-centroid estimator is introduced as a generalization of the centroid estimator. It maximizes the expected value of $TN + \gamma TP$, where $TN$ is the number of true negatives, $TP$ is the number of true positives, and $\gamma$ is a parameter that adjusts the balance between the gain from true negatives and true positives. The paper demonstrates that the MEG estimator for a gain function that is a linear combination of $TP$, $TN$, $FP$ (false positives), and $FN$ (false negatives) is equivalent to a $\gamma$-centroid estimator with a specific value of $\gamma$. Specifically, for a gain of the form $\alpha_1 TP + \alpha_2 TN - \alpha_3 FP - \alpha_4 FN$, where $\alpha_1$, $\alpha_2$, $\alpha_3$, and $\alpha_4$ are positive constants weighting $TP$, $TN$, $FP$, and $FN$, respectively, the equivalent parameter is $\gamma=\frac{\alpha_1+\alpha_4}{\alpha_2+\alpha_3}$.
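The value of $\gamma$ can be checked with a short derivation. The sketch below assumes the gain is written as $\alpha_1 TP + \alpha_2 TN - \alpha_3 FP - \alpha_4 FN$, so that false predictions are penalized; this sign convention is an assumption made here because it is the one consistent with the stated formula. The helper symbols $P = TP + FN$ and $N = TN + FP$ (the numbers of positive and negative positions in $\theta$) are introduced only for this illustration.

```latex
\begin{align*}
G(\theta, y)
  &= \alpha_1 TP + \alpha_2 TN - \alpha_3 FP - \alpha_4 FN \\
  &= \alpha_1 TP + \alpha_2 TN - \alpha_3 (N - TN) - \alpha_4 (P - TP) \\
  &= (\alpha_1 + \alpha_4)\, TP + (\alpha_2 + \alpha_3)\, TN - \alpha_3 N - \alpha_4 P \\
  &= (\alpha_2 + \alpha_3) \left[ TN + \frac{\alpha_1 + \alpha_4}{\alpha_2 + \alpha_3}\, TP \right] - \alpha_3 N - \alpha_4 P
\end{align*}
```

Because $P$ and $N$ depend only on $\theta$ and $\alpha_2 + \alpha_3 > 0$, the prediction $y$ that maximizes the expected value of $G$ also maximizes the expected value of $TN + \gamma TP$ with $\gamma = (\alpha_1+\alpha_4)/(\alpha_2+\alpha_3)$.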

The paper provides a theorem stating that, under certain conditions on the predictive space, the $\gamma$-centroid estimator is equivalent to the estimator that maximizes the sum of the marginalized probabilities $p_i$ that are greater than $1/(\gamma+1)$ in the prediction. Here $p_i=p(\theta_i=1|D)$ is the marginalized probability that the $i$-th dimension of the predictive space is 1, given the data $D$.
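The threshold $1/(\gamma+1)$ follows from writing the expected gain as $\sum_i \left[(1-p_i)(1-y_i) + \gamma p_i y_i\right] = \text{const} + \sum_i y_i\left[(\gamma+1)p_i - 1\right]$: setting $y_i = 1$ helps only when $p_i > 1/(\gamma+1)$. The sketch below checks this numerically in the simplest setting, an unconstrained space $\{0,1\}^n$, which is an assumption of this example rather than the general condition of the theorem; the function names and marginals are hypothetical.

```python
from itertools import product


def expected_gain(y, marginals, gamma):
    # E[TN + gamma * TP] depends on the posterior only through the marginals p_i.
    return sum((1 - p) * (1 - yi) + gamma * p * yi for yi, p in zip(y, marginals))


def gamma_centroid_bruteforce(marginals, gamma):
    # Exhaustive search over all binary vectors (feasible only for tiny n).
    space = product((0, 1), repeat=len(marginals))
    return max(space, key=lambda y: expected_gain(y, marginals, gamma))


def gamma_centroid_threshold(marginals, gamma):
    # Predict 1 exactly where the marginal exceeds 1 / (gamma + 1).
    threshold = 1 / (gamma + 1)
    return tuple(int(p > threshold) for p in marginals)


marginals = [0.9, 0.4, 0.2, 0.6]  # toy values, not from the paper
for gamma in (1, 2):
    by_search = gamma_centroid_bruteforce(marginals, gamma)
    by_threshold = gamma_centroid_threshold(marginals, gamma)
    print(gamma, by_search, by_threshold)
```

With these toy marginals the exhaustive search and the threshold rule agree for both values of $\gamma$; in constrained spaces such as alignments or secondary structures the maximization has to respect the constraints and is no longer a simple per-position threshold.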

For cases where $\gamma \in [0,1]$, the paper presents a corollary showing that the $\gamma$-centroid estimator contains its consensus estimator, that is, the estimator that independently predicts each dimension according to whether its marginalized probability is greater than $1/(\gamma+1)$.
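A small, hedged illustration of why the range of $\gamma$ matters: for $\gamma \le 1$ the threshold is at least $1/2$, and two mutually exclusive positions cannot both have marginal probability above $1/2$. The toy predictive space below (binary vectors of length 3 with at most one 1) and the posterior are assumptions made for this sketch, as are all function names.

```python
def marginals(posterior):
    # p_i = total probability of the vectors theta with theta_i = 1.
    n = len(next(iter(posterior)))
    return [sum(prob for theta, prob in posterior.items() if theta[i] == 1)
            for i in range(n)]


def consensus(posterior, gamma):
    # Independent per-position decision; may leave the predictive space.
    threshold = 1 / (gamma + 1)
    return tuple(int(p > threshold) for p in marginals(posterior))


def gamma_centroid(posterior, space, gamma):
    # Maximize sum_i y_i * ((gamma + 1) * p_i - 1) over the allowed vectors only.
    p = marginals(posterior)
    return max(space, key=lambda y: sum(yi * ((gamma + 1) * pi - 1)
                                        for yi, pi in zip(y, p)))


# Toy space: at most one position may be 1 (e.g. one partner per base).
space = [(0, 0, 0), (1, 0, 0), (0, 1, 0), (0, 0, 1)]
posterior = {(1, 0, 0): 0.45, (0, 1, 0): 0.40, (0, 0, 0): 0.15}

for gamma in (1, 2):
    c = consensus(posterior, gamma)
    print(gamma, c, c in space, gamma_centroid(posterior, space, gamma))
```

Here the consensus prediction for $\gamma = 1$ lies in the space and coincides with the $\gamma$-centroid prediction, while for $\gamma = 2$ the consensus prediction selects two exclusive positions and falls outside the space.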

The paper also discusses how to efficiently compute the $\gamma$-centroid estimator for specific problems such as pairwise sequence alignment and RNA secondary structure prediction, using dynamic programming algorithms.
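For RNA secondary structure prediction, one way such a dynamic program could look is a Nussinov-style recursion that scores each base pair $(i,j)$ by $(\gamma+1)p_{ij} - 1$ and maximizes the total score over nested structures. This is a sketch under assumptions: the recursion shape, the `min_loop` parameter, and the function name are choices made here, and the base-pairing probability matrix `bpp` is assumed to be given (in practice it would come from a separate probabilistic folding model).

```python
def gamma_centroid_fold(bpp, gamma, min_loop=3):
    """Return (best_score, sorted list of pairs) for a nested structure.

    bpp[i][j] (i < j) is an assumed, precomputed base-pairing probability.
    """
    n = len(bpp)
    if n == 0:
        return 0.0, []

    def score(i, j):
        return (gamma + 1) * bpp[i][j] - 1

    # best[i][j] = best total score on the subsequence i..j (0 if too short).
    best = [[0.0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            value = max(best[i + 1][j], best[i][j - 1],
                        best[i + 1][j - 1] + score(i, j))
            for k in range(i + 1, j):  # split into two independent parts
                value = max(value, best[i][k] + best[k + 1][j])
            best[i][j] = value

    # Traceback: recompute the same expressions to find which case was used.
    pairs, stack = [], [(0, n - 1)]
    while stack:
        i, j = stack.pop()
        if j - i <= min_loop:
            continue
        if best[i][j] == best[i + 1][j]:
            stack.append((i + 1, j))
        elif best[i][j] == best[i][j - 1]:
            stack.append((i, j - 1))
        elif best[i][j] == best[i + 1][j - 1] + score(i, j):
            pairs.append((i, j))
            stack.append((i + 1, j - 1))
        else:
            for k in range(i + 1, j):
                if best[i][j] == best[i][k] + best[k + 1][j]:
                    stack.extend([(i, k), (k + 1, j)])
                    break
    return best[0][n - 1], sorted(pairs)


# Toy probabilities for a length-8 sequence (hypothetical values).
bpp = [[0.0] * 8 for _ in range(8)]
bpp[0][7], bpp[1][6], bpp[1][7] = 0.9, 0.6, 0.4
print(gamma_centroid_fold(bpp, gamma=1.0)[1])
print(gamma_centroid_fold(bpp, gamma=0.5)[1])
```

The recursion runs in cubic time in the sequence length. Because leaving a position unpaired always scores zero, only pairs with $p_{ij} > 1/(\gamma+1)$ can appear in the result, so lowering $\gamma$ yields a more conservative structure.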

In addition to standard estimation problems where the probability distribution is defined on the predictive space, the paper introduces a new category of estimation problems where the probability distribution is defined on a parameter space that differs from the predictive space. Two types of estimators for such problems are discussed: estimators for representative prediction and estimators based on marginalized distributions.

For representative prediction, the paper introduces the concept of a homogeneous generalized gain function, which integrates the gain for each parameter. It shows that a representative prediction problem with any homogeneous generalized gain function can be solved similarly to a standard estimation problem with an averaged probability distribution.
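One way to see why an averaged distribution appears is linearity of expectation. The derivation below is a sketch under assumptions that go beyond this summary: the parameter is taken to be a tuple $\theta = (\theta^1, \ldots, \theta^K)$ whose components live in the same space as the prediction $y$, the homogeneous gain is taken to be a sum of one common gain $G'$ over the components, and $p^{(k)}$ denotes the marginal distribution of $\theta^k$. The symbols $G'$, $p^{(k)}$, and $\bar p$ are introduced here for illustration.

```latex
\begin{align*}
E_{\theta|D}\!\left[ \sum_{k=1}^{K} G'(\theta^k, y) \right]
  &= \sum_{k=1}^{K} \sum_{\theta'} G'(\theta', y)\, p^{(k)}(\theta'|D) \\
  &= K \sum_{\theta'} G'(\theta', y)\, \bar p(\theta'|D),
\qquad
\bar p(\theta'|D) = \frac{1}{K} \sum_{k=1}^{K} p^{(k)}(\theta'|D)
\end{align*}
```

Under these assumptions, maximizing the expected generalized gain over $y$ is the same as maximizing the expected value of $G'$ under the averaged distribution $\bar p$, which is a standard estimation problem.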

For estimation problems involving marginalized distributions, the paper discusses the challenge of calculating the marginalized distribution in actual problems and introduces the concept of an approximated $\gamma$-type estimator to reduce computational costs. This estimator uses a factorized probability distribution and a $\gamma$-type pointwise gain function to reduce inconsistencies caused by the factorization.

The paper concludes by discussing the properties of the $\gamma$-centroid estimator, including its relationship to existing estimators and its ability to balance sensitivity and positive predictive value. It also addresses the computational efficiency of the $\gamma$-centroid estimator and its applicability to various problems in bioinformatics.
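The balance between sensitivity and positive predictive value (PPV) can be illustrated with a toy example: raising $\gamma$ lowers the threshold $1/(\gamma+1)$, so more positions are predicted as 1. The marginals and the reference vector below are hypothetical, the space is assumed to be unconstrained so that thresholding applies directly, and the function names are chosen for this sketch.

```python
def predict(marginals, gamma):
    # Threshold rule for an unconstrained binary space.
    threshold = 1 / (gamma + 1)
    return [int(p > threshold) for p in marginals]


def sensitivity_ppv(reference, prediction):
    tp = sum(r == 1 and y == 1 for r, y in zip(reference, prediction))
    fp = sum(r == 0 and y == 1 for r, y in zip(reference, prediction))
    fn = sum(r == 1 and y == 0 for r, y in zip(reference, prediction))
    sensitivity = tp / (tp + fn) if tp + fn else float("nan")
    ppv = tp / (tp + fp) if tp + fp else float("nan")
    return sensitivity, ppv


marginals = [0.9, 0.7, 0.4, 0.3, 0.1]  # toy marginal probabilities
reference = [1, 1, 1, 0, 0]            # toy "true" labels

for gamma in (0.25, 1, 4):
    y = predict(marginals, gamma)
    print(gamma, y, sensitivity_ppv(reference, y))
```

In this toy case sensitivity rises as $\gamma$ grows, while PPV drops once a false positive is admitted at the largest value of $\gamma$.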

Key definitions from the paper:

Bayesian ML estimator: $\hat y^{(ML)}= \operatorname*{arg\,max}_{y\in Y} p(y|D)$
Gain function: $G:Y \times Y \to \mathbb{R}^+$
MEG estimator: $\hat y^{(MEG)} = \operatorname*{arg\,max}_{y \in Y} \int G(\theta, y) p(\theta|D)d\theta$
Pointwise gain function: $G(\theta, y)=\sum_{i=1}^n F_i(\theta, y_i)$
Consensus estimator: $\hat y^{(c)}_i = \operatorname*{arg\,max}_{y_i \in \{0,1\}} E_{\theta|D}\left[F_i(\theta,y_i)\right]$
$\gamma$-centroid estimator: MEG estimator with $F_i(\theta,y_i)=I(\theta_i=0)I(y_i=0)+\gamma I(\theta_i=1)I(y_i=1)$
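The indicator form of the $\gamma$-centroid gain can be written out directly. The sketch below, with function names chosen for illustration, evaluates the pointwise terms $F_i$ and confirms on one toy pair of vectors that their sum equals $TN + \gamma TP$.

```python
def pointwise_gain(theta_i, y_i, gamma):
    # F_i = I(theta_i = 0) I(y_i = 0) + gamma * I(theta_i = 1) I(y_i = 1)
    return float(theta_i == 0 and y_i == 0) + gamma * float(theta_i == 1 and y_i == 1)


def gamma_centroid_gain(theta, y, gamma):
    # G(theta, y) = sum over positions of the pointwise gains.
    return sum(pointwise_gain(t, v, gamma) for t, v in zip(theta, y))


def tn_plus_gamma_tp(theta, y, gamma):
    tn = sum(t == 0 and v == 0 for t, v in zip(theta, y))
    tp = sum(t == 1 and v == 1 for t, v in zip(theta, y))
    return tn + gamma * tp


theta = (1, 1, 0, 0, 1)  # toy "true" vector
y = (1, 0, 0, 1, 1)      # toy prediction
print(gamma_centroid_gain(theta, y, 2.0), tn_plus_gamma_tp(theta, y, 2.0))
```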