
Sparse Bayesian Partially Identified Model

Updated 17 December 2025
  • Sparse Bayesian Partially Identified Model is a statistical framework for differential analysis of high-dimensional sequence count data that explicitly models uncertainty in the absolute abundance scale and sparsity.
  • It extends the Scale-Reliant Inference framework by incorporating sparse effects and using empirical mode estimation to address nonidentifiability in compositional genomics data.
  • Empirical evaluations demonstrate that Sparse SSRV achieves robust FDR and TPR performance, outperforming or matching established methods in both simulated and real-world genomic studies.

A Sparse Bayesian Partially Identified Model (Sparse SSRV) is a statistical framework introduced for differential analysis of high-dimensional sequence count data, principally in genomics and metagenomics. The model addresses compositionality—a fundamental challenge in sequence count analysis—by embracing the nonidentifiability of absolute abundance scale and explicitly modeling uncertainty in both scaling and sparsity assumptions. The method extends the Scale-Reliant Inference (SRI) framework to incorporate sparse effect structures, providing theoretically consistent inference under scale uncertainty without imposing biologically implausible normalization constraints (Gu et al., 12 Dec 2025).

1. Statistical Formulation and Identifiability

Consider $Y_{dn}$ as the observed read count for taxon $d$ ($d=1,\ldots,D$) and sample $n$ ($n=1,\ldots,N$). The model posits latent absolute abundances $W_{dn}$, factorized as

$$W_{dn} = W^{\parallel}_{dn} \cdot W^{\perp}_n,$$

where $W^{\parallel}_{\cdot n}$ is a vector on the $D$-simplex—the proportions of taxa in sample $n$—and $W^{\perp}_n > 0$ captures the total absolute load (scale) for sample $n$. The likelihood is specified as

$$Y_{\cdot n} \mid W^{\parallel}_{\cdot n} \sim \text{Multinomial}(\lambda_n, W^{\parallel}_{\cdot n})$$

with prior $W^{\parallel}_{\cdot n} \sim \text{Dirichlet}(\alpha\, 1_D)$. Alternatively, Poisson-distributed counts may be used, preserving the rank-1 structure.

The target estimand for a binary condition $x_n \in \{0,1\}$ is the log-fold change for each taxon,

$$\theta_d = \mathbb{E}_{n : x_n = 1}[\log W_{dn}] - \mathbb{E}_{n : x_n = 0}[\log W_{dn}].$$

Substituting the factorization, the parameter of interest admits a rank-1 decomposition,

$$\theta = \theta^{\parallel} + \theta^{\perp}\, 1_D,$$

where

$$\theta^{\parallel}_d = \mathbb{E}[\log W^{\parallel}_{d\cdot} \mid x=1] - \mathbb{E}[\log W^{\parallel}_{d\cdot} \mid x=0], \qquad \theta^{\perp} = \mathbb{E}[\log W^{\perp} \mid x=1] - \mathbb{E}[\log W^{\perp} \mid x=0].$$

Sequence data do not inform $W^{\perp}_n$; hence $\theta^{\perp}$ is nonidentifiable without prior information, and only the set $\{\theta^{\parallel} + c\, 1_D : c \in \mathbb{R}\}$ is identified.
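This nonidentifiability can be seen numerically: rescaling each sample's absolute abundances by an arbitrary positive constant leaves the observed proportions—and hence the multinomial likelihood—unchanged. A minimal NumPy sketch with hypothetical toy values:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: D taxa, N samples, latent absolute abundances W.
D, N = 5, 4
W = rng.gamma(shape=2.0, scale=10.0, size=(D, N))

# Multiply each sample's abundances by an arbitrary positive scale c_n;
# the per-sample proportions (multinomial probabilities) do not change.
c = rng.uniform(0.5, 2.0, size=N)
W_scaled = W * c  # broadcasts c over taxa

P = W / W.sum(axis=0)
P_scaled = W_scaled / W_scaled.sum(axis=0)
assert np.allclose(P, P_scaled)

# Since the scale component shifts every taxon's log-fold change equally,
# only theta_parallel + c * 1_D is identified from the counts Y.
```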

2. Sparse Modeling and Prior Specification

Sparsity in this context refers to the expectation that only a minority of taxa display nonzero log-fold changes between conditions. The model treats the $\theta^{\parallel}_d$ as i.i.d. draws from a continuous density $g^{\parallel}$ with a unique mode. Sparsity is operationalized by imposing

$$\theta^{\perp} = -\text{mode}(g^{\parallel}),$$

so that, after shifting, the bulk of compositional effects is centered at zero. A prior is placed on $\theta^{\perp}$ conditional on $\theta^{\parallel}$ using a consistent estimator of the mode—typically Parzen's kernel estimator,

$$\psi(\theta^{\parallel}) = \underset{t \in [t_l, t_u]}{\arg\max}\; p_D(t; \theta^{\parallel}), \qquad p_D(t; \theta^{\parallel}) = \frac{1}{D} \sum_{d=1}^{D} K_h(t - \theta^{\parallel}_d).$$

The scale prior $p(\theta^{\perp} \mid \theta^{\parallel})$ is centered at $-\psi(\theta^{\parallel})$, with variance chosen via Laplace approximation or bootstrap to reflect uncertainty in the sparsity assumption.
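The Parzen mode estimator $\psi$ can be sketched in a few lines. The Gaussian kernel, the Silverman rule-of-thumb bandwidth default, and the grid argmax below are illustrative choices, not necessarily the paper's exact settings:

```python
import numpy as np

def parzen_mode(theta_par, h=None, grid_size=512):
    """Kernel (Parzen) estimate of the mode of the theta_parallel values.

    A minimal sketch: Gaussian kernel, grid argmax over the observed range,
    Silverman's rule-of-thumb bandwidth by default.
    """
    theta_par = np.asarray(theta_par, dtype=float)
    D = theta_par.size
    if h is None:  # Silverman's rule-of-thumb bandwidth
        h = 1.06 * theta_par.std() * D ** (-1 / 5)
    t = np.linspace(theta_par.min(), theta_par.max(), grid_size)
    # p_D(t) proportional to sum_d K_h(t - theta_d) with Gaussian K;
    # constants do not affect the argmax.
    dens = np.exp(-0.5 * ((t[:, None] - theta_par[None, :]) / h) ** 2).sum(axis=1)
    return t[np.argmax(dens)]

# Sparse effects: most taxa near zero, a few large shifts.
rng = np.random.default_rng(1)
theta_par = np.concatenate([rng.normal(0.0, 0.1, 90), rng.normal(3.0, 0.2, 10)])
mode_hat = parzen_mode(theta_par)  # near 0: the sparse minority is ignored
```

Because the bulk of the $\theta^{\parallel}_d$ sits near a common value, the kernel density peaks there, and the few truly associated taxa barely move the argmax—this robustness is what makes the mode a natural anchor for the scale prior.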

3. Inference via Scale-Reliant Inference (SRI)

Posterior estimation proceeds through the SRI framework:

  • The posterior factorizes as $p(\theta^{\parallel}, \theta^{\perp} \mid Y) = p(\theta^{\parallel} \mid Y)\, p(\theta^{\perp} \mid \theta^{\parallel})$, by the conditional independence $\theta^{\perp} \perp Y \mid \theta^{\parallel}$.
  • Draw posterior samples $W^{\parallel(s)}_{\cdot n} \sim \text{Dirichlet}(Y_{\cdot n} + \alpha\, 1_D)$ for each sample $n$.
  • For each draw, compute $\theta^{\parallel(s)}$ by averaging $\log W^{\parallel(s)}_{d\cdot}$ separately over samples with $x_n = 1$ and $x_n = 0$ and differencing.
  • Estimate the mode shift $s^{(s)} = \psi(\theta^{\parallel(s)})$.
  • Sample $\theta^{\perp(s)} \sim N(-s^{(s)}, \tau^2)$, where $\tau^2$ is estimated from the local curvature of the Parzen mode or via bootstrap.
  • Combine to yield $\theta^{(s)} = \theta^{\parallel(s)} + \theta^{\perp(s)}\, 1_D$, producing posterior draws for inference.
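The steps above can be sketched end to end as follows. This is a minimal illustration, not the authors' implementation: it assumes a Gaussian kernel with a Silverman bandwidth for $\psi$ and a user-supplied $\tau$ in place of the curvature or bootstrap estimate:

```python
import numpy as np

def sparse_ssrv_draws(Y, x, alpha=0.5, tau=0.05, S=200, seed=0):
    """Posterior draws of theta via the factorized SRI scheme (sketch).

    Y    : (D, N) count matrix; x : (N,) binary condition labels.
    alpha: Dirichlet prior concentration; tau: scale-prior s.d. (assumed fixed).
    Returns an (S, D) array of posterior draws of theta.
    """
    rng = np.random.default_rng(seed)
    D, N = Y.shape
    x = np.asarray(x)
    draws = np.empty((S, D))
    for s in range(S):
        # Step 1: posterior compositions per sample (Dirichlet conjugacy).
        W_par = np.stack(
            [rng.dirichlet(Y[:, n] + alpha) for n in range(N)], axis=1
        )
        logW = np.log(W_par)
        # Step 2: compositional log-fold changes.
        theta_par = logW[:, x == 1].mean(axis=1) - logW[:, x == 0].mean(axis=1)
        # Step 3: Parzen mode of theta_par (Gaussian kernel, grid argmax).
        h = 1.06 * theta_par.std() * D ** (-1 / 5) + 1e-12
        t = np.linspace(theta_par.min(), theta_par.max(), 256)
        dens = np.exp(-0.5 * ((t[:, None] - theta_par[None, :]) / h) ** 2).sum(axis=1)
        mode_s = t[np.argmax(dens)]
        # Step 4: draw the nonidentified scale shift, centered at -mode.
        theta_perp = rng.normal(-mode_s, tau)
        # Step 5: combine into a full posterior draw of theta.
        draws[s] = theta_par + theta_perp
    return draws
```

Note how scale uncertainty enters only through step 4: widening $\tau$ inflates every taxon's posterior interval equally, which is exactly how the model propagates doubt about the sparsity assumption into the final inference.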

4. Theoretical Properties

Under multinomial–Dirichlet regularity, with $g^{\parallel}$ uniformly continuous and uniquely moded, and standard assumptions on the kernel bandwidth, $h(D) \to 0$ and $D\, h(D)^2 \to \infty$, and on the sequencing depth, $\lambda \to \infty$:

  • The composition posterior concentrates: $W^{\parallel}_{dn} \mid Y_{\cdot n} \to_p W^{\parallel,*}_{dn}$.
  • The compositional effect posterior concentrates: $\theta^{\parallel}_d \mid Y \to_p \theta^{\parallel,*}_d$.
  • The Parzen mode estimator $\psi(\theta^{\parallel})$ is consistent for $\text{mode}(g^{\parallel})$.
  • The Bayes estimator $\bar{\theta} = \mathbb{E}[\theta \mid Y]$ is consistent: $\bar{\theta} \to_p \theta^*$ as $\lambda \to \infty$ and $D \to \infty$.
  • Posterior consistency holds: if $\text{Var}(\theta^{\perp} \mid \theta^{\parallel}) \to 0$, then $\theta_d \mid Y \to_p \theta^*_d$ for each taxon $d$.

5. Empirical Performance and Comparative Evaluation

The Sparse SSRV estimator was evaluated in extensive simulations (SparseDOSSA2) varying the number of taxa $D$, the sample size $N$, the sequencing depth $\lambda$, and the proportion of true associations, and compared against ALDEx2 (CLR and default SSRV), DESeq2, edgeR, limma-voom, baySeq, MiMIX, ANCOM, ANCOM-BC/BC2, CKF, and LinDA. The metrics included false discovery rate (FDR), true positive rate (TPR), and the $F_{0.5}$ score.

| Scenario | Sparse SSRV FDR | Sparse SSRV TPR | Competitor notes |
| --- | --- | --- | --- |
| All simulations | < 0.05 | Increases with $D$ and $N$ | Outperformed or matched the best methods |
| Large $D$ | < 0.05 | — | Only LinDA ran faster |

On real data, six case studies with known or gold-standard ground truths demonstrated:

  • In sparse/low-variance settings, near-perfect FDR and highest TPR.
  • In sparse/high-variance settings, zero false positives and conservative positive detection.
  • In dense settings, moderate FDR with some power loss for increased robustness.

6. Context and Implications

The Sparse Bayesian Partially Identified Model, or Sparse SSRV, advances differential analysis for sequence count data by incorporating uncertainty in both scale and sparsity. Unlike normalization-based methods that implicitly impose strong, often biologically implausible assumptions, and traditional sparse estimators that neglect uncertainty in sparsity, Sparse SSRV propagates all sources of uncertainty throughout the inference procedure. Type I error is controlled under violations of scale invariance, and theoretical consistency is achieved under minimal and biologically plausible assumptions (Gu et al., 12 Dec 2025).

A plausible implication is that this framework is extendable to other compositional and partially identified problems facing similar nonidentifiability and sparsity challenges. The integration of prior structure using empirical mode estimation and the explicit modeling of posterior uncertainty in hyperparameters provides a template for robust inference in high-dimensional, compositional genomics contexts.
