
Sparse Bayesian Partially Identified Model

Updated 17 December 2025
  • Sparse Bayesian Partially Identified Model is a statistical framework for differential analysis of high-dimensional sequence count data that explicitly models uncertainty in the absolute abundance scale and sparsity.
  • It extends the Scale-Reliant Inference framework by incorporating sparse effects and using empirical mode estimation to address nonidentifiability in compositional genomics data.
  • Empirical evaluations demonstrate that Sparse SSRV achieves robust FDR and TPR performance, outperforming or matching established methods in both simulated and real-world genomic studies.

A Sparse Bayesian Partially Identified Model (Sparse SSRV) is a statistical framework introduced for differential analysis of high-dimensional sequence count data, principally in genomics and metagenomics. The model addresses compositionality—a fundamental challenge in sequence count analysis—by embracing the nonidentifiability of absolute abundance scale and explicitly modeling uncertainty in both scaling and sparsity assumptions. The method extends the Scale-Reliant Inference (SRI) framework to incorporate sparse effect structures, providing theoretically consistent inference under scale uncertainty without imposing biologically implausible normalization constraints (Gu et al., 12 Dec 2025).

1. Statistical Formulation and Identifiability

Consider $Y_{dn}$ as the observed read count for taxon $d$ ($d=1,\ldots,D$) and sample $n$ ($n=1,\ldots,N$). The model posits latent absolute abundances $W_{dn}$, factorized as

$$W_{dn} = W^{\parallel}_{dn} \cdot W^{\perp}_n,$$

where $W^{\parallel}_{\cdot n}$ is a vector on the $D$-simplex—the proportions of taxa in sample $n$—and $W^{\perp}_n > 0$ captures the total absolute load (scale) for sample $n$. The likelihood is specified as

$$Y_{\cdot n} \mid W^{\parallel}_{\cdot n} \sim \text{Multinomial}(\lambda_n, W^{\parallel}_{\cdot n})$$

with prior $W^{\parallel}_{\cdot n} \sim \text{Dirichlet}(\alpha\, 1_D)$. Alternatively, Poisson-distributed counts may be used, preserving the rank-1 structure.

The target estimand for a binary condition $x_n \in \{0,1\}$ is the log-fold change for each taxon,

$$\theta_d = \mathbb{E}_{n : x_n = 1}[\log W_{dn}] - \mathbb{E}_{n : x_n = 0}[\log W_{dn}].$$

Substituting the factorization, the parameter of interest admits a rank-1 decomposition,

$$\theta = \theta^{\parallel} + \theta^{\perp}\, 1_D,$$

where

$$\theta^{\parallel}_d = \mathbb{E}[\log W^{\parallel}_{d\cdot} \mid x=1] - \mathbb{E}[\log W^{\parallel}_{d\cdot} \mid x=0], \qquad \theta^{\perp} = \mathbb{E}[\log W^{\perp} \mid x=1] - \mathbb{E}[\log W^{\perp} \mid x=0].$$

Sequence data do not inform $W^{\perp}_n$; hence $\theta^{\perp}$ is nonidentifiable without prior information, and only the set $\{\theta^{\parallel} + c\, 1_D : c \in \mathbb{R}\}$ is identified.
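This nonidentifiability can be seen numerically: rescaling each sample's absolute abundances by an arbitrary positive constant leaves the observed proportions—and hence the multinomial likelihood—unchanged. A minimal NumPy sketch with hypothetical toy values:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: D taxa, N samples, latent absolute abundances W.
D, N = 5, 4
W = rng.gamma(shape=2.0, scale=10.0, size=(D, N))

# Multiply each sample's abundances by an arbitrary positive scale c_n;
# the per-sample proportions (multinomial probabilities) do not change.
c = rng.uniform(0.5, 2.0, size=N)
W_scaled = W * c  # broadcasts c over taxa

P = W / W.sum(axis=0)
P_scaled = W_scaled / W_scaled.sum(axis=0)
assert np.allclose(P, P_scaled)

# Since the scale component shifts every taxon's log-fold change equally,
# only theta_parallel + c * 1_D is identified from the counts Y.
```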

2. Sparse Modeling and Prior Specification

Sparsity in this context refers to the expectation that only a minority of taxa display nonzero log-fold changes between conditions. The model treats the $\theta^{\parallel}_d$ as i.i.d. draws from a continuous density $g^{\parallel}$ with a unique mode. Sparsity is operationalized by imposing

$$\theta^{\perp} = -\text{mode}(g^{\parallel}),$$

so that, after shifting, the bulk of compositional effects is centered at zero. A prior is placed on $\theta^{\perp}$ conditional on $\theta^{\parallel}$ using a consistent estimator of the mode—typically Parzen's kernel estimator,

$$\psi(\theta^{\parallel}) = \underset{t \in [t_l, t_u]}{\arg\max}\; p_D(t; \theta^{\parallel}), \qquad p_D(t; \theta^{\parallel}) = \frac{1}{D} \sum_{d=1}^{D} K_h(t - \theta^{\parallel}_d).$$

The scale prior $p(\theta^{\perp} \mid \theta^{\parallel})$ is centered at $-\psi(\theta^{\parallel})$, with variance chosen via Laplace approximation or bootstrap to reflect uncertainty in the sparsity assumption.
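The Parzen mode estimator $\psi$ can be sketched in a few lines. The Gaussian kernel, the Silverman rule-of-thumb bandwidth default, and the grid argmax below are illustrative choices, not necessarily the paper's exact settings:

```python
import numpy as np

def parzen_mode(theta_par, h=None, grid_size=512):
    """Kernel (Parzen) estimate of the mode of the theta_parallel values.

    A minimal sketch: Gaussian kernel, grid argmax over the observed range,
    Silverman's rule-of-thumb bandwidth by default.
    """
    theta_par = np.asarray(theta_par, dtype=float)
    D = theta_par.size
    if h is None:  # Silverman's rule-of-thumb bandwidth
        h = 1.06 * theta_par.std() * D ** (-1 / 5)
    t = np.linspace(theta_par.min(), theta_par.max(), grid_size)
    # p_D(t) proportional to sum_d K_h(t - theta_d) with Gaussian K;
    # constants do not affect the argmax.
    dens = np.exp(-0.5 * ((t[:, None] - theta_par[None, :]) / h) ** 2).sum(axis=1)
    return t[np.argmax(dens)]

# Sparse effects: most taxa near zero, a few large shifts.
rng = np.random.default_rng(1)
theta_par = np.concatenate([rng.normal(0.0, 0.1, 90), rng.normal(3.0, 0.2, 10)])
mode_hat = parzen_mode(theta_par)  # near 0: the sparse minority is ignored
```

Because the bulk of the $\theta^{\parallel}_d$ sits near a common value, the kernel density peaks there, and the few truly associated taxa barely move the argmax—this robustness is what makes the mode a natural anchor for the scale prior.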

3. Inference via Scale-Reliant Inference (SRI)

Posterior estimation proceeds through the SRI framework:

  • The posterior factorizes as $p(\theta^{\parallel}, \theta^{\perp} \mid Y) = p(\theta^{\parallel} \mid Y)\, p(\theta^{\perp} \mid \theta^{\parallel})$, by the conditional independence $\theta^{\perp} \perp Y \mid \theta^{\parallel}$.
  • Draw posterior samples $W^{\parallel(s)}_{\cdot n} \sim \text{Dirichlet}(Y_{\cdot n} + \alpha\, 1_D)$ for each sample $n$.
  • For each draw, compute $\theta^{\parallel(s)}$ by averaging $\log W^{\parallel(s)}_{d\cdot}$ separately over samples with $x_n = 1$ and $x_n = 0$ and differencing.
  • Estimate the mode shift $s^{(s)} = \psi(\theta^{\parallel(s)})$.
  • Sample $\theta^{\perp(s)} \sim N(-s^{(s)}, \tau^2)$, where $\tau^2$ is estimated from the local curvature of the Parzen mode or via bootstrap.
  • Combine to yield $\theta^{(s)} = \theta^{\parallel(s)} + \theta^{\perp(s)}\, 1_D$, producing posterior draws for inference.
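The steps above can be sketched end to end as follows. This is a minimal illustration, not the authors' implementation: it assumes a Gaussian kernel with a Silverman bandwidth for $\psi$ and a user-supplied $\tau$ in place of the curvature or bootstrap estimate:

```python
import numpy as np

def sparse_ssrv_draws(Y, x, alpha=0.5, tau=0.05, S=200, seed=0):
    """Posterior draws of theta via the factorized SRI scheme (sketch).

    Y    : (D, N) count matrix; x : (N,) binary condition labels.
    alpha: Dirichlet prior concentration; tau: scale-prior s.d. (assumed fixed).
    Returns an (S, D) array of posterior draws of theta.
    """
    rng = np.random.default_rng(seed)
    D, N = Y.shape
    x = np.asarray(x)
    draws = np.empty((S, D))
    for s in range(S):
        # Step 1: posterior compositions per sample (Dirichlet conjugacy).
        W_par = np.stack(
            [rng.dirichlet(Y[:, n] + alpha) for n in range(N)], axis=1
        )
        logW = np.log(W_par)
        # Step 2: compositional log-fold changes.
        theta_par = logW[:, x == 1].mean(axis=1) - logW[:, x == 0].mean(axis=1)
        # Step 3: Parzen mode of theta_par (Gaussian kernel, grid argmax).
        h = 1.06 * theta_par.std() * D ** (-1 / 5) + 1e-12
        t = np.linspace(theta_par.min(), theta_par.max(), 256)
        dens = np.exp(-0.5 * ((t[:, None] - theta_par[None, :]) / h) ** 2).sum(axis=1)
        mode_s = t[np.argmax(dens)]
        # Step 4: draw the nonidentified scale shift, centered at -mode.
        theta_perp = rng.normal(-mode_s, tau)
        # Step 5: combine into a full posterior draw of theta.
        draws[s] = theta_par + theta_perp
    return draws
```

Note how scale uncertainty enters only through step 4: widening $\tau$ inflates every taxon's posterior interval equally, which is exactly how the model propagates doubt about the sparsity assumption into the final inference.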

4. Theoretical Properties

Under multinomial–Dirichlet regularity, with $g^{\parallel}$ uniformly continuous and uniquely moded, and standard assumptions on the kernel bandwidth, $h(D) \to 0$ and $D\, h(D)^2 \to \infty$, and on the sequencing depth, $\lambda \to \infty$:

  • The composition posterior concentrates: $W^{\parallel}_{dn} \mid Y_{\cdot n} \to_p W^{\parallel,*}_{dn}$.
  • The compositional effect posterior concentrates: $\theta^{\parallel}_d \mid Y \to_p \theta^{\parallel,*}_d$.
  • The Parzen mode estimator $\psi(\theta^{\parallel})$ is consistent for $\text{mode}(g^{\parallel})$.
  • The Bayes estimator $\bar{\theta} = \mathbb{E}[\theta \mid Y]$ is consistent: $\bar{\theta} \to_p \theta^*$ as $\lambda \to \infty$ and $D \to \infty$.
  • Posterior consistency holds: if $\text{Var}(\theta^{\perp} \mid \theta^{\parallel}) \to 0$, then $\theta_d \mid Y \to_p \theta^*_d$ for each taxon $d$.

5. Empirical Performance and Comparative Evaluation

The Sparse SSRV estimator was evaluated in extensive simulations (SparseDOSSA2) varying the number of taxa $D$, the sample size $N$, the sequencing depth $\lambda$, and the proportion of true associations, and compared against ALDEx2 (CLR and default SSRV), DESeq2, edgeR, limma-voom, baySeq, MiMIX, ANCOM, ANCOM-BC/BC2, CKF, and LinDA. The metrics included false discovery rate (FDR), true positive rate (TPR), and the $F_{0.5}$ score.

| Scenario | Sparse SSRV FDR | Sparse SSRV TPR | Competitor notes |
| --- | --- | --- | --- |
| All simulations | < 0.05 | Increases with $D$ and $N$ | Outperformed or matched the best methods |
| Large $D$ | < 0.05 | — | Only LinDA ran faster |

On real data, six case studies with known or gold-standard ground truths demonstrated:

  • In sparse/low-variance settings, near-perfect FDR and highest TPR.
  • In sparse/high-variance settings, zero false positives and conservative positive detection.
  • In dense settings, moderate FDR with some power loss for increased robustness.

6. Context and Implications

The Sparse Bayesian Partially Identified Model, or Sparse SSRV, advances differential analysis for sequence count data by incorporating uncertainty in both scale and sparsity. Unlike normalization-based methods that implicitly impose strong, often biologically implausible assumptions, and traditional sparse estimators that neglect uncertainty in sparsity, Sparse SSRV propagates all sources of uncertainty throughout the inference procedure. Type I error is controlled under violations of scale invariance, and theoretical consistency is achieved under minimal and biologically plausible assumptions (Gu et al., 12 Dec 2025).

A plausible implication is that this framework is extendable to other compositional and partially identified problems facing similar nonidentifiability and sparsity challenges. The integration of prior structure using empirical mode estimation and the explicit modeling of posterior uncertainty in hyperparameters provides a template for robust inference in high-dimensional, compositional genomics contexts.
