Bayesian Information Criterion

Updated 10 November 2025
  • Bayesian Information Criterion is a model selection method that penalizes model complexity by incorporating the number of free parameters and sample size.
  • It uses Laplace’s approximation to estimate the log-marginal likelihood, favoring simpler models when fits are comparable.
  • Extensions of BIC adapt to latent variables, singular models, and complex data structures, improving consistency across diverse applications.

The Bayesian Information Criterion (BIC) is a model selection criterion derived as an asymptotic approximation to the log-marginal likelihood or Bayesian model evidence. BIC penalizes model complexity using a term based on the number of free parameters and the sample size, balancing fit against parsimony. BIC and its various extensions, including those for singular models, latent variable models, high-dimensional regimes, and correlated or incomplete data, are widely used across disciplines—ranging from time series to clustering, network analysis, and PDE discovery.

1. Derivation and Classical Formulation

BIC was introduced by Schwarz as an asymptotic expansion of the log-marginal likelihood under regularity and large-sample assumptions. For i.i.d. data $D = \{x_1, \ldots, x_N\}$ and a parametric model $p(x \mid \theta)$ with $d$ free parameters, BIC is derived from Laplace's method applied to the integral

$$p(D \mid M) = \int p(D \mid \theta, M)\, p(\theta \mid M)\, d\theta,$$

yielding, for large $N$,

$$\log p(D \mid M) \approx \log p(D \mid \hat\theta, M) - \frac{d}{2} \log N + \mathcal{O}(1),$$

where $\hat\theta$ is the MLE. This motivates the criterion

$$\mathrm{BIC}(M) = \log p(D \mid \hat\theta, M) - \frac{d}{2} \log N.$$

An equivalent form used in many applications, rescaled by $-2$ so that smaller values are preferred, is

$$\mathrm{BIC} = -2 \log L(\hat\theta) + d \log N,$$

where $L(\hat\theta)$ is the maximized likelihood (Geiger et al., 2013, Danek et al., 2020, Thanasutives et al., 23 Apr 2024). The penalty term $d \log N$ grows with the sample size, so additional parameters must yield a correspondingly larger improvement in fit, which controls overfitting.
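To make the $-2\log L(\hat\theta) + d\log N$ form concrete, the following minimal sketch (assuming a linear-Gaussian model fit by ordinary least squares; the function and variable names are illustrative, not taken from the cited papers) scores polynomial regressions of increasing order and keeps the one with the lowest BIC.

```python
import numpy as np

def gaussian_bic(y, X):
    """BIC = -2 log L(theta_hat) + d log N for y ~ N(X beta, sigma^2 I).
    Free parameters d: the regression coefficients plus the noise variance."""
    N, k = X.shape
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta_hat
    sigma2_hat = resid @ resid / N                      # MLE of the noise variance
    loglik = -0.5 * N * (np.log(2 * np.pi * sigma2_hat) + 1)
    d = k + 1
    return -2.0 * loglik + d * np.log(N)

# Compare polynomial orders on data generated from a quadratic trend.
rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 200)
y = 1.0 + 2.0 * x - 1.5 * x**2 + rng.normal(scale=0.3, size=x.size)

scores = {deg: gaussian_bic(y, np.vander(x, deg + 1)) for deg in range(1, 7)}
print(min(scores, key=scores.get), scores)              # lowest BIC wins; typically degree 2
```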

2. Extensions to Hidden Variables and Nonregular Structures

Standard BIC applies to regular models where parameter mappings are identifiable and the Fisher information is nonsingular. In models with hidden variables (e.g., Bayesian networks with latent nodes), the effective model dimension is not generally equal to the raw parameter count. Instead, the penalty should use the rank of the Jacobian $J(\theta) = \partial W / \partial \theta$ of the mapping from model parameters $\theta$ to observables $W$, giving $\dim(S, \hat\theta) = \mathrm{rank}\, J(\hat\theta)$ and

$$\mathrm{BIC}(S) = \log p(D \mid S, \hat\theta_S) - \frac{1}{2} \dim(S, \hat\theta_S) \log N + \mathcal{O}(1)$$

(Geiger et al., 2013). For instance, in naive Bayes with a hidden root and $n$ binary observables, the effective dimension is $1 + 2n$, not the full parameter count.
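As a hedged illustration of the rank-based effective dimension, the sketch below uses a toy latent-class model with a binary hidden root and only two binary observables (a smaller configuration than the example above, chosen so the rank deficiency is visible): the observable joint has just three free cells, so the numerically estimated Jacobian rank caps the effective dimension at 3 even though there are 5 raw parameters.

```python
import numpy as np

def observable_probs(theta):
    """Map latent-class parameters (pi, a1, a2, b1, b2) to the free cells of the
    observable joint P(X1, X2) for a binary hidden root H and two binary observables."""
    pi, a1, a2, b1, b2 = theta
    cells = []
    for x1 in (0, 1):
        for x2 in (0, 1):
            p = (pi * a1**x1 * (1 - a1)**(1 - x1) * a2**x2 * (1 - a2)**(1 - x2)
                 + (1 - pi) * b1**x1 * (1 - b1)**(1 - x1) * b2**x2 * (1 - b2)**(1 - x2))
            cells.append(p)
    return np.array(cells[:-1])                  # drop one cell; the four sum to 1

def numerical_jacobian(f, theta, eps=1e-6):
    """Forward-difference Jacobian of f at theta."""
    theta = np.asarray(theta, dtype=float)
    f0 = f(theta)
    J = np.zeros((f0.size, theta.size))
    for j in range(theta.size):
        step = np.zeros_like(theta)
        step[j] = eps
        J[:, j] = (f(theta + step) - f0) / eps
    return J

theta_hat = np.array([0.4, 0.7, 0.6, 0.2, 0.3])   # a generic interior point
J = numerical_jacobian(observable_probs, theta_hat)
eff_dim = np.linalg.matrix_rank(J)
penalty = 0.5 * eff_dim * np.log(1000)            # rank-based BIC penalty for N = 1000
print(eff_dim, penalty)                           # effective dimension 3, not 5
```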

In singular models (e.g., mixture models, factor analysis with redundant factors, reduced-rank regression), the true marginal likelihood has the asymptotic form

$$\log m(X \mid M) = \ell_N(\hat\theta) - \lambda \log N + (m-1)\log \log N + \mathcal{O}_p(1),$$

where $\lambda$ is the real log canonical threshold (RLCT) and $m$ is its multiplicity; for regular models, $\lambda = d/2$ and $m = 1$ (Drton et al., 2013, Watanabe, 2012). The "singular BIC" (sBIC) replaces $d/2$ in the penalty with $\lambda$, correcting for the local geometry of the model near singularities.

3. Adjustments for Incomplete, Correlated, or High-dimensional Data

BIC assumes all parameters are informed by $N$ i.i.d. draws. In incomplete datasets, as in factor analysis with missing-at-random entries, only $N_i \leq N$ samples may inform each variable $x_i$. The hierarchical BIC (HBIC) adjusts the penalty,

$$\mathrm{HBIC} = \sum_{n=1}^N \log p(x_n^{\mathrm{obs}} \mid \hat\theta) - \sum_{i=1}^d \frac{d_i}{2} \log N_i,$$

where $d_i$ and $N_i$ are the number of parameters and the observed-sample count associated with $x_i$ (Zhao et al., 2022). HBIC is derived as a large-sample approximation to the variational Bayesian lower bound and asymptotically reduces to BIC as missingness vanishes.
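The penalty side of HBIC can be sketched directly from a data matrix with missing entries, as below; the observed-data log-likelihood is assumed to come from an external fit (the placeholder value and names are illustrative, not from the cited paper).

```python
import numpy as np

def hbic_penalty(data_with_nan, params_per_variable):
    """sum_i (d_i / 2) log N_i, where N_i counts the non-missing entries of variable i."""
    N_i = np.sum(~np.isnan(data_with_nan), axis=0)
    return 0.5 * np.sum(np.asarray(params_per_variable) * np.log(N_i))

# 500 x 4 data matrix with roughly 30% of entries missing at random, 3 parameters per variable.
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 4))
X[rng.random(X.shape) < 0.3] = np.nan

loglik_obs = -2750.0                                  # placeholder: sum_n log p(x_n^obs | theta_hat)
hbic = loglik_obs - hbic_penalty(X, [3, 3, 3, 3])
bic = loglik_obs - 0.5 * 12 * np.log(X.shape[0])      # standard BIC charges every block the full N
print(hbic, bic)                                      # HBIC penalizes less because each N_i < N
```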

For clustered or longitudinal data in linear mixed-effects models, where observations are correlated and the i.i.d. assumption fails, the effective sample size $n_e$ replaces $n$ in the penalty: $\mathrm{BIC}_{n_e} = -2\log L(\hat\theta) + p \log n_e$, with $n_e = \sum_{i,j=1}^{n} (R^{-1})_{ij}$, where $R$ is the correlation matrix of the observations (Shen et al., 2021). This adjustment aligns the penalty with the Fisher information content under dependence.
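A hedged sketch of this correction, for a single equicorrelated cluster (the correlation structure and function names are illustrative):

```python
import numpy as np

def effective_sample_size(R):
    """n_e = sum of all entries of R^{-1}, with R the n x n correlation matrix of the observations."""
    return float(np.sum(np.linalg.inv(R)))

def bic_effective(loglik, p, R):
    """BIC with the effective sample size replacing n in the penalty."""
    return -2.0 * loglik + p * np.log(effective_sample_size(R))

# Equicorrelated cluster of n = 50 observations with pairwise correlation 0.4:
n, rho = 50, 0.4
R = (1 - rho) * np.eye(n) + rho * np.ones((n, n))
print(effective_sample_size(R))   # about 2.4, far below n, so the penalty shrinks under strong dependence
```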

In high-dimensional regression, a mixture-prior information criterion (MPIC) fuses the traditional BIC and AIC according to the sample size and covariate dimension:
$$\mathrm{MPIC}_{\mathrm{Approx}}(j) = -2 \log f(Y \mid X_j \widehat\beta_j, \widehat\Sigma_j) + \frac{np(2k_j+p+1)}{n - p - k_j - 1} - 2 \log w_j,$$
with $w_j$ a mixing weight adapted to $(n, p)$ (Kono et al., 2022).

4. Singular Models and Widely Applicable Formulations

In models for which the Fisher information is singular or the mapping from parameters to distributions is non-injective, standard BIC under- or over-penalizes. Watanabe's Widely Applicable BIC (WBIC) approximates the marginal likelihood by evaluating the expected log-likelihood under a "tempered" posterior at inverse temperature $1/\log n$:
$$\mathrm{WBIC}_n = -E_{w \sim p_{1/\log n}(\cdot \mid D_n)} \left[ \log \prod_{i=1}^n p(X_i \mid w) \right].$$
Under mild regularity, $\mathrm{WBIC} = n L(w_0) + \lambda \log n + O_p(\log \log n)$, matching the two leading terms of the singular-model expansion (Watanabe, 2012, Friel et al., 2015). The learning coefficient $\lambda$ replaces the fixed $d/2$, accounting for model nonregularity. WBIC is consistent for both regular and algebraically singular model classes.

For further bias correction of WBIC, an explicit adjustment subtracts a singular-fluctuation term $\nu(1/\log n)$ based on the variance of the log-likelihood under the tempered posterior (Imai, 2019).
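A minimal sketch of the WBIC recipe, assuming a conjugate normal-mean model so that the tempered posterior can be sampled exactly (in general it would be sampled by MCMC; the names and hyperparameters here are illustrative):

```python
import numpy as np

def wbic_normal_mean(x, sigma=1.0, tau=10.0, n_draws=5000, rng=None):
    """WBIC for x_i ~ N(mu, sigma^2) with prior mu ~ N(0, tau^2): draw mu from the
    posterior tempered at beta = 1/log(n), then average the full-data negative
    log-likelihood over those draws."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x, dtype=float)
    n = x.size
    beta = 1.0 / np.log(n)                               # WBIC inverse temperature
    # Conjugacy: the tempered posterior of mu is Gaussian.
    post_prec = 1.0 / tau**2 + beta * n / sigma**2
    post_mean = (beta * x.sum() / sigma**2) / post_prec
    mu = rng.normal(post_mean, np.sqrt(1.0 / post_prec), size=n_draws)
    # Full-data log-likelihood at each draw, via sufficient statistics.
    ss = ((x - x.mean())**2).sum()
    loglik = (-0.5 * n * np.log(2 * np.pi * sigma**2)
              - 0.5 * (ss + n * (x.mean() - mu)**2) / sigma**2)
    return -loglik.mean()                                # approximates -log p(D | M)

rng = np.random.default_rng(0)
data = rng.normal(0.3, 1.0, size=500)
print(wbic_normal_mean(data, rng=rng))
```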

Applications to model selection in autoregressive models, clustering, and community detection in networks have motivated further extensions:

  • Autoregression: The “bridge criterion” interpolates between BIC (consistent when the true order is finite) and AIC (prediction-efficient when the process order is infinite or unknown):

$$\mathrm{bic}(N,L) = \log \hat{e}_L + \frac{L\log N}{N}$$

(Ding et al., 2015). The bridge criterion is adaptive, achieving the consistency of BIC when the true order is finite and the predictive efficiency of AIC otherwise; a minimal numerical sketch of this criterion appears after this list.

  • Clustering: For partitioning normally distributed data, the closed-form "exact BIC" adjusts the penalty to $\sum_g p \log n_g$ over cluster sizes $n_g$, in contrast to the global $Kp\log n$ used in standard BIC, thus avoiding over- or under-penalization when clusters are small or unbalanced (Webster, 2020).
  • Network Models: In stochastic block models (SBMs), the corrected BIC (CBIC) adds a further $\lambda n\log k$ penalty, reflecting the combinatorial complexity of community assignments and maintaining consistency in detecting the number of communities, particularly in sparse or heterogeneous networks (Hu et al., 2016).
  • Uncertainty-Penalized BIC: In data-driven PDE discovery, the uncertainty-penalized information criterion (UBIC) modifies BIC to

$$\mathrm{UBIC}_\Phi(\xi, U_\xi) = -2\log \mathcal{L}(\Phi,\xi) + \log(N)\,(\|\xi\|_0 + U_\xi),$$

where $U_\xi$ quantifies estimator uncertainty; UBIC can be interpreted as BIC applied to an overparameterized design (Thanasutives et al., 23 Apr 2024).
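The autoregressive $\mathrm{bic}(N,L)$ from the first bullet above can be sketched as follows, fitting each candidate order by least squares on lagged values (simulated data; the names are illustrative):

```python
import numpy as np

def ar_bic(x, L):
    """bic(N, L) = log(residual variance of the fitted AR(L) model) + L log(N) / N."""
    x = np.asarray(x, dtype=float)
    N = x.size
    # Lag matrix: row t holds (x_{t-1}, ..., x_{t-L}).
    X = np.column_stack([x[L - k - 1: N - k - 1] for k in range(L)])
    y = x[L:]
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    e_L = np.mean((y - X @ coef)**2)                   # residual variance of the AR(L) fit
    return np.log(e_L) + L * np.log(N) / N

# Select the AR order for data simulated from an AR(2) process.
rng = np.random.default_rng(0)
x = np.zeros(2000)
for t in range(2, x.size):
    x[t] = 0.6 * x[t - 1] - 0.3 * x[t - 2] + rng.normal()
print(min(range(1, 9), key=lambda L: ar_bic(x, L)))    # typically selects order 2
```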

5. Assumptions, Consistency, and Practical Recommendations

BIC derivations rely on asymptotic ($N \to \infty$) and regularity assumptions:

  • i.i.d. sampling or a relevant adaptation thereof (e.g., effective sample size)
  • Likelihood smoothness and unique (or ridge) MLE
  • Prior positive and smooth near the MLE
  • For extended/latent/hidden-variable models: analytic/polynomial parameter mappings and identifiability of parameter-to-observable transformations

Consistency of BIC and many extensions is established under fixed model classes and standard regularity. For singular models, consistency depends on the learning coefficient and the singular structure, but properly penalized criteria (WBIC, sBIC) remain consistent under the relevant algebraic-geometric assumptions.

When data are incomplete or correlated, variable-specific sample size or effective sample size corrections are essential to avoid over-penalizing or underestimating model complexity. In high-dimensional contexts, the choice of regularization prior or mixture weights tunes BIC-like criteria between BIC and AIC-like behavior, providing robust selection across regimes.

For model selection in latent variable networks and singular models, using the Jacobian rank or the real log canonical threshold (RLCT) ensures that only truly identifiable parameter combinations are penalized, correcting the biases that arise from naive parameter counting.

6. Comparative Performance and Application Scenarios

Empirical results across synthetic and real datasets demonstrate:

  • BIC and its variants are strongly conservative and prevent overfitting in low-dimensional, well-specified settings.
  • AIC and related “light penalty” measures favor predictive efficiency in nonparametric and misspecified regimes but tend to overfit finite models.
  • Novel adaptive criteria and principled penalty corrections (HBIC, UBIC, bridge criterion, MPIC, sBIC) allow models to adapt their complexity penalties to data structure, missingness, dependence, or high-dimensionality.
  • In clustering and community detection, exact or corrected BIC expressions prevent the pathologies of standard BIC in small cluster or large community-number regimes.

In summary, BIC and its extensive extensions provide a rigorous asymptotic foundation for model selection in a broad array of parametric, semiparametric, latent variable, singular, high-dimensional, and non-independently sampled data contexts. Careful adaptation of the penalty term to effective dimension, sample information, and model singularity structure is critical for robust and consistent model selection.
