Bayesian Information Criterion

Updated 10 November 2025
  • Bayesian Information Criterion is a model selection method that penalizes model complexity by incorporating the number of free parameters and sample size.
  • It uses Laplace’s approximation to estimate the log-marginal likelihood, favoring simpler models when fits are comparable.
  • Extensions of BIC adapt to latent variables, singular models, and complex data structures, improving consistency across diverse applications.

The Bayesian Information Criterion (BIC) is a model selection criterion derived as an asymptotic approximation to the log-marginal likelihood or Bayesian model evidence. BIC penalizes model complexity using a term based on the number of free parameters and the sample size, balancing fit against parsimony. BIC and its various extensions, including those for singular models, latent variable models, high-dimensional regimes, and correlated or incomplete data, are widely used across disciplines—ranging from time series to clustering, network analysis, and PDE discovery.

1. Derivation and Classical Formulation

BIC was introduced by Schwarz as an asymptotic expansion of the log-marginal likelihood under regularity and large-sample assumptions. For i.i.d. data $D = \{x_1, \ldots, x_N\}$ and a parametric model $p(x \mid \theta)$ with $d$ free parameters, BIC is derived from Laplace's method applied to the integral

$$p(D \mid M) = \int p(D \mid \theta, M)\, p(\theta \mid M)\, d\theta,$$

yielding, for large $N$,

$$\log p(D \mid M) \approx \log p(D \mid \hat\theta, M) - \frac{d}{2} \log N + \mathcal{O}(1),$$

where $\hat\theta$ is the MLE. This motivates the criterion

$$\mathrm{BIC}(M) = \log p(D \mid \hat\theta, M) - \frac{d}{2} \log N.$$

An equivalent form used in many applications, rescaled by $-2$ so that smaller values are preferred, is

$$\mathrm{BIC} = -2 \log L(\hat\theta) + d \log N,$$

where $L(\hat\theta)$ is the maximized likelihood (Geiger et al., 2013, Danek et al., 2020, Thanasutives et al., 23 Apr 2024). The penalty term $d \log N$ grows with the sample size, so additional parameters must yield a correspondingly larger improvement in fit, which controls overfitting.
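To make the $-2\log L(\hat\theta) + d\log N$ form concrete, the following minimal sketch (assuming a linear-Gaussian model fit by ordinary least squares; the function and variable names are illustrative, not taken from the cited papers) scores polynomial regressions of increasing order and keeps the one with the lowest BIC.

```python
import numpy as np

def gaussian_bic(y, X):
    """BIC = -2 log L(theta_hat) + d log N for y ~ N(X beta, sigma^2 I).
    Free parameters d: the regression coefficients plus the noise variance."""
    N, k = X.shape
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta_hat
    sigma2_hat = resid @ resid / N                      # MLE of the noise variance
    loglik = -0.5 * N * (np.log(2 * np.pi * sigma2_hat) + 1)
    d = k + 1
    return -2.0 * loglik + d * np.log(N)

# Compare polynomial orders on data generated from a quadratic trend.
rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 200)
y = 1.0 + 2.0 * x - 1.5 * x**2 + rng.normal(scale=0.3, size=x.size)

scores = {deg: gaussian_bic(y, np.vander(x, deg + 1)) for deg in range(1, 7)}
print(min(scores, key=scores.get), scores)              # lowest BIC wins; typically degree 2
```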

2. Extensions to Hidden Variables and Nonregular Structures

Standard BIC applies to regular models where parameter mappings are identifiable and the Fisher information is nonsingular. In models with hidden variables (e.g., Bayesian networks with latent nodes), the effective model dimension is not generally equal to the raw parameter count. Instead, the penalty should use the rank of the Jacobian $J(\theta) = \partial W / \partial \theta$ of the mapping from model parameters $\theta$ to observables $W$, giving $\dim(S, \hat\theta) = \mathrm{rank}\, J(\hat\theta)$ and

$$\mathrm{BIC}(S) = \log p(D \mid S, \hat\theta_S) - \frac{1}{2} \dim(S, \hat\theta_S) \log N + \mathcal{O}(1)$$

(Geiger et al., 2013). For instance, in naive Bayes with a hidden root and $n$ binary observables, the effective dimension is $1 + 2n$, not the full parameter count.
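As a hedged illustration of the rank-based effective dimension, the sketch below uses a toy latent-class model with a binary hidden root and only two binary observables (a smaller configuration than the example above, chosen so the rank deficiency is visible): the observable joint has just three free cells, so the numerically estimated Jacobian rank caps the effective dimension at 3 even though there are 5 raw parameters.

```python
import numpy as np

def observable_probs(theta):
    """Map latent-class parameters (pi, a1, a2, b1, b2) to the free cells of the
    observable joint P(X1, X2) for a binary hidden root H and two binary observables."""
    pi, a1, a2, b1, b2 = theta
    cells = []
    for x1 in (0, 1):
        for x2 in (0, 1):
            p = (pi * a1**x1 * (1 - a1)**(1 - x1) * a2**x2 * (1 - a2)**(1 - x2)
                 + (1 - pi) * b1**x1 * (1 - b1)**(1 - x1) * b2**x2 * (1 - b2)**(1 - x2))
            cells.append(p)
    return np.array(cells[:-1])                  # drop one cell; the four sum to 1

def numerical_jacobian(f, theta, eps=1e-6):
    """Forward-difference Jacobian of f at theta."""
    theta = np.asarray(theta, dtype=float)
    f0 = f(theta)
    J = np.zeros((f0.size, theta.size))
    for j in range(theta.size):
        step = np.zeros_like(theta)
        step[j] = eps
        J[:, j] = (f(theta + step) - f0) / eps
    return J

theta_hat = np.array([0.4, 0.7, 0.6, 0.2, 0.3])   # a generic interior point
J = numerical_jacobian(observable_probs, theta_hat)
eff_dim = np.linalg.matrix_rank(J)
penalty = 0.5 * eff_dim * np.log(1000)            # rank-based BIC penalty for N = 1000
print(eff_dim, penalty)                           # effective dimension 3, not 5
```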

In singular models (e.g., mixture models, factor analysis with redundant factors, reduced-rank regression), the true marginal likelihood has the asymptotic form

$$\log m(X \mid M) = \ell_N(\hat\theta) - \lambda \log N + (m-1)\log \log N + \mathcal{O}_p(1),$$

where $\lambda$ is the real log canonical threshold (RLCT) and $m$ is its multiplicity; for regular models, $\lambda = d/2$ and $m = 1$ (Drton et al., 2013, Watanabe, 2012). The "singular BIC" (sBIC) replaces $d/2$ in the penalty with $\lambda$, correcting for the local geometry of the model near singularities.

3. Adjustments for Incomplete, Correlated, or High-dimensional Data

BIC assumes all parameters are informed by $N$ i.i.d. draws. In incomplete datasets, as in factor analysis with missing-at-random entries, only $N_i \leq N$ samples may inform each variable $x_i$. The hierarchical BIC (HBIC) adjusts the penalty,

$$\mathrm{HBIC} = \sum_{n=1}^N \log p(x_n^{\mathrm{obs}} \mid \hat\theta) - \sum_{i=1}^d \frac{d_i}{2} \log N_i,$$

where $d_i$ and $N_i$ are the number of parameters and the observed-sample count associated with $x_i$ (Zhao et al., 2022). HBIC is derived as a large-sample approximation to the variational Bayesian lower bound and asymptotically reduces to BIC as missingness vanishes.
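The penalty side of HBIC can be sketched directly from a data matrix with missing entries, as below; the observed-data log-likelihood is assumed to come from an external fit (the placeholder value and names are illustrative, not from the cited paper).

```python
import numpy as np

def hbic_penalty(data_with_nan, params_per_variable):
    """sum_i (d_i / 2) log N_i, where N_i counts the non-missing entries of variable i."""
    N_i = np.sum(~np.isnan(data_with_nan), axis=0)
    return 0.5 * np.sum(np.asarray(params_per_variable) * np.log(N_i))

# 500 x 4 data matrix with roughly 30% of entries missing at random, 3 parameters per variable.
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 4))
X[rng.random(X.shape) < 0.3] = np.nan

loglik_obs = -2750.0                                  # placeholder: sum_n log p(x_n^obs | theta_hat)
hbic = loglik_obs - hbic_penalty(X, [3, 3, 3, 3])
bic = loglik_obs - 0.5 * 12 * np.log(X.shape[0])      # standard BIC charges every block the full N
print(hbic, bic)                                      # HBIC penalizes less because each N_i < N
```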

For clustered or longitudinal data in linear mixed-effects models, where observations are correlated and the i.i.d. assumption fails, the effective sample size $n_e$ replaces $n$ in the penalty: $\mathrm{BIC}_{n_e} = -2\log L(\hat\theta) + p \log n_e$, with $n_e = \sum_{i,j=1}^{n} (R^{-1})_{ij}$, where $R$ is the correlation matrix of the observations (Shen et al., 2021). This adjustment aligns the penalty with the Fisher information content under dependence.
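A hedged sketch of this correction, for a single equicorrelated cluster (the correlation structure and function names are illustrative):

```python
import numpy as np

def effective_sample_size(R):
    """n_e = sum of all entries of R^{-1}, with R the n x n correlation matrix of the observations."""
    return float(np.sum(np.linalg.inv(R)))

def bic_effective(loglik, p, R):
    """BIC with the effective sample size replacing n in the penalty."""
    return -2.0 * loglik + p * np.log(effective_sample_size(R))

# Equicorrelated cluster of n = 50 observations with pairwise correlation 0.4:
n, rho = 50, 0.4
R = (1 - rho) * np.eye(n) + rho * np.ones((n, n))
print(effective_sample_size(R))   # about 2.4, far below n, so the penalty shrinks under strong dependence
```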

In high-dimensional regression, a mixture-prior information criterion (MPIC) fuses the traditional BIC and AIC according to the sample size and covariate dimension:
$$\mathrm{MPIC}_{\mathrm{Approx}}(j) = -2 \log f(Y \mid X_j \widehat\beta_j, \widehat\Sigma_j) + \frac{np(2k_j+p+1)}{n - p - k_j - 1} - 2 \log w_j,$$
with $w_j$ a mixing weight adapted to $(n, p)$ (Kono et al., 2022).

4. Singular Models and Widely Applicable Formulations

In models for which the Fisher information is singular or the mapping from parameters to distributions is non-injective, standard BIC under- or over-penalizes. Watanabe's Widely Applicable BIC (WBIC) approximates the marginal likelihood by evaluating the expected log-likelihood under a "tempered" posterior at inverse temperature $1/\log n$:
$$\mathrm{WBIC}_n = -E_{w \sim p_{1/\log n}(\cdot \mid D_n)} \left[ \log \prod_{i=1}^n p(X_i \mid w) \right].$$
Under mild regularity, $\mathrm{WBIC} = n L(w_0) + \lambda \log n + O_p(\log \log n)$, matching the two leading terms of the singular-model expansion (Watanabe, 2012, Friel et al., 2015). The learning coefficient $\lambda$ replaces the fixed $d/2$, accounting for model nonregularity. WBIC is consistent for both regular and algebraically singular model classes.

For further bias correction of WBIC, an explicit adjustment subtracts a singular-fluctuation term $\nu(1/\log n)$ based on the variance of the log-likelihood under the tempered posterior (Imai, 2019).
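A minimal sketch of the WBIC recipe, assuming a conjugate normal-mean model so that the tempered posterior can be sampled exactly (in general it would be sampled by MCMC; the names and hyperparameters here are illustrative):

```python
import numpy as np

def wbic_normal_mean(x, sigma=1.0, tau=10.0, n_draws=5000, rng=None):
    """WBIC for x_i ~ N(mu, sigma^2) with prior mu ~ N(0, tau^2): draw mu from the
    posterior tempered at beta = 1/log(n), then average the full-data negative
    log-likelihood over those draws."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x, dtype=float)
    n = x.size
    beta = 1.0 / np.log(n)                               # WBIC inverse temperature
    # Conjugacy: the tempered posterior of mu is Gaussian.
    post_prec = 1.0 / tau**2 + beta * n / sigma**2
    post_mean = (beta * x.sum() / sigma**2) / post_prec
    mu = rng.normal(post_mean, np.sqrt(1.0 / post_prec), size=n_draws)
    # Full-data log-likelihood at each draw, via sufficient statistics.
    ss = ((x - x.mean())**2).sum()
    loglik = (-0.5 * n * np.log(2 * np.pi * sigma**2)
              - 0.5 * (ss + n * (x.mean() - mu)**2) / sigma**2)
    return -loglik.mean()                                # approximates -log p(D | M)

rng = np.random.default_rng(0)
data = rng.normal(0.3, 1.0, size=500)
print(wbic_normal_mean(data, rng=rng))
```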

Applications to model selection in autoregressive models, clustering, and community detection in networks have motivated further extensions:

  • Autoregression: The “bridge criterion” interpolates between BIC (consistent when the true order is finite) and AIC (prediction-efficient when the process order is infinite or unknown):

$$\mathrm{bic}(N,L) = \log \hat{e}_L + \frac{L\log N}{N}$$

(Ding et al., 2015). The bridge criterion is adaptive, achieving the consistency of BIC when the true order is finite and the predictive efficiency of AIC otherwise; a minimal numerical sketch of this criterion appears after this list.

  • Clustering: For partitioning normally distributed data, the closed-form "exact BIC" adjusts the penalty to $\sum_g p \log n_g$ over cluster sizes $n_g$, in contrast to the global $Kp\log n$ used in standard BIC, thus avoiding over- or under-penalization when clusters are small or unbalanced (Webster, 2020).
  • Network Models: In stochastic block models (SBMs), the corrected BIC (CBIC) adds a further $\lambda n\log k$ penalty, reflecting the combinatorial complexity of community assignments and maintaining consistency in detecting the number of communities, particularly in sparse or heterogeneous networks (Hu et al., 2016).
  • Uncertainty-Penalized BIC: In data-driven PDE discovery, the uncertainty-penalized information criterion (UBIC) modifies BIC to

$$\mathrm{UBIC}_\Phi(\xi, U_\xi) = -2\log \mathcal{L}(\Phi,\xi) + \log(N)\,(\|\xi\|_0 + U_\xi),$$

where $U_\xi$ quantifies estimator uncertainty; UBIC can be interpreted as BIC applied to an overparameterized design (Thanasutives et al., 23 Apr 2024).
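The autoregressive $\mathrm{bic}(N,L)$ from the first bullet above can be sketched as follows, fitting each candidate order by least squares on lagged values (simulated data; the names are illustrative):

```python
import numpy as np

def ar_bic(x, L):
    """bic(N, L) = log(residual variance of the fitted AR(L) model) + L log(N) / N."""
    x = np.asarray(x, dtype=float)
    N = x.size
    # Lag matrix: row t holds (x_{t-1}, ..., x_{t-L}).
    X = np.column_stack([x[L - k - 1: N - k - 1] for k in range(L)])
    y = x[L:]
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    e_L = np.mean((y - X @ coef)**2)                   # residual variance of the AR(L) fit
    return np.log(e_L) + L * np.log(N) / N

# Select the AR order for data simulated from an AR(2) process.
rng = np.random.default_rng(0)
x = np.zeros(2000)
for t in range(2, x.size):
    x[t] = 0.6 * x[t - 1] - 0.3 * x[t - 2] + rng.normal()
print(min(range(1, 9), key=lambda L: ar_bic(x, L)))    # typically selects order 2
```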

5. Assumptions, Consistency, and Practical Recommendations

BIC derivations rely on asymptotic ($N \to \infty$) and regularity assumptions:

  • i.i.d. sampling or a relevant adaptation thereof (e.g., effective sample size)
  • Likelihood smoothness and unique (or ridge) MLE
  • Prior positive and smooth near the MLE
  • For extended/latent/hidden-variable models: analytic/polynomial parameter mappings and identifiability of parameter-to-observable transformations

Consistency of BIC and many extensions is established under fixed model classes and standard regularity. For singular models, consistency depends on the learning coefficient and the singular structure, but properly penalized criteria (WBIC, sBIC) remain consistent under the relevant algebraic-geometric assumptions.

When data are incomplete or correlated, variable-specific sample size or effective sample size corrections are essential to avoid over-penalizing or underestimating model complexity. In high-dimensional contexts, the choice of regularization prior or mixture weights tunes BIC-like criteria between BIC and AIC-like behavior, providing robust selection across regimes.

For model selection in latent variable networks and singular models, using the Jacobian rank or the real log canonical threshold (RLCT) ensures that only truly identifiable parameter combinations are penalized, correcting the biases that arise from naive parameter counting.

6. Comparative Performance and Application Scenarios

Empirical results across synthetic and real datasets demonstrate:

  • BIC and its variants are strongly conservative and prevent overfitting in low-dimensional, well-specified settings.
  • AIC and related “light penalty” measures favor predictive efficiency in nonparametric and misspecified regimes but tend to overfit finite models.
  • Novel adaptive criteria and principled penalty corrections (HBIC, UBIC, bridge criterion, MPIC, sBIC) allow models to adapt their complexity penalties to data structure, missingness, dependence, or high-dimensionality.
  • In clustering and community detection, exact or corrected BIC expressions prevent the pathologies of standard BIC in small cluster or large community-number regimes.

In summary, BIC and its extensive extensions provide a rigorous asymptotic foundation for model selection in a broad array of parametric, semiparametric, latent variable, singular, high-dimensional, and non-independently sampled data contexts. Careful adaptation of the penalty term to effective dimension, sample information, and model singularity structure is critical for robust and consistent model selection.
