Bayesian Information Criterion
- Bayesian Information Criterion is a model selection method that penalizes model complexity by incorporating the number of free parameters and sample size.
- It uses Laplace’s approximation to estimate the log-marginal likelihood, favoring simpler models when fits are comparable.
- Extensions of BIC adapt to latent variables, singular models, and complex data structures, improving consistency across diverse applications.
The Bayesian Information Criterion (BIC) is a model selection criterion derived as an asymptotic approximation to the log-marginal likelihood or Bayesian model evidence. BIC penalizes model complexity using a term based on the number of free parameters and the sample size, balancing fit against parsimony. BIC and its various extensions, including those for singular models, latent variable models, high-dimensional regimes, and correlated or incomplete data, are widely used across disciplines—ranging from time series to clustering, network analysis, and PDE discovery.
1. Derivation and Classical Formulation
BIC was introduced by Schwarz as an asymptotic expansion of the log-marginal likelihood under regularity and large-sample assumptions. For i.i.d. data $y = (y_1, \dots, y_n)$ and a parametric model $M$ with $k$ free parameters $\theta$, BIC is derived from Laplace's method applied to the integral
$$p(y \mid M) = \int p(y \mid \theta, M)\,\pi(\theta \mid M)\,d\theta,$$
yielding, for large $n$,
$$\log p(y \mid M) = \log p(y \mid \hat\theta, M) - \frac{k}{2}\log n + O_p(1),$$
where $\hat\theta$ is the MLE. This motivates the criterion
$$\mathrm{BIC} = \log p(y \mid \hat\theta, M) - \frac{k}{2}\log n.$$
An equivalent form in many applications (differing only by a factor of $-2$, so that smaller values are preferred) is
$$\mathrm{BIC} = -2\log\hat L + k\log n,$$
where $\hat L = p(y \mid \hat\theta, M)$ is the maximized likelihood (Geiger et al., 2013, Danek et al., 2020, Thanasutives et al., 23 Apr 2024). The penalty term $k\log n$ dominates as $n$ increases, controlling overfitting by penalizing model complexity.
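As a concrete numerical check of the $-2\log\hat L + k\log n$ form, the following sketch compares polynomial regressions of increasing degree on synthetic data; the data-generating setup, variable names, and parameter counts are illustrative assumptions, not taken from the cited works.

```python
import numpy as np

def gaussian_bic(y, y_hat, k):
    """BIC = -2 log L_hat + k log n for a Gaussian regression model whose noise
    variance is profiled out at its MLE (RSS / n); k counts all free parameters,
    including the noise variance."""
    n = len(y)
    rss = np.sum((y - y_hat) ** 2)
    max_log_lik = -0.5 * n * (np.log(2 * np.pi * rss / n) + 1.0)
    return -2.0 * max_log_lik + k * np.log(n)

rng = np.random.default_rng(0)
n = 200
x = rng.uniform(-2, 2, n)
y = 1.0 + 0.5 * x - 0.7 * x**2 + rng.normal(scale=0.3, size=n)   # true degree: 2

for degree in range(1, 6):
    coeffs = np.polyfit(x, y, degree)                # degree + 1 coefficients
    k = degree + 2                                   # coefficients + noise variance
    print(degree, round(gaussian_bic(y, np.polyval(coeffs, x), k), 1))
```

For this configuration the criterion should typically attain its minimum at the true quadratic degree, illustrating how the $k\log n$ term offsets the improvement in fit from extra parameters.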
2. Extensions to Hidden Variables and Nonregular Structures
Standard BIC applies to regular models where parameter mappings are identifiable and the Fisher information is nonsingular. In models with hidden variables (e.g., Bayesian networks with latent nodes), the effective model dimension is not generally equal to the raw parameter count. Instead, the penalty should use the rank of the Jacobian of the mapping $\theta \mapsto \omega(\theta)$ from model parameters to the distribution over observables:
$$\mathrm{BIC} = \log p(y \mid \hat\theta) - \frac{d_{\mathrm{eff}}}{2}\log n, \qquad d_{\mathrm{eff}} = \operatorname{rank}\,\frac{\partial \omega(\theta)}{\partial \theta}$$
(Geiger et al., 2013). For instance, in naive Bayes with a hidden root and binary observables, the effective dimension is $1 + 2n$, not the full parameter count.
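The rank computation can be carried out numerically when the parameter-to-observable map is available in closed form. The sketch below builds a small naive Bayes map with a binary hidden root (a simplified stand-in for the models analyzed by Geiger et al., with a hypothetical parameter layout) and estimates the effective dimension as the rank of a finite-difference Jacobian at a random interior parameter point.

```python
import itertools
import numpy as np

def observable_dist(theta, n_obs):
    """Joint distribution over the 2**n_obs configurations of binary observables
    X_1..X_n in a naive Bayes model with a binary hidden root H.
    theta = [P(H=1), P(X_1=1|H=0), P(X_1=1|H=1), ..., P(X_n=1|H=0), P(X_n=1|H=1)]."""
    p = theta[0]
    q = theta[1:].reshape(n_obs, 2)                    # q[j, h] = P(X_j = 1 | H = h)
    probs = []
    for x in itertools.product([0, 1], repeat=n_obs):
        lik = np.array([1.0, 1.0])                     # P(x | H = 0), P(x | H = 1)
        for j, xj in enumerate(x):
            lik *= q[j] if xj == 1 else 1.0 - q[j]
        probs.append((1.0 - p) * lik[0] + p * lik[1])
    return np.array(probs)

def effective_dimension(n_obs, eps=1e-6, seed=0):
    """Rank of the finite-difference Jacobian of the parameter-to-observable map,
    evaluated at a random interior parameter point."""
    rng = np.random.default_rng(seed)
    theta = rng.uniform(0.2, 0.8, size=1 + 2 * n_obs)
    base = observable_dist(theta, n_obs)
    J = np.empty((base.size, theta.size))
    for i in range(theta.size):
        bumped = theta.copy()
        bumped[i] += eps
        J[:, i] = (observable_dist(bumped, n_obs) - base) / eps
    return int(np.linalg.matrix_rank(J, tol=1e-4))

for n_obs in (2, 3, 4):
    print(f"n_obs={n_obs}: raw parameters={1 + 2 * n_obs}, "
          f"Jacobian rank={effective_dimension(n_obs)}")
```

For very few observables the Jacobian rank is capped by the dimension of the observable simplex and falls below the raw parameter count, which is exactly the situation in which naive parameter counting over-penalizes.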
In singular models (e.g., mixture models, factor analysis with redundant factors, reduced-rank regression), the true log-marginal likelihood has the asymptotic form
$$\log p(y \mid M) = \log p(y \mid \hat\theta, M) - \lambda \log n + (m-1)\log\log n + O_p(1),$$
where $\lambda$ is the real log canonical threshold (RLCT) and $m$ is its multiplicity; for regular models, $\lambda = k/2$ and $m = 1$ (Drton et al., 2013, Watanabe, 2012). The “singular BIC” or sBIC replaces $k/2$ in the penalty with $\lambda$, correcting for the local geometry of the model near singularities.
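A minimal sketch of how the penalty in this expansion is assembled; the $\lambda$ and $m$ arguments are placeholders to be supplied from RLCT theory for the model at hand, and the numeric values below are purely illustrative, not computed from any model.

```python
import numpy as np

def log_marginal_approx(max_log_lik, n, lam, m=1):
    """Laplace/Watanabe-style approximation to the log-marginal likelihood:
    log p(y|M) ~ max_log_lik - lam * log(n) + (m - 1) * log(log(n)).
    Regular model: lam = k/2, m = 1 (the classical BIC up to a factor of -2).
    Singular model: lam = RLCT, m = its multiplicity (the sBIC-style penalty)."""
    return max_log_lik - lam * np.log(n) + (m - 1) * np.log(np.log(n))

# Illustrative numbers only: k = 6 raw parameters, n = 500 observations.
max_log_lik, n, k = -812.4, 500, 6
print("regular approximation :", log_marginal_approx(max_log_lik, n, lam=k / 2))
print("singular approximation:", log_marginal_approx(max_log_lik, n, lam=1.75, m=2))  # hypothetical RLCT
```

Because the RLCT satisfies $\lambda \le k/2$, the singular correction penalizes less than naive parameter counting, reflecting the reduced effective complexity near the singularity.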
3. Adjustments for Incomplete, Correlated, or High-dimensional Data
BIC assumes all parameters are informed by $n$ i.i.d. draws. In incomplete datasets, as in factor analysis with missing-at-random entries, only $n_j \le n$ samples may inform each variable $j$. The hierarchical BIC (HBIC) adjusts the penalty,
$$\mathrm{HBIC} = -2\log\hat L + \sum_j d_j \log n_j,$$
where $d_j$ and $n_j$ are the parameter count and observed count for variable $j$ (Zhao et al., 2022). HBIC is derived as a large-sample approximation to the variational Bayesian lower bound and asymptotically reduces to BIC as missingness vanishes.
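Assuming the per-variable form of the penalty written above, a short sketch of the HBIC complexity term computed from a missingness mask; the per-variable parameter counts $d_j$ are hypothetical placeholders for whatever the factor-analysis parameterization assigns to each variable.

```python
import numpy as np

def hbic_penalty(observed_mask, d_per_variable):
    """HBIC complexity term sum_j d_j * log(n_j), with n_j the number of
    observed samples in column j of the missingness mask."""
    n_j = observed_mask.sum(axis=0)
    return float(np.sum(d_per_variable * np.log(n_j)))

rng = np.random.default_rng(1)
n, p = 300, 8
mask = rng.random((n, p)) > 0.3                       # roughly 30% missing at random
d_j = np.full(p, 3)                                   # hypothetical: 3 parameters per variable
print("HBIC penalty:", round(hbic_penalty(mask, d_j), 1))
print("BIC penalty :", round(float(d_j.sum() * np.log(n)), 1))
```

Since $n_j \le n$, the HBIC penalty is never larger than the standard BIC penalty and the two coincide when no entries are missing.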
For clustered or longitudinal data in linear mixed-effects models, where observations are correlated and the i.i.d. assumption is violated, an effective sample size $n_e$ replaces $n$ in the penalty, so the complexity term becomes $k\log n_e$ with $n_e$ computed from the data correlation matrix $R$ (Shen et al., 2021). This adjustment aligns the penalty with the Fisher information content under dependence.
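One common illustration of an effective sample size derived from a correlation matrix is $n_e = \mathbf{1}^\top R^{-1}\mathbf{1}$; the sketch below uses that definition purely for intuition, and the specific estimator developed in Shen et al. (2021) may differ.

```python
import numpy as np

def effective_sample_size(R):
    """n_e = 1' R^{-1} 1 for a correlation matrix R; equals n when R = I and
    shrinks as within-cluster correlation grows (illustrative definition only)."""
    ones = np.ones(R.shape[0])
    return float(ones @ np.linalg.solve(R, ones))

# Equicorrelated cluster of 50 observations, pairwise correlation 0.4.
n, rho, k = 50, 0.4, 4                                # k: illustrative parameter count
R = (1 - rho) * np.eye(n) + rho * np.ones((n, n))
n_e = effective_sample_size(R)
print(f"n={n}, n_e={n_e:.1f}, k*log(n)={k*np.log(n):.1f}, k*log(n_e)={k*np.log(n_e):.1f}")
```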
In high-dimensional regression, a mixture-prior BIC fuses the traditional BIC and AIC penalties according to the sample size and covariate dimension, with a mixing weight adapted to $(n, p)$ (Kono et al., 2022).
4. Singular Models and Widely Applicable Formulations
In models for which the Fisher information is singular or the mapping from parameters to distributions is non-injective, standard BIC under- or over-penalizes. Watanabe’s Widely Applicable BIC (WBIC) approximates the log-marginal likelihood by evaluating the expected log-likelihood under a “tempered” posterior at inverse temperature $\beta = 1/\log n$:
$$\mathrm{WBIC} = \mathbb{E}^{\beta}_{\theta}\!\left[-\sum_{i=1}^{n}\log p(x_i \mid \theta)\right], \qquad \beta = \frac{1}{\log n},$$
where the expectation is taken under the posterior formed by raising the likelihood to the power $\beta$. Under mild regularity, $\mathrm{WBIC} = -\log p(y \mid M) + O_p(\sqrt{\log n})$, matching the two leading terms of the singular-model expansion (Watanabe, 2012, Friel et al., 2015). The learning coefficient $\lambda$ replaces the fixed $k/2$, accounting for model nonregularity. WBIC is consistent under both regular and algebraically singular model classes.
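A self-contained numerical sketch for a conjugate normal-mean toy model, where the tempered posterior is available in closed form and the exact log-marginal likelihood can be checked directly; it illustrates the tempering recipe only and does not reproduce any experiment from the cited papers.

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(42)
n, sigma, tau = 200, 1.0, 2.0                         # noise sd; prior mu ~ N(0, tau^2)
x = rng.normal(loc=0.7, scale=sigma, size=n)

# Tempered posterior for mu at inverse temperature beta = 1/log(n):
# prior * likelihood^beta is Gaussian with the precision/mean below.
beta = 1.0 / np.log(n)
prec = 1.0 / tau**2 + beta * n / sigma**2
post_mean = (beta * x.sum() / sigma**2) / prec
post_sd = np.sqrt(1.0 / prec)

# WBIC = E_tempered[ -sum_i log p(x_i | mu) ], estimated by Monte Carlo.
mu = rng.normal(post_mean, post_sd, size=20_000)
sq_err = ((x[None, :] - mu[:, None]) ** 2).sum(axis=1)
neg_loglik = 0.5 * n * np.log(2 * np.pi * sigma**2) + sq_err / (2 * sigma**2)
wbic = neg_loglik.mean()

# Exact negative log-marginal likelihood: x ~ N(0, sigma^2 I + tau^2 11') in this model.
cov = sigma**2 * np.eye(n) + tau**2 * np.ones((n, n))
neg_log_evidence = -multivariate_normal.logpdf(x, mean=np.zeros(n), cov=cov)

print(f"WBIC ~ {wbic:.1f}   exact -log marginal likelihood = {neg_log_evidence:.1f}")
```

For this regular toy model the Monte Carlo WBIC estimate should track the exact value to within a few units, consistent with the $O_p(\sqrt{\log n})$ error term; in genuinely singular models the tempered posterior would instead be sampled by MCMC.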
For even greater bias correction in WBIC, an explicit adjustment subtracts a singular fluctuation term based on the variance of the log-likelihood under the tempered posterior (Imai, 2019).
5. Specialized Forms and Related Criteria
Applications to model selection in autoregressive models, clustering, and community detection in networks have motivated further extensions:
- Autoregression: The “bridge criterion” interpolates between BIC (consistent when the true order is finite) and AIC (prediction-efficient when the process order is infinite or unknown) by replacing the linear-in-order penalty with a harmonic-type penalty proportional to $\sum_{j=1}^{k} 1/j$ in the candidate order $k$ (Ding et al., 2015). The bridge criterion is adaptive, retaining the consistency of BIC or the efficiency of AIC depending on the true model's complexity.
- Clustering: For partitioning normally distributed data, the closed-form “exact BIC” replaces the single global $\log n$ used in standard BIC with per-cluster terms $\log n_c$ over the cluster sizes $n_c$, avoiding over- or under-penalization when clusters are small or unbalanced (Webster, 2020); a schematic comparison of the two penalties appears in the sketch after this list.
- Network Models: In stochastic block models (SBMs), the corrected BIC (CBIC) adds a further penalty reflecting the combinatorial complexity of community assignments, maintaining consistency in detecting the number of communities, particularly in sparse or heterogeneous networks (Hu et al., 2016).
- Uncertainty-Penalized BIC: In data-driven PDE discovery, the uncertainty-penalized information criterion (UBIC) augments BIC with an additive penalty proportional to a quantified estimator uncertainty $U$, and can be interpreted as BIC applied to an overparameterized design (Thanasutives et al., 23 Apr 2024).
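To make the clustering adjustment above concrete, here is a hedged sketch contrasting the single global $k\log n$ penalty with a per-cluster $\sum_c d_c \log n_c$ term; the allocation of $d_c$ parameters to each cluster is an illustrative assumption, not the exact expression derived by Webster (2020).

```python
import numpy as np

def global_penalty(n, k):
    """Standard BIC complexity term: one global log(n) for all k parameters."""
    return k * np.log(n)

def per_cluster_penalty(cluster_sizes, d_per_cluster):
    """Per-cluster complexity term: each cluster's d_c parameters are penalized
    with log(n_c), the size of the cluster that informs them."""
    return float(sum(d * np.log(nc) for d, nc in zip(d_per_cluster, cluster_sizes)))

# Three unbalanced clusters; assume (hypothetically) 5 free parameters per cluster.
cluster_sizes = [450, 40, 10]
d_per_cluster = [5, 5, 5]
print("global k*log(n):", round(global_penalty(sum(cluster_sizes), sum(d_per_cluster)), 1))
print("per-cluster sum:", round(per_cluster_penalty(cluster_sizes, d_per_cluster), 1))
```

The gap between the two penalties grows as the smallest clusters shrink, which is precisely the regime in which a single global $\log n$ misstates the information available for cluster-specific parameters.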
6. Assumptions, Consistency, and Practical Recommendations
BIC derivations rely on asymptotic ($n \to \infty$) and regularity assumptions:
- i.i.d. sampling or a relevant adaptation thereof (e.g., effective sample size)
- Likelihood smoothness and unique (or ridge) MLE
- Prior positive and smooth near the MLE
- For extended/latent/hidden-variable models: analytic/polynomial parameter mappings and identifiability of parameter-to-observable transformations
Consistency of BIC and many extensions is established under fixed model classes and standard regularity. For singular models, consistency depends on the learning coefficient and the singular structure, but properly penalized criteria (WBIC, sBIC) remain consistent under the relevant algebraic-geometric assumptions.
When data are incomplete or correlated, variable-specific sample size or effective sample size corrections are essential to avoid over-penalizing or underestimating model complexity. In high-dimensional contexts, the choice of regularization prior or mixture weights tunes BIC-like criteria between BIC and AIC-like behavior, providing robust selection across regimes.
For model selection in latent variable networks and singular models, using the Jacobian rank or the real log canonical threshold (RLCT) ensures that only truly identifiable parameter combinations are penalized, correcting the biases that arise from naive parameter counting.
7. Comparative Performance and Application Scenarios
Empirical results across synthetic and real datasets demonstrate:
- BIC and its variants are strongly conservative and prevent overfitting in low-dimensional, well-specified settings.
- AIC and related “light penalty” measures favor predictive efficiency in nonparametric and misspecified regimes but tend to overfit finite models.
- Novel adaptive criteria and principled penalty corrections (HBIC, UBIC, bridge criterion, MPIC, sBIC) allow models to adapt their complexity penalties to data structure, missingness, dependence, or high-dimensionality.
- In clustering and community detection, exact or corrected BIC expressions prevent the pathologies of standard BIC in small cluster or large community-number regimes.
In summary, BIC and its extensive extensions provide a rigorous asymptotic foundation for model selection in a broad array of parametric, semiparametric, latent variable, singular, high-dimensional, and non-independently sampled data contexts. Careful adaptation of the penalty term to effective dimension, sample information, and model singularity structure is critical for robust and consistent model selection.