Automated Bayesian Model Discovery

Updated 9 May 2026

Automated Bayesian Model Discovery is a suite of frameworks that use Bayesian inference to automatically select or construct probabilistic models, adapting model structure to the data.
It integrates methodologies such as nonparametric Bayesian inference, active marginal likelihood estimation, and meta-learning for AutoML to efficiently explore vast model spaces.
These approaches enable adaptive, interpretable, and uncertainty-aware learning for applications including clustering, symbolic regression, causal discovery, and exploratory data analysis.

Automated Bayesian Model Discovery refers to a collection of principled computational frameworks that select or construct probabilistic models from data through explicit Bayesian reasoning, without manual intervention. Such frameworks employ generative modeling, marginal likelihood (model evidence) computation, and optimization or search over complex (often combinatorial or infinite) spaces of candidate models. Their scope includes mixture models with an unknown number of components, structure and parameter learning in graphical models, discovery of new classes or clusters, symbolic regression, model-based causal discovery, and dynamic, data-driven engineering of agent models. The central aim is to let the model space, evidence, and data jointly determine inductive complexity, structure, and parametric form, enabling adaptive, interpretable, and uncertainty-aware learning.

1. Bayesian Nonparametric and Mixture Models

Automated Bayesian model discovery emerged early in nonparametric Bayesian inference, where models employ priors over countably infinite sets of structures. The Dirichlet Process (DP) and its generalizations form the canonical prior for mixture models with unknown numbers of clusters or classes. In the framework proposed by Dundar et al., class-parameter atoms $\{\theta_k\}$ are drawn from the base measure $H$ and weighted by stick-breaking proportions $\{\pi_k\}$ governed by the concentration parameter $\alpha$, giving the discrete random measure

$G = \sum_{k=1}^\infty \pi_k \, \delta_{\theta_k}, \quad G \sim \mathrm{DP}(\alpha, H)$
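As a minimal sketch of how such a random measure can be simulated, the following truncates the stick-breaking construction after a fixed number of atoms. The truncation level, the seed, and the use of a standard normal as a stand-in for the base measure $H$ are illustrative assumptions, not part of the cited framework.

```python
import random

def stick_breaking_weights(alpha, num_atoms, rng):
    """Return the first `num_atoms` stick-breaking weights pi_k and the unassigned mass."""
    weights = []
    remaining = 1.0  # length of the stick that has not been broken off yet
    for _ in range(num_atoms):
        beta_k = rng.betavariate(1.0, alpha)  # beta_k ~ Beta(1, alpha)
        weights.append(remaining * beta_k)    # pi_k = beta_k * prod_{j<k} (1 - beta_j)
        remaining *= 1.0 - beta_k
    return weights, remaining

rng = random.Random(0)
weights, leftover = stick_breaking_weights(alpha=2.0, num_atoms=20, rng=rng)

# Atoms theta_k ~ H; a standard normal is used here purely as a placeholder for H.
atoms = [rng.gauss(0.0, 1.0) for _ in weights]
```

Smaller values of `alpha` concentrate the mass on the first few atoms, while larger values spread it over many atoms, which is how the prior expresses a preference for few or many clusters.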

Each observation is modeled as normal given its class parameter $\theta_k = (\mu_k, \Sigma_k)$, with $H$ chosen as a Normal–Inverse–Wishart prior to ensure conjugacy and closed-form predictive densities: $x_n \mid z_n=k, \theta_k \sim \mathcal N(x_n \mid \mu_k, \Sigma_k)$. Sequential Monte Carlo (SMC) inference updates the posterior over potential partitions and cluster parameters as data arrive, resampling per data point and updating sufficient statistics. Emerging classes are detected automatically as new clusters, based purely on Bayesian predictive mass: if $\alpha\,p(x_{n+1}) > \max_j n_j\,p(x_{n+1} \mid D_j)$, where $p(x_{n+1})$ is the predictive density under the prior, $n_j$ is the size of cluster $j$, and $D_j$ is the data assigned to it, the new sample is most likely explained by a new class and the model's complexity increases (Dundar et al., 2012).

This formalism is extensible to other exchangeable nonparametric priors (e.g., Pitman–Yor process, normalized inverse-Gaussian process), and to alternative likelihood–prior pairs supporting efficient computation. Thus, the essential automation principle is realized: data complexity and model structure self-adapt online, with no need for user-tuned thresholds or ad hoc heuristics.
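The new-class test described in this section can be illustrated with a deliberately simplified sketch. It assumes one-dimensional data, a known observation variance, and a normal prior on each cluster mean, in place of the multivariate Normal–Inverse–Wishart setting; the function names and default hyperparameters are hypothetical.

```python
import math

def normal_pdf(x, mean, var):
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def cluster_predictive(x, points, m0, s0_sq, sigma_sq):
    """Posterior predictive p(x | D_j) for a normal mean with known variance sigma_sq."""
    n = len(points)
    post_var = 1.0 / (1.0 / s0_sq + n / sigma_sq)
    post_mean = post_var * (m0 / s0_sq + sum(points) / sigma_sq)
    return normal_pdf(x, post_mean, post_var + sigma_sq)

def is_new_class(x, clusters, alpha, m0=0.0, s0_sq=25.0, sigma_sq=1.0):
    """Apply the rule alpha * p(x) > max_j n_j * p(x | D_j)."""
    new_mass = alpha * normal_pdf(x, m0, s0_sq + sigma_sq)  # prior predictive mass
    existing_mass = max(
        len(points) * cluster_predictive(x, points, m0, s0_sq, sigma_sq)
        for points in clusters
    )
    return new_mass > existing_mass

clusters = [[-3.1, -2.9, -3.0], [2.8, 3.2, 3.0, 3.1]]
near_existing = is_new_class(3.05, clusters, alpha=1.0)  # close to the second cluster
far_away = is_new_class(12.0, clusters, alpha=1.0)       # far from both clusters
```

In this toy setting the point near an existing cluster is absorbed by it, while the distant point is attributed to a new class because the prior predictive mass dominates.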

2. Bayesian Marginal Likelihood and Active Model Comparison

A recurring pillar is the use of marginal likelihood (model evidence) as the basis for model comparison and discovery: $Z_m = p(D \mid M_m) = \int p(D \mid \theta_m, M_m) \, p(\theta_m \mid M_m) \, d\theta_m$. In many scientific settings, however, these evidence integrals are expensive to estimate. The Active Bayesian Quadrature (Active-BQ) method places Gaussian process priors on the model likelihoods and maximizes the mutual information between prospective likelihood observations and the posterior probability of each model. Acquisition therefore targets the likelihood computations that most reduce uncertainty about the model posterior: $I(\ell_i(\theta_i); z_1) = H(\ell_i(\theta_i)) - H(\ell_i(\theta_i) \mid z_1)$. At each iteration, a new evaluation point $\theta_i$ is chosen by maximizing this mutual information, and the corresponding likelihood is evaluated. Empirically, Active-BQ needs 2–5× fewer likelihood calls than bridge sampling, reversible-jump MCMC, or vanilla Bayesian quadrature to reach the same posterior accuracy, particularly in model selection contexts where only the posterior over models matters (Chai et al., 2019).
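To make the role of the evidence integral concrete, here is a small sketch comparing two coin-flip models, one with a fixed bias and one with a uniform prior on the bias. It is not Active-BQ; it only shows the quantity $Z_m$ that such methods estimate, once in closed form and once by naive Monte Carlo over the prior. The data and sample size are made up for illustration.

```python
import math
import random

def log_beta(a, b):
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

heads, tails = 9, 1

# M_fair: theta fixed at 0.5, so the evidence is just the likelihood of the sequence.
z_fair = 0.5 ** (heads + tails)

# M_uniform: theta ~ Beta(1, 1); the integral of theta^h (1 - theta)^t is a Beta function.
z_uniform = math.exp(log_beta(heads + 1, tails + 1) - log_beta(1, 1))

# The same integral by simple Monte Carlo: average the likelihood over prior draws.
rng = random.Random(0)
draws = [rng.random() for _ in range(20000)]
z_mc = sum(t ** heads * (1.0 - t) ** tails for t in draws) / len(draws)

# Posterior model probabilities under equal prior weight on the two models.
total = z_fair + z_uniform
posterior = {"fair": z_fair / total, "uniform": z_uniform / total}
```

The Monte Carlo estimate needs thousands of likelihood evaluations to match a one-line closed form, which is the cost that sample-efficient schemes aim to reduce when no closed form exists.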

3. Automated Machine Learning (AutoML) via Bayesian Meta-Learning

Automated model discovery is foundational in AutoML systems, which aim to fully automate model and hyperparameter selection. Meta-learning approaches encode historical experimental results (meta-data) to construct predictive Bayesian priors over pipeline performance. In Adaptive Bayesian Linear Regression (ABLR), a learned neural basis embeds both pipeline and dataset meta-features; Bayesian linear regression over this basis yields predictive distributions (with closed-form mean and variance) for new pipelines. Acquisition is performed via expected improvement (EI), balancing exploration and exploitation. The meta-learned surrogate is adapted online to each new dataset through rapid, closed-form updates without retraining the basis representation. Experiments show that ABLR+EI rapidly outperforms random search and established packages (e.g., Auto-Sklearn) in identifying high-performing pipelines, demonstrating efficient model discovery guided by prior meta-data (Zhou et al., 2019).

Lifelong Bayesian Optimization (LBO) further generalizes this by placing correlated GP priors over the performance functions on an evolving sequence of datasets, using an Indian-Buffet Process prior to enable latent function sharing and automatic task-level transfer, with variational ELBO inference for scalability (Zhang et al., 2019).
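The closed-form surrogate and EI acquisition described for ABLR can be sketched as follows. This is an illustration under strong simplifications: the neural basis is replaced by the fixed basis $\phi(x) = [1, x]$ on a single hypothetical pipeline feature, and the prior precision `alpha`, noise precision `beta`, and all data values are assumed for the example.

```python
import math

def blr_posterior(xs, ys, alpha=1.0, beta=25.0):
    """Weight posterior for phi(x) = [1, x] with prior N(0, I / alpha) and noise precision beta."""
    # Posterior precision A = alpha * I + beta * Phi^T Phi, written out for the 2x2 case.
    a00 = alpha + beta * len(xs)
    a01 = beta * sum(xs)
    a11 = alpha + beta * sum(x * x for x in xs)
    det = a00 * a11 - a01 * a01
    cov = [[a11 / det, -a01 / det], [-a01 / det, a00 / det]]
    b0 = beta * sum(ys)
    b1 = beta * sum(x * y for x, y in zip(xs, ys))
    mean = [cov[0][0] * b0 + cov[0][1] * b1, cov[1][0] * b0 + cov[1][1] * b1]
    return mean, cov

def predict(x, mean, cov, beta=25.0):
    """Closed-form predictive mean and standard deviation at x."""
    mu = mean[0] + mean[1] * x
    var = 1.0 / beta + cov[0][0] + 2.0 * x * cov[0][1] + x * x * cov[1][1]
    return mu, math.sqrt(var)

def expected_improvement(mu, sigma, best):
    """EI for maximization: E[max(f - best, 0)] with f ~ N(mu, sigma^2)."""
    z = (mu - best) / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return (mu - best) * cdf + sigma * pdf

# Hypothetical observations: one scalar feature per pipeline and its validation score.
observed_x = [0.1, 0.4, 0.5, 0.9]
observed_y = [0.62, 0.70, 0.71, 0.80]
mean, cov = blr_posterior(observed_x, observed_y)
best_so_far = max(observed_y)

candidates = [0.0, 0.25, 0.75, 1.0]
scores = {}
for x in candidates:
    mu, sigma = predict(x, mean, cov)
    scores[x] = expected_improvement(mu, sigma, best_so_far)
next_x = max(scores, key=scores.get)  # candidate pipeline to evaluate next
```

Because the posterior is available in closed form, adding one more observation only requires recomputing a small linear system, which is what makes per-dataset adaptation cheap in this family of surrogates.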

4. Probabilistic Program Synthesis and Structure Learning

In expressive domains, the model space is itself a space of programs: structured objects generated by probabilistic grammars. In this setting, Bayesian synthesis formalizes model discovery as inference over probabilistic programs $E$ from a domain-specific language (DSL) defined by a probabilistic context-free grammar. The full Bayesian posterior

$p(E \mid D) \propto p(D \mid E) \, p(E)$

is approximated via MCMC: local program edits propose a new candidate $E'$, with the acceptance probability governed by the likelihood ratio $p(D \mid E') / p(D \mid E)$ times a tree-size correction. This process is provably valid under reasonable conditions.
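A toy version of this sampler is sketched below, assuming a four-rule arithmetic grammar, a Gaussian likelihood, and made-up data; none of these are taken from the cited work. A proposal regrows a uniformly chosen subtree from the grammar prior, so the prior terms cancel and the Metropolis-Hastings ratio reduces to the likelihood ratio times the ratio of tree sizes.

```python
import math
import random

def sample_expr(rng):
    """Draw an expression from a toy PCFG: E -> x | const | (E + E) | (E * E)."""
    u = rng.random()
    if u < 0.4:
        return "x"
    if u < 0.7:
        return rng.choice([1.0, 2.0, 3.0])
    op = "+" if u < 0.85 else "*"
    return (op, sample_expr(rng), sample_expr(rng))

def evaluate(expr, x):
    if isinstance(expr, tuple):
        op, left, right = expr
        a, b = evaluate(left, x), evaluate(right, x)
        return a + b if op == "+" else a * b
    return x if expr == "x" else expr

def node_paths(expr, prefix=()):
    """List the path (sequence of child indices) to every node in the tree."""
    paths = [prefix]
    if isinstance(expr, tuple):
        paths += node_paths(expr[1], prefix + (1,))
        paths += node_paths(expr[2], prefix + (2,))
    return paths

def replace_at(expr, path, new_subtree):
    if not path:
        return new_subtree
    parts = list(expr)
    parts[path[0]] = replace_at(expr[path[0]], path[1:], new_subtree)
    return tuple(parts)

def log_likelihood(expr, data, sigma=0.5):
    total = 0.0
    for x, y in data:
        residual = (y - evaluate(expr, x)) / sigma
        total += -0.5 * residual * residual
    return total

def mcmc(initial, data, steps, rng):
    current, current_ll = initial, log_likelihood(initial, data)
    best, best_ll = current, current_ll
    for _ in range(steps):
        paths = node_paths(current)
        proposal = replace_at(current, rng.choice(paths), sample_expr(rng))
        proposal_ll = log_likelihood(proposal, data)
        # Likelihood ratio times the tree-size correction |current| / |proposal|.
        size_ratio = len(paths) / len(node_paths(proposal))
        log_accept = proposal_ll - current_ll + math.log(size_ratio)
        if log_accept >= 0 or rng.random() < math.exp(log_accept):
            current, current_ll = proposal, proposal_ll
            if current_ll > best_ll:
                best, best_ll = current, current_ll
    return best, best_ll

rng = random.Random(0)
data = [(x, x * x + 1.0) for x in (-2.0, -1.0, 0.0, 1.0, 2.0)]
initial_ll = log_likelihood("x", data)
best, best_ll = mcmc("x", data, steps=3000, rng=rng)
```

The chain visits program structures in proportion to their posterior probability; tracking the best-scoring program, as done here, is a convenience for inspection rather than part of the sampler.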

Synthesized models can capture structural properties of time series (e.g., periodicity, trend, change point), partitionings of variables in tables, and other qualitative attributes. Synthesis can be linked to a universal inference backend (e.g., Venture), enabling prediction and uncertainty quantification. Empirical results show superior accuracy in structure discovery and forecasting versus fixed model classes or standard heuristics (Saad et al., 2019).

5. Automated Bayesian Causal Structure and Equation Discovery

Automated Bayesian model discovery includes the task of learning causal relationships from observational data. Given $n$ variables, all directed acyclic graphs (DAGs) over them are considered as candidate structures. For the linear, acyclic case, the likelihood of a graph $G$ marginalizes over its parameters $\theta$: $p(D \mid G) = \int p(D \mid \theta, G) \, p(\theta \mid G) \, d\theta$. Priors are modular across nodes, and two modes of inference are supported: exhaustive scoring (for small numbers of variables) and greedy local moves (for larger problems). Non-Gaussian errors (via Laplace estimates or Gaussian mixtures for error distributions) enable full identifiability of causal graphs beyond equivalence classes. The best graph or the posterior over graphs is computed, with results that are calibrated and competitive with or superior to constraint-based and non-Gaussianity-based methods in empirical studies (Hoyer et al., 2012).
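The exhaustive mode can be illustrated by enumerating every DAG over a handful of nodes and normalizing per-graph scores into a posterior. In this sketch the log score is a placeholder that merely penalizes edge count; a real implementation would substitute the log marginal likelihood plus log prior of each graph.

```python
import math
from itertools import product

def is_acyclic(n, edges):
    """Kahn-style check: repeatedly remove nodes that have no incoming edges."""
    indegree = [0] * n
    for _, child in edges:
        indegree[child] += 1
    frontier = [v for v in range(n) if indegree[v] == 0]
    removed = 0
    while frontier:
        v = frontier.pop()
        removed += 1
        for parent, child in edges:
            if parent == v:
                indegree[child] -= 1
                if indegree[child] == 0:
                    frontier.append(child)
    return removed == n

def all_dags(n):
    """Enumerate every directed acyclic graph on n labeled nodes as an edge list."""
    pairs = [(i, j) for i in range(n) for j in range(n) if i != j]
    dags = []
    for mask in product([False, True], repeat=len(pairs)):
        edges = [pair for pair, keep in zip(pairs, mask) if keep]
        if is_acyclic(n, edges):
            dags.append(edges)
    return dags

def posterior_over_graphs(dags, log_score):
    """Normalize log scores into probabilities, subtracting the max for stability."""
    scores = [log_score(g) for g in dags]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]

dags3 = all_dags(3)
dags4 = all_dags(4)

# Placeholder score (an assumption for illustration): prefer sparser graphs.
probs = posterior_over_graphs(dags3, lambda edges: -1.0 * len(edges))
```

The number of labeled DAGs grows from 25 at three nodes to 543 at four and super-exponentially thereafter, which is why exhaustive scoring is limited to small problems and local search takes over beyond that.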

In symbolic regression, the search is over algebraic model structures generated by symbolic grammars. The Bayesian approach penalizes model complexity via explicit priors, uses the Laplace approximation for model evidence, and quantitatively links inference to information-theoretic (minimum description length) and statistical-physics (Boltzmann distribution) formalisms. MCMC over model structures allows full posterior quantification and model averaging. This procedure routinely avoids overfitting and enables ensemble-based prediction with quantified uncertainty, robust even in high noise (Guimera et al., 22 Jul 2025).
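The link between evidence approximation, description length, and Boltzmann weights can be sketched on a tiny candidate set. The example below assumes three hand-written model structures, a BIC-style approximation to the log evidence (stated up to a constant shared by all models), and made-up structure priors; it does not search over a grammar.

```python
import math
import random

def fit_constant(xs, ys):
    c = sum(ys) / len(ys)
    return lambda x: c

def fit_through_origin(xs, ys):
    a = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)
    return lambda x: a * x

def fit_affine(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    a = sxy / sxx
    b = my - a * mx
    return lambda x: a * x + b

# name -> (least-squares fitter, number of parameters, assumed log prior of the structure)
CANDIDATES = {
    "c": (fit_constant, 1, math.log(0.5)),
    "a*x": (fit_through_origin, 1, math.log(0.3)),
    "a*x+b": (fit_affine, 2, math.log(0.2)),
}

def description_length(name, xs, ys):
    """BIC / 2 minus the log structure prior; lower means more plausible."""
    fitter, k, log_prior = CANDIDATES[name]
    f = fitter(xs, ys)
    n = len(xs)
    rss = sum((y - f(x)) ** 2 for x, y in zip(xs, ys))
    bic = k * math.log(n) + n * math.log(rss / n)
    return 0.5 * bic - log_prior

rng = random.Random(1)
xs = [5.0 * i / 49 for i in range(50)]
ys = [2.0 * x + 1.0 + rng.gauss(0.0, 0.3) for x in xs]

lengths = {name: description_length(name, xs, ys) for name in CANDIDATES}
shortest = min(lengths.values())
# Boltzmann-style weights: p(model | D) is proportional to exp(-description length).
unnormalized = {name: math.exp(-(dl - shortest)) for name, dl in lengths.items()}
total = sum(unnormalized.values())
weights = {name: w / total for name, w in unnormalized.items()}
best = min(lengths, key=lengths.get)
```

With data generated from a line with a nonzero intercept, the affine structure has by far the shortest description length, and the normalized weights are what a model-averaged prediction would use.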

6. Autonomous Bayesian Model Building Agents and Interactive Discovery

Recent advancements involve autonomous agents that directly manipulate the specification of statistical models, given only performance and diagnostic feedback. AutoStan exemplifies such agents. At each iteration, the agent proposes edits to model code (e.g., in Stan), runs full Bayesian inference (MCMC), and then judges the change via negative log predictive density (NLPD) on held-out data and sampler diagnostics (divergences, $\hat{R}$, ESS). A change is accepted if it lowers NLPD while maintaining valid convergence; otherwise, the system reverts to the previous model. No explicit critic, domain knowledge, or manual search algorithm is provided. AutoStan reliably discovers non-standard model structures (robust likelihoods, mixture contamination, hierarchical partial pooling, non-centered parameterizations, etc.) across a range of regression and hierarchical modeling tasks, achieving competitive or superior performance to black-box neural baselines, with fully interpretable output (Dürr, 29 Mar 2026).
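The propose, fit, judge, and revert cycle can be sketched as a generic loop. Everything concrete here is an assumption for illustration: the function names, the result dictionary layout, the thresholds (zero divergences, $\hat{R} \le 1.01$), and the canned results that stand in for real MCMC runs.

```python
def search(initial, propose, fit_and_score, steps, rhat_max=1.01):
    """Keep a candidate only if it converges and lowers held-out NLPD; else revert."""
    best_model = initial
    best_result = fit_and_score(initial)
    history = []
    for _ in range(steps):
        candidate = propose(best_model)
        result = fit_and_score(candidate)
        converged = result["divergences"] == 0 and result["rhat"] <= rhat_max
        if converged and result["nlpd"] < best_result["nlpd"]:
            best_model, best_result = candidate, result
            history.append(("accept", candidate))
        else:
            history.append(("revert", candidate))
    return best_model, best_result, history

# Canned stand-ins for real model fits (hypothetical numbers, not measured results).
CANNED_RESULTS = {
    "normal": {"nlpd": 1.50, "divergences": 0, "rhat": 1.00},
    "student_t": {"nlpd": 1.20, "divergences": 0, "rhat": 1.00},
    "mixture": {"nlpd": 1.10, "divergences": 12, "rhat": 1.08},
    "hierarchical": {"nlpd": 1.05, "divergences": 0, "rhat": 1.00},
}
proposals = iter(["student_t", "mixture", "hierarchical"])

best_model, best_result, history = search(
    initial="normal",
    propose=lambda current: next(proposals),
    fit_and_score=lambda model: CANNED_RESULTS[model],
    steps=3,
)
```

In this scripted run the mixture candidate is rejected despite its lower NLPD because its diagnostics fail, which shows why the convergence check is applied before the predictive comparison.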

7. Automated Bayesian Model Discovery for Exploratory Data Analysis

Fully automated Bayesian density analysis can be realized through probabilistic circuit models such as Sum–Product Networks (SPNs), with full Bayesian inference over both latent structures (mixtures and factorizations) and data-type/likelihood families. ABDA, for instance, builds SPNs where each leaf is permitted to choose (via its own posterior) among a dictionary of plausible likelihoods, and structures are learned by co-clustering variables and instances using mutual-dependence tests. Gibbs sampling provides tractable, efficient inference over all latent structure, mixture weights, and likelihood parameters. The result is a single pipeline capable of discovering latent variable hierarchies, data types, dependencies, and interpretable rules, and of providing robust imputation and anomaly scoring, all under the Bayesian paradigm (Vergari et al., 2018).
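The idea of a leaf choosing among a dictionary of likelihoods can be illustrated in isolation. The sketch below assumes a two-entry dictionary for count data (Poisson with a Gamma prior, geometric with a Beta prior), a uniform prior over the dictionary, and made-up observations; it computes the exact posterior over the two families from their closed-form marginal likelihoods rather than running a Gibbs sampler over a full SPN.

```python
import math

def log_beta(a, b):
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def log_marginal_poisson(counts, a=1.0, b=1.0):
    """Log evidence of a Poisson likelihood with rate ~ Gamma(shape=a, rate=b)."""
    n, s = len(counts), sum(counts)
    return (
        a * math.log(b) - math.lgamma(a)
        + math.lgamma(a + s) - (a + s) * math.log(b + n)
        - sum(math.lgamma(x + 1) for x in counts)
    )

def log_marginal_geometric(counts, a=1.0, b=1.0):
    """Log evidence of p(x) = (1 - p)^x p on x = 0, 1, 2, ... with p ~ Beta(a, b)."""
    n, s = len(counts), sum(counts)
    return log_beta(a + n, b + s) - log_beta(a, b)

def leaf_type_posterior(counts):
    """Posterior over the likelihood dictionary under a uniform prior on its entries."""
    log_evidence = {
        "poisson": log_marginal_poisson(counts),
        "geometric": log_marginal_geometric(counts),
    }
    top = max(log_evidence.values())
    unnormalized = {k: math.exp(v - top) for k, v in log_evidence.items()}
    total = sum(unnormalized.values())
    return {k: w / total for k, w in unnormalized.items()}

# Counts tightly concentrated around 10, which a geometric distribution fits poorly.
counts = [9, 11, 10, 8, 12, 10, 9, 11, 10, 10]
posterior = leaf_type_posterior(counts)
```

In a sampler, a categorical draw from this posterior would play the role of the per-leaf likelihood choice, repeated alongside updates of the structure and mixture weights.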

In summary, automated Bayesian model discovery encompasses a spectrum of methods that formalize and automate all or part of the modeling process through Bayesian inference, ranging from nonparametric clustering and program synthesis to symbolic regression, causal graph search, AutoML pipeline selection, and full model code synthesis with predictive feedback. Core design elements include probabilistic model spaces, marginal likelihood-driven selection or averaging, posterior quantification over structures and parameters, scalable and often amortized or sequential inference, and objective performance measures or diagnostics as guiding signals. Together these provide a principled basis for model selection, adaptivity to data complexity, quantified uncertainty, and interpretable output, positioning the approach as central to the development of self-constructing, scientifically credible machine learning systems. Key contributions are documented in (Dundar et al., 2012; Chai et al., 2019; Zhou et al., 2019; Vergari et al., 2018; Zhang et al., 21 Feb 2025; Demir et al., 2020; Saad et al., 2019; Hoyer et al., 2012; Guimera et al., 22 Jul 2025; Zhang et al., 2019; Dürr, 29 Mar 2026).