
Bayesian Program Learning (BPL)

Updated 27 November 2025
  • Bayesian Program Learning (BPL) is a probabilistic framework that infers structured, compositional programs from observed data using Bayesian inference.
  • BPL unifies program synthesis, abstraction discovery, and one-shot learning, supporting applications such as low-resource handwritten text recognition.
  • Extensions such as DreamCoder enhance BPL with neural proposal models and wake-sleep cycles to enable scalable and efficient generative inference.

Bayesian Program Learning (BPL) is a probabilistic framework for learning and inferring structured, compositional programs as concise explanations for observed data. BPL posits a joint generative model in which data are hypothesized to arise from latent programs, which themselves are endowed with program-like structure and stochasticity. The central inference problem is to recover these latent programs—often from very limited exemplars—by maximizing (or otherwise approximating) the posterior probability of candidate programs given the data. BPL’s methodology unifies program synthesis, probabilistic modeling, and structured abstraction learning under a Bayesian paradigm.

1. Bayesian Generative Models for Program Induction

At the core of BPL is a generative model that specifies a prior over programs and a likelihood function that explains how programs yield data. For instance, in the case of handwritten character learning, each character is generated by a program $\psi$ composed of a sequence of drawing primitives and part relations, which, when executed, trace the strokes of the character. The generative process can be written as:

$$p(\psi, I) = p(\psi)\, p(I \mid \psi)$$

where $p(\psi)$ is the prior over programs, defined compositionally in terms of the number and type of primitives and their relations, and $p(I \mid \psi)$ is the likelihood, possibly integrating over nuisance variables such as pose, scale, and noise:

p(Iψ)=p(θψ)p(Iθ)dθp(I\,|\,\psi) = \int p(\theta\,|\,\psi)\,p(I\,|\,\theta)\,d\theta

The prior itself may be hierarchical, for instance,

$$p(\psi) = p(\kappa) \prod_{i=1}^{\kappa} p(S_i \mid \kappa)\, p(R_i \mid S_{1:i-1}, \kappa)$$

with $\kappa$ the number of primitives, $S_i$ the identity of subparts, and $R_i$ the relational glue.
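A toy sketch of drawing $\psi \sim p(\psi)$ from such a hierarchical prior follows; the primitive and relation inventories and the uniform choices are illustrative assumptions, not the grammar of Lake et al.

```python
import random

PRIMITIVES = ["line", "arc", "hook", "loop"]      # illustrative stroke primitives
RELATIONS = ["independent", "attach_start", "attach_end", "attach_along"]

def sample_program(rng=None, max_parts=5):
    """Draw psi ~ p(psi) = p(kappa) prod_i p(S_i | kappa) p(R_i | S_{1:i-1}, kappa),
    with uniform choices standing in for the learned distributions."""
    rng = rng or random.Random(0)
    kappa = rng.randint(1, max_parts)             # number of parts, p(kappa)
    parts = []
    for i in range(kappa):
        s_i = rng.choice(PRIMITIVES)              # subpart identity, p(S_i | kappa)
        r_i = "independent" if i == 0 else rng.choice(RELATIONS)  # relation to earlier parts
        parts.append({"subpart": s_i, "relation": r_i})
    return {"kappa": kappa, "parts": parts}

print(sample_program())
```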

Program induction then aims to infer the posterior:

p(ψI(1))p(ψ)p(I(1)ψ)p(\psi\,|\,I^{(1)}) \propto p(\psi)\,p(I^{(1)}\,|\,\psi)

which is typically approximated via stochastic parsing, beam search, or neural proposal models, depending on the domain and scale (Souibgui et al., 2021, Ellis et al., 2020, Hwang et al., 2011).
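The generic shape of such posterior-guided search is sketched below: score proposed parses by prior plus likelihood and keep a small beam. `propose_parses`, `log_prior`, and `log_likelihood` are assumed interfaces standing in for a stochastic parser or neural proposal model and the model terms above.

```python
import heapq

def map_program(image, propose_parses, log_prior, log_likelihood, beam_size=5):
    """Approximate argmax_psi p(psi | I) ∝ p(psi) p(I | psi) over proposed parses.

    propose_parses(image) -> iterable of candidate programs, e.g. from a stochastic
                             parser or a neural proposal network (assumed hook)
    """
    scored = ((log_prior(psi) + log_likelihood(image, psi), psi)
              for psi in propose_parses(image))
    beam = heapq.nlargest(beam_size, scored, key=lambda pair: pair[0])
    return beam[0][1], beam   # MAP program and the full beam of (score, program) pairs
```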

2. One-shot and Low-resource Learning

A hallmark of BPL is its efficacy in “one-shot” or low-resource settings. Given a single exemplar $I^{(1)}$ of a novel concept (e.g., a glyph), BPL parses the sample into a program $\psi$ via random walks over stroke graphs, normalization, and compositional priors. The resulting $\psi$ can then be executed multiple times with stochastic perturbations to produce a wide variety of synthetic tokens, enabling compositional generalization even with minimal supervision (Souibgui et al., 2021).

This approach is exemplified in low-resource handwritten text recognition, where a single image for each character is sufficient to induce generative programs, synthesize large amounts of data, and form the supervised basis for downstream recognition systems.
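A compact sketch of this one-shot loop is given below, assuming hypothetical `parse`, `score`, `sample_nuisance`, and `render` hooks rather than any published API: infer a program from a single exemplar, then re-execute it with sampled nuisance parameters to produce synthetic tokens.

```python
import random

def one_shot_tokens(exemplar, parse, score, sample_nuisance, render, n_tokens=50, rng=None):
    """Infer psi from a single exemplar, then generate stochastic re-renderings.

    parse(exemplar)           -> candidate programs psi^[1..K]        (assumed parser)
    score(psi, exemplar)      -> log p(psi) + log p(exemplar | psi)   (assumed scorer)
    sample_nuisance(psi, rng) -> theta: pose, scale, noise            (assumed sampler)
    render(psi, theta)        -> a synthetic token image              (assumed renderer)
    """
    rng = rng or random.Random(0)
    psi = max(parse(exemplar), key=lambda p: score(p, exemplar))   # MAP parse of the exemplar
    tokens = [render(psi, sample_nuisance(psi, rng)) for _ in range(n_tokens)]
    return psi, tokens
```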

3. Program Transformation and Abstraction Discovery

BPL methods often include program transformation operators to discover abstractions and compress source programs. Bayesian program merging (Hwang et al., 2011) is one such approach, in which deterministic scaffolds fit to data are incrementally refactored through:

  • Abstraction via anti-unification: repeated subexpressions across examples are factored into new parameterized subroutines. This is achieved by searching for their most-specific generalization and rewriting with new function definitions (see the sketch after this list).
  • Deargumentation: function arguments are eliminated when their values can be captured by parametric stochastic expressions, introducing recursion or probabilistic generators as appropriate.
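As a concrete illustration of the anti-unification step, the toy sketch below computes the most-specific generalization of two expressions represented as nested tuples, replacing mismatching subterms with shared variables; it is a simplification, not the Hwang et al. implementation.

```python
def anti_unify(e1, e2, holes=None):
    """Most-specific generalization of two expressions written as nested tuples.

    Matching structure is kept; mismatching subterms become shared variables,
    which would be the parameters of a newly introduced subroutine.
    """
    if holes is None:
        holes = {}
    if e1 == e2:
        return e1
    if isinstance(e1, tuple) and isinstance(e2, tuple) and len(e1) == len(e2):
        return tuple(anti_unify(a, b, holes) for a, b in zip(e1, e2))
    # mismatch: reuse the variable already assigned to this pair, else mint a new one
    return holes.setdefault((e1, e2), f"?x{len(holes)}")

# Two stroke programs that differ only in their first primitive:
print(anti_unify(("seq", ("line", 1.0), ("arc", 0.5)),
                 ("seq", ("hook", 1.0), ("arc", 0.5))))
# -> ('seq', ('?x0', 1.0), ('arc', 0.5))
```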

Posterior probability is used to guide search:

$$\log P(M \mid D) = \sum_{t \in D} \log P(t \mid M) \;-\; \alpha\, \mathrm{size}(M) \;+\; \mathrm{const.}$$

where $M$ is the candidate program, $D$ the data, and the size term penalizes unnecessary complexity. This leads to succinct, generalizable programs capturing recurrent structural motifs in the data.
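A minimal sketch of this scoring rule and a single greedy refactoring step follows, assuming hypothetical `log_lik`, `size`, and transformation hooks:

```python
def log_posterior(model, data, log_lik, size, alpha=1.0):
    """log P(M | D) up to a constant: data fit minus a description-length penalty.

    log_lik(t, model) -> log P(t | M) for one example t   (assumed evaluator)
    size(model)       -> complexity of M, e.g. expression node count
    """
    return sum(log_lik(t, model) for t in data) - alpha * size(model)

def greedy_merge_step(model, data, transforms, log_lik, size, alpha=1.0):
    """One step of posterior-guided refactoring: apply whichever candidate
    transformation (abstraction, deargumentation, ...) most improves the score."""
    best, best_score = model, log_posterior(model, data, log_lik, size, alpha)
    for transform in transforms:
        candidate = transform(model)
        candidate_score = log_posterior(candidate, data, log_lik, size, alpha)
        if candidate_score > best_score:
            best, best_score = candidate, candidate_score
    return best
```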

4. Learning and Optimization in Hierarchical BPL Systems

Modern BPL systems, such as DreamCoder (Ellis et al., 2020), extend the classical paradigm by integrating wake-sleep cycles, neural proposal models, and hierarchical library discovery:

  • Wake phase: Inference is performed over real tasks, using a recognition model $q_\phi(p \mid D)$ to propose candidates and assemble beams of high-posterior programs.
  • Sleep–G phase (Generative/Abstraction learning): Successful solutions are refactored to discover reusable abstractions, expanding the DSL (domain-specific language) with new primitives. A minimum description length (MDL) criterion governs the tradeoff between DSL size and program parsimony:

$$\mathcal{L}^* = \arg\min_{\mathcal{L}} \left[ -\log P(\mathcal{L}) + \sum_i \min_{p \in \mathcal{R}(p^*_i)} \left\{ -\log P(p; \mathcal{L}) - \log P(D_i \mid p) \right\} \right]$$

  • Sleep–R phase (Recognition/Dreaming): The recognition model $q_\phi$ is trained on both “replay” pairs from the task beams and “fantasy” pairs sampled from the prior over programs and their data.

This cycle—alternately improving inference (recognition), model structure (abstractions in the DSL), and coverage of the generative process—enables transfer to new tasks and interpretable representation growth.
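A schematic of the wake-sleep loop is sketched below under assumed `library` and `recognition` interfaces; none of these method names come from the DreamCoder codebase.

```python
def wake_sleep(tasks, library, recognition, n_iters=10, beam_size=5):
    """Schematic DreamCoder-style loop over assumed interfaces.

    library.enumerate(task, guide) -> candidate programs, biased by the proposal guide
    library.score(p, task)         -> log P(p; L) + log P(D | p)
    library.compress(beams)        -> refactored library with new abstractions (sleep-G)
    library.sample_task()          -> a (fantasy task, generating program) pair
    recognition.propose(task)      -> proposal distribution q_phi(. | task)
    recognition.train(pairs)       -> fit q_phi on (task, program) pairs (sleep-R)
    """
    for _ in range(n_iters):
        # Wake: search for high-posterior programs on the real tasks
        beams = {t: sorted(library.enumerate(t, recognition.propose(t)),
                           key=lambda p: library.score(p, t),
                           reverse=True)[:beam_size]
                 for t in tasks}
        # Sleep-G: compress solutions into reusable abstractions (the MDL tradeoff above)
        library = library.compress(beams)
        # Sleep-R: train the recognition model on replays and fantasies
        replays = [(t, programs[0]) for t, programs in beams.items() if programs]
        fantasies = [library.sample_task() for _ in range(len(tasks))]
        recognition.train(replays + fantasies)
    return library, recognition
```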

5. Applications in Handwritten Text Recognition

BPL-driven pipelines achieve state-of-the-art results in segmentation-free handwritten text recognition under extreme data scarcity (Souibgui et al., 2021). The typical workflow comprises:

  • Parsing one-shot symbol exemplars into program representations $\{\psi_c\}$ for each class.
  • Generating diverse synthetic exemplars by executing $\psi_c$ multiple times with sampled nuisance parameters $\theta_{c,j}$.
  • Synthesizing complete line images by compositional concatenation, randomized spacing, rotation, and noise injection.
  • Training advanced sequence models (such as attention-based Seq2Seq or few-shot matching networks) on large BPL-augmented datasets.
  • Employing data augmentation (elastic distortion, thinning, contrast jitter) in tandem with BPL variability.

Quantitatively, few-shot fine-tuned recognition systems with BPL-generated data outperform baseline unsupervised and supervised methods on challenging cipher corpora, achieving a Symbol Error Rate (SER) of 0.20 under mixed real and synthetic training, a new state of the art for the Borg cipher (Souibgui et al., 2021).

6. Variants, Extensions, and Limitations

The original BPL (Lake et al., 2015) utilized hand-engineered grammars and full Bayesian inference for tasks such as handwritten character induction. Extensions such as DreamCoder automated DSL construction, leveraged amortized neural recognition for faster inference, and enabled scalable abstraction discovery across diverse domains, including functional programming and physics law induction (Ellis et al., 2020).

  • Classical BPL: Relies on fixed grammars and explicit Bayesian inference per new data case.
  • Bayesian Program Merging: Focuses on abstraction discovery and deargumentation given an ADT-aligned program space, without reliance on a hand-built grammar (Hwang et al., 2011).
  • DreamCoder: Integrates neural proposal, library learning, and wake-sleep iteration for flexible, multi-domain abstraction learning.

Limitations persist in the overhead of likelihood evaluation, the need for strongly structured, high-inductive-bias domains, and scalability to high-dimensional or weakly structured data, but the methodology remains central for domains where structured programmatic explanations are desired.

7. Methodological Summary and Takeaways

A summary algorithmic structure for BPL-based generation in low-resource settings (Souibgui et al., 2021), followed by a code sketch of the line-composition step:

  • Input: One-shot symbol images per class $\{I_c\}$.
  • Symbol Induction:
  1. Parse $I_c$ to obtain candidate programs $\psi^{[1]}, \ldots, \psi^{[K]}$.
  2. Select $\psi_c = \arg\max_i \left[ \log p(\psi^{[i]}) + \log p(I_c \mid \psi^{[i]}) \right]$.
  3. For $j = 1, \ldots, M$, sample $\theta_{c,j} \sim p(\theta \mid \psi_c)$ and render $I_{c,j}$.
  • Synthetic Line Generation:
  1. Choose exemplars $I_{c_t, j_t}$ for each position in the symbol sequence.
  2. Apply random spacing, rotation, and noise to compose the line image $L$.
  3. Augment as needed for downstream recognition.
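A minimal sketch of the line-composition step, under the assumption that `symbol_bank[c]` holds BPL-rendered glyph images for class `c`; rotation and the richer augmentations listed above are omitted for brevity, and all names here are illustrative.

```python
import numpy as np

def synthesize_line(symbol_bank, text, rng=None, max_space=4, max_jitter=2, noise=0.05):
    """Compose a synthetic text-line image from per-class BPL renderings.

    symbol_bank[c] -> list of glyph images I_{c,j} (2-D float arrays in [0, 1]),
                      assumed to come from executing psi_c with sampled nuisances
    text           -> sequence of class labels c_1 .. c_T
    """
    rng = rng or np.random.default_rng(0)
    glyphs = [symbol_bank[c][rng.integers(len(symbol_bank[c]))] for c in text]
    height = max(g.shape[0] for g in glyphs) + 2 * max_jitter
    columns = []
    for g in glyphs:
        canvas = np.zeros((height, g.shape[1]))
        top = rng.integers(0, height - g.shape[0] + 1)        # random vertical jitter
        canvas[top:top + g.shape[0], :] = g
        columns.append(canvas)
        columns.append(np.zeros((height, rng.integers(1, max_space + 1))))  # random spacing
    line = np.concatenate(columns, axis=1)
    return np.clip(line + rng.normal(0.0, noise, line.shape), 0.0, 1.0)     # noise injection
```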

BPL systematizes program induction: data incorporation, transformation search (abstraction, deargumentation), and posterior scoring constitute a general template for learning structured generative representations. With extensions integrating neural amortization and automatic abstraction discovery, BPL underpins modern, interpretable solutions for low-shot learning and program synthesis across a diverse range of tasks (Souibgui et al., 2021, Ellis et al., 2020, Hwang et al., 2011).
