
vPGM: Verbalized Probabilistic Graphical Modeling

Updated 22 March 2026
  • vPGM is a Bayesian prompting framework that verbalizes probabilistic graphical models, enabling latent variable reasoning using natural language.
  • The framework uses structured prompts to elicit latent dependencies and conditional probability distributions, resulting in improved confidence calibration and accuracy.
  • vPGM enhances interpretability by providing human-readable explanations of latent variables and their dependencies, supporting transparent, rigorous Bayesian inference.

Verbalized Probabilistic Graphical Modeling (vPGM) is a Bayesian prompting framework that enables LLMs to simulate the core principles of Probabilistic Graphical Models (PGMs) in natural language, without requiring expert-driven model design or specialized training. vPGM is guided by prompts that structure both the construction of latent variable dependencies and the verbalization of conditional distributions, facilitating interpretable latent-structure reasoning and rigorous Bayesian inference within LLMs. The framework achieves improved calibration of predictive confidence and competitive or enhanced accuracy in both closed-ended and open-ended reasoning tasks (Huang et al., 2024).

1. Bayesian Theoretical Foundations

vPGM is grounded in the canonical Bayesian graphical modeling paradigm, but diverges by externalizing the model’s structure and uncertainty articulation in natural language. The probabilistic backbone follows standard PGM formulations:

  • Latent Variable Prior: The joint prior is factorized over the n discovered latent variables as

p(\mathbf{Z}) \equiv \prod_i p(Z_i \mid \mathrm{Pa}(Z_i))

where \mathrm{Pa}(Z_i) are the “parent” variables determined by the induced Bayesian network.

  • Likelihood: Observational distributions conditioned on the latent structure are denoted p(\mathbf{X} \mid \mathbf{Z}), often decomposed per data modality.
  • Posterior: Inference proceeds with

p(\mathbf{Z} \mid \mathbf{X}) \propto p(\mathbf{Z})\, p(\mathbf{X} \mid \mathbf{Z})
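The factorized prior and posterior above can be made concrete with a minimal discrete example. The two-latent structure and all numbers below are illustrative, not taken from the paper; the posterior is computed by brute-force enumeration, which is feasible only for small discrete networks.

```python
# Minimal sketch of the PGM backbone vPGM verbalizes: a discrete
# Bayesian network posterior p(Z|X) ∝ p(Z) p(X|Z), via enumeration.
# All probabilities are illustrative placeholders.
from itertools import product

# Two binary latents; Z2's prior depends on its parent Z1.
p_z1 = {0: 0.5, 1: 0.5}                   # p(Z1)
p_z2_given_z1 = {0: {0: 0.8, 1: 0.2},     # p(Z2 | Z1)
                 1: {0: 0.3, 1: 0.7}}
p_x_given_z = {(0, 0): 0.1, (0, 1): 0.4,  # likelihood p(X = x_obs | Z1, Z2)
               (1, 0): 0.2, (1, 1): 0.9}  # for one fixed observation x_obs

# Unnormalized posterior p(Z) p(X|Z), then normalize over all Z.
joint = {}
for z1, z2 in product([0, 1], repeat=2):
    joint[(z1, z2)] = p_z1[z1] * p_z2_given_z1[z1][z2] * p_x_given_z[(z1, z2)]
total = sum(joint.values())
posterior = {z: v / total for z, v in joint.items()}
print(posterior)  # mass concentrates on (1, 1), the best-explaining config
```

vPGM replaces the numeric tables here with verbalized descriptions, but the inference target is this same normalized product.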

Distinctively, vPGM frames each conditional probability distribution (CPD) in English prose within the prompt, followed by explicit numeric probabilistic statements. For example,

1. Estimate P(Z₁|X): Analyze question context and retrieved knowledge. 
2. Estimate P(Z₂|Z₁,X): Examine image caption and OCR; use Z₁’s result. 
3. Estimate P(Y|Z₁,Z₂): Combine assessments to assign probabilities to each answer option.
The LLM then generates:
P(Z₁|X)=0.20 (low relevance of retrieved knowledge)
P(Z₂|Z₁,X)=0.20 (caption is irrelevant)
P(Y=A|Z₁,Z₂)=0.333, P(Y=B|Z₁,Z₂)=0.333, P(Y=C|Z₁,Z₂)=0.333
All latent dependencies and their verbalized CPDs are constructed transparently through prompted LLM reasoning (Huang et al., 2024).
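As a sketch, the verbalized estimates in the example above can be combined by the chain rule, p(Y, Z₁, Z₂ | X) = p(Z₁|X) p(Z₂|Z₁,X) p(Y|Z₁,Z₂), and then marginalized. Treating each latent as binary and reading the verbalized scalar as P(Z=1 | parents) is an assumption made here for illustration.

```python
# Sketch: combine the verbalized estimates via the chain rule, then
# marginalize out the latents. Binary latents and the reading of each
# verbalized scalar as P(Z=1 | parents) are illustrative assumptions.

p_z1 = 0.20                  # verbalized P(Z1=1|X): low knowledge relevance
p_z2 = {0: 0.20, 1: 0.20}    # verbalized P(Z2=1|Z1,X): caption irrelevant

def p_y(z1, z2):
    """Verbalized P(Y | Z1, Z2): uniform here, as in the worked example."""
    return {"A": 1 / 3, "B": 1 / 3, "C": 1 / 3}

marginal = {"A": 0.0, "B": 0.0, "C": 0.0}
for z1 in (0, 1):
    for z2 in (0, 1):
        # weight of this latent configuration under the chain rule
        w = (p_z1 if z1 else 1 - p_z1) * (p_z2[z1] if z2 else 1 - p_z2[z1])
        for y, p in p_y(z1, z2).items():
            marginal[y] += w * p
print(marginal)  # each option ≈ 0.333: the evidence is uninformative
```

The uniform result mirrors the example: when both latents signal irrelevant evidence, the combined answer distribution stays flat rather than overconfident.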

2. Prompting Methodology and Workflow

vPGM operationalizes PGM simulation through structured prompt templates, coupling PGM-discovery, dependency elicitation, and natural-language inference:

  • PGM Discovery Prompt: The model is first asked to identify a small set of latent variables (Z_1, \ldots, Z_n, with n \leq 4) relevant to the task. Each latent variable is explained in natural language, e.g., “Z₁: captures the relevance of external knowledge.”
  • Dependency Elicitation: A follow-up prompt requests the enumeration of directed edges, e.g., X \to Z_1, Z_1 \to Z_3, encoding the structure of a verbalized Bayesian network.
  • Verbalized CPDs: Each factor p(Z_i \mid \mathrm{Pa}(Z_i)) is described textually rather than assigned a parametric distribution, i.e., the prompt specifies in English the conditions under which probabilities should be higher or lower.
  • Inference Prompt (Multi-Latent Example): For n = 4,
    1. Calculate P(Z₁|X): Analyze relevance of image caption, OCR, text.
    2. Calculate P(Z₂|X): Assess external knowledge quality.
    3. Calculate P(Z₃|Z₁,Z₂,X): Judge question clarity given Z₁,Z₂.
    4. Calculate P(Z₄|Z₂,Z₃,X): Perform logical reasoning.
    5. Calculate P(Y|Z₄): Combine to get answer distribution.
  • End-to-End Chain-of-Thought: The LLM explicates its beliefs for each latent (e.g., “external knowledge is missing, so P(Z₁|X)=0.20”) and samples the latent configuration M times. Marginalization is then carried out by averaging over samples:

P(Y = y \mid X) \approx \frac{1}{M} \sum_{m=1}^{M} P(Y = y \mid Z^{(m)})

  • Output Format: Produces both a final answer and its calibrated confidence, e.g., “Final Answer: B. Marginalized P(Y=B|X)=0.522.”

This approach makes explicit the otherwise opaque intermediate beliefs of LLMs, supporting transparent, structured decision-making (Huang et al., 2024).
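The sample-and-average step of this workflow can be sketched as follows. A seeded random stub stands in for the LLM's verbalized beliefs (the probabilities in `sample_latents` and `p_y_given_z` are hypothetical, not from the paper); in the actual framework each call would be a prompted chain-of-thought.

```python
# Sketch of the M-sample marginalization step. The two stub functions
# stand in for prompted LLM calls; their numbers are hypothetical.
import random

def sample_latents(rng):
    """Stand-in for one prompted pass: sample a latent configuration
    Z^(m) from the verbalized conditionals P(Z_i | parents, X)."""
    z1 = rng.random() < 0.20                 # verbalized P(Z1=1|X) = 0.20
    z2 = rng.random() < (0.7 if z1 else 0.2)  # verbalized P(Z2=1|Z1,X)
    return z1, z2

def p_y_given_z(z1, z2):
    """Stand-in for the verbalized answer distribution P(Y | Z)."""
    if z1 and z2:
        return {"A": 0.1, "B": 0.8, "C": 0.1}  # evidence favors B
    return {"A": 1 / 3, "B": 1 / 3, "C": 1 / 3}  # otherwise uninformative

rng = random.Random(0)  # seeded for reproducibility
M = 1000
marginal = {"A": 0.0, "B": 0.0, "C": 0.0}
for _ in range(M):
    z = sample_latents(rng)
    for y, p in p_y_given_z(*z).items():
        marginal[y] += p / M  # P(Y=y|X) ≈ (1/M) Σ_m P(Y=y|Z^(m))
print(max(marginal, key=marginal.get), marginal)
```

Averaging the per-sample distributions, rather than taking a single pass, is what yields a calibrated confidence for the final answer.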

3. Inference, Calibration, and Belief Updates

vPGM’s inferential fidelity and uncertainty quantification are realized through repeated sampling and explicit calibration measurement:

  • Confidence Calibration: vPGM reports model confidence alongside predictions, enabling empirical calibration analysis. Expected Calibration Error (ECE) is computed over M bins as

\mathrm{ECE} = \sum_{m=1}^{M} \frac{|B_m|}{N} \left| \mathrm{acc}(B_m) - \mathrm{conf}(B_m) \right|

where \mathrm{acc}(B_m) and \mathrm{conf}(B_m) are the empirical accuracy and the average confidence in bin m, respectively.

  • Posterior Expectation: When closed-form conditional probabilities are unavailable, vPGM approximates

\mathbb{E}_{p(Z \mid X)}[\, p(Y \mid Z) \,] \approx \sum_{Z} p(Y \mid Z)\, p(Z \mid X)

  • Automated Belief Update: vPGM defines a pseudocode-based belief update loop:

    For m in 1..M:
        Prompt LLM to compute {P(Z_i | Pa(Z_i), X)} for all latents i
        Prompt LLM to compute {P(Y=y | Z)} for each answer y
        Record sample m: P^{(m)}(Y)
    End
    Return P̂(Y=y) = (1/M) * Σ_m P^{(m)}(Y=y)

This systematic sampling and marginalization of verbalized beliefs provides quantitative uncertainty estimates and supports robust reasoning in settings with ambiguous or incomplete information (Huang et al., 2024).
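The ECE metric defined in this section can be computed with a short binning routine; the following is a minimal pure-Python sketch using the standard (lo, hi] confidence bins.

```python
# Sketch of the ECE computation: bin predictions by confidence and
# compare each bin's accuracy with its average confidence.

def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE = sum_m |B_m|/N * |acc(B_m) - conf(B_m)| over (lo, hi] bins."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for c, ok in zip(confidences, correct):
        # place confidence c in bin (lo, hi]; clamp edge cases into range
        idx = min(n_bins - 1, max(0, int(c * n_bins - 1e-12)))
        bins[idx].append((c, ok))
    ece = 0.0
    for b in bins:
        if b:
            conf = sum(c for c, _ in b) / len(b)   # conf(B_m)
            acc = sum(ok for _, ok in b) / len(b)  # acc(B_m)
            ece += len(b) / n * abs(acc - conf)    # weighted by |B_m|/N
    return ece

# Toy check: a uniformly overconfident predictor (conf 0.9, accuracy 0.5)
print(round(expected_calibration_error([0.9] * 10, [1] * 5 + [0] * 5), 3))
```

A well-calibrated model, by contrast, drives every per-bin gap toward zero, which is what the near-diagonal reliability curves reported for vPGM reflect.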

4. Experimental Evaluation

The efficacy of vPGM is demonstrated across both closed- and open-ended reasoning tasks, with benchmarking against strong baselines.

  • Tasks and Datasets:
    • ScienceQA (closed-ended, multi-modal): 4,241 questions requiring multi-hop reasoning over textual and multimodal evidence.
    • ChatCoach (open-ended, medical-dialogue): 3,500 multi-turn dialogue turns with detection and correction subtasks.
  • Metrics:
    • ScienceQA: Accuracy (% correct), ECE (%).
    • ChatCoach: BLEU-2, ROUGE-L, BERTScore (for both detection and correction), and human ratings (PGM alignment, interpretability, clarity, constructiveness, overall).
  • Key Results:
| Method | ScienceQA Acc | ScienceQA ECE | ChatCoach Det. BLEU-2 | ChatCoach Corr. BERTScore |
|---|---|---|---|---|
| CoT | 81.39 | 19.69 | – | – |
| Chameleon | 81.25 | 11.80 | – | – |
| Chameleon+Self-Random+AvgConf | 79.68 | 10.32 | – | – |
| vPGM (Ours) | 82.01 | 3.58 | 37.2 | 68.3 |
  • The vPGM framework achieved the highest accuracy and the lowest ECE in ScienceQA, and the highest detection BLEU-2 and correction BERTScore in ChatCoach, surpassing all baselines. Reliability diagrams showed vPGM calibration curves closely tracking the ideal diagonal (Huang et al., 2024).

5. Strengths, Limitations, and Prospective Directions

  • Strengths:
    • vPGM markedly improves confidence calibration, reducing ECE from approximately 19% (CoT) to 3.6%.
    • Explicit latent-structure reasoning enables detection of missing or irrelevant evidence and adjusts predictive confidence accordingly.
    • Interpretability is enhanced through human-readable latent-variable explanations and natural-language descriptions of CPDs.
    • The framework is training-free and does not require gradient-based optimization, instead relying on LLM capabilities steered by prompts.
  • Limitations and Open Questions:
    • Prompt Engineering Sensitivity: The approach is reliant on careful design of PGM-discovery and inference prompts.
    • Latent Count Trade-off: Increasing the latent variable count can further improve calibration (lowest ECE for n = 4) but may result in slight declines in raw accuracy compared to more compact latent structures (e.g., n = 2). A plausible implication is that latent granularity and calibration are in tension in high-dimensional prompt settings.
    • Prompt Automation: Current methodology necessitates manual prompt crafting; future research could investigate meta-learning or automated optimization for prompt template design.

6. Significance and Synthesis

vPGM demonstrates that “verbalizing” the structure and conditional relations of a Bayesian network—discovering graphical components and edges via language, explicitly rendering CPDs in prose, and guiding an LLM through a chain-of-thought corresponding to p(\mathbf{Z} \mid \mathbf{X}) \propto p(\mathbf{Z})\, p(\mathbf{X} \mid \mathbf{Z})—yields superior confidence calibration and competitive or improved accuracy in both closed- and open-ended tasks. These results suggest that integrating probabilistic reasoning principles directly into prompt engineering for LLMs circumvents the need for expert-driven or data-intensive PGM specification, offering a highly interpretable and robust alternative for tasks characterized by ambiguous or compositional inference requirements (Huang et al., 2024).
