
Conditional PMI (C-PMI): A Concise Overview

Updated 18 February 2026
  • Conditional PMI (C-PMI) is a statistical measure that extends traditional PMI by conditioning on external variables to capture context-specific dependencies.
  • It leverages conditional log-probabilities from language or vision-language models to compute nuanced associations, employing efficient token-wise strategies for rapid evaluation.
  • Its applications span dialogue modeling, faithfulness evaluation, and bias analysis, offering interpretable metrics that improve generation quality and diagnostic insights.

Conditional Pointwise Mutual Information (C-PMI) is a fine-grained statistical measure for quantifying the dependence between discrete random variables, conditioned on observed context. It generalizes pointwise mutual information (PMI) to account for an explicit conditioning variable, capturing the degree to which the co-occurrence or causal structure between variables is modulated by external information. Modern research applies C-PMI in dialogue modeling, faithfulness metrics, language modeling, interpretability, and cross-modal hallucination mitigation, providing explicit measurements of context-sensitive dependencies in both generation and evaluation tasks.

1. Mathematical Definitions and Formulations

C-PMI extends classical PMI by evaluating the association between two variables, $A$ and $B$, conditioned on a third variable $C$:

$$\mathrm{C\text{-}PMI}(A; B \mid C) = \log \frac{P(A, B \mid C)}{P(A \mid C)\, P(B \mid C)}$$

or, equivalently,

$$\mathrm{C\text{-}PMI}(A; B \mid C) = \log P(A \mid B, C) - \log P(A \mid C).$$

This formulation quantifies the additional information $B$ provides about $A$ when $C$ is known. Derivatives and variants of this definition appear across domains. For example, in document-grounded dialogue generation, $r$ denotes the generated response, $d$ the grounding document, and $h$ the dialogue history, yielding:

$$\mathrm{C\text{-}PMI}(r; d \mid h) = \log \frac{P(r, d \mid h)}{P(r \mid h)\, P(d \mid h)} = \log P(r \mid d, h) - \log P(r \mid h).$$

This form is directly computable from the log-likelihoods of pretrained LLMs applied under different prompt conditions (Nandwani et al., 2023). In turn-level evaluation, three-way instantiations generalize C-PMI as:

$$\mathrm{C\text{-}PMI}(r, x \mid h) = \log \frac{p(r, x, h)\, p(h)}{p(r, h)\, p(x, h)},$$

where $r$ is the preceding context, $x$ the response, and $h$ encodes an evaluation hypothesis (Ren et al., 2023). In cross-modal generation, variants condition on both text and vision (e.g., LVLMs) (Fang et al., 26 May 2025).
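The definition above can be checked numerically on a toy discrete distribution. The helper below is an illustrative sketch (not from any of the cited papers) that computes C-PMI directly from a tabulated joint distribution; by construction it returns zero when $A$ and $B$ are conditionally independent given $C$.

```python
import math

def c_pmi(p_joint, a, b, c):
    """C-PMI(a; b | c) from a tabulated joint distribution.

    p_joint maps (a, b, c) triples to probabilities. Implements
    C-PMI(A; B | C) = log [ P(A, B | C) / (P(A | C) P(B | C)) ].
    """
    # Marginal P(C = c), used to normalize all conditionals.
    p_c = sum(p for (_, _, cc), p in p_joint.items() if cc == c)
    p_ab_c = p_joint.get((a, b, c), 0.0) / p_c
    p_a_c = sum(p for (aa, _, cc), p in p_joint.items()
                if aa == a and cc == c) / p_c
    p_b_c = sum(p for (_, bb, cc), p in p_joint.items()
                if bb == b and cc == c) / p_c
    return math.log(p_ab_c / (p_a_c * p_b_c))
```

With a conditionally independent table (all four cells equal to 0.25 given $c=0$) the score is 0; skewing mass toward the diagonal makes it positive.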

2. Implementation Strategies and Computational Details

C-PMI metrics rely on conditional (and joint) log-probabilities estimated by LLMs or multimodal models. In document-grounded response generation, probabilities such as $P(r \mid d, h)$ and $P(r \mid h)$ are obtained by running the same model twice under different conditioning; for sequence models, the probabilities factorize autoregressively:

$$P(r \mid d, h) = \prod_{t=1}^{T} P(r_t \mid d, h, r_{<t}), \qquad P(r \mid h) = \prod_{t=1}^{T} P(r_t \mid h, r_{<t}).$$

For dialogue evaluation, C-PMI can be approximated as:

$$\mathrm{C\text{-}PMI}(c, r \mid h) = LL(c \oplus r \oplus h) + LL(h) - LL(c \oplus h) - LL(r \oplus h),$$

where $LL(\cdot)$ is the normalized log-likelihood and $\oplus$ denotes text concatenation (Ren et al., 2023). For corpus-level bias analysis, C-PMI is computed as simple log-ratios of co-occurrence probabilities derived from observed counts, and is often approximated by the log-odds ratio under the assumption of small probabilities (Valentini et al., 2021). For time-critical decoding, token-wise C-PMI is calculated on the fly by considering only the top-$p$ tokens by likelihood and re-ranking them by their contribution to C-PMI at the next decoding step (Nandwani et al., 2023, Fang et al., 26 May 2025). No explicit out-of-model smoothing is required with contemporary LLMs, though calibration and model quality are essential for metric reliability.
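The autoregressive factorization above implies that sequence-level C-PMI is simply a sum of per-token log-probability differences between the document-conditioned and unconditioned prompts. A minimal sketch, with the two LM scoring passes abstracted as hypothetical callbacks:

```python
import math

def sequence_c_pmi(response_tokens, logp_with_doc, logp_without_doc):
    """Token-level C-PMI(r; d | h) as a sum of log-prob differences.

    logp_with_doc(tok, prefix)    ~ log P(r_t | d, h, r_<t)
    logp_without_doc(tok, prefix) ~ log P(r_t | h, r_<t)
    Both callbacks are stand-ins for two forward passes of the
    same LM, one prompt including the document and one without.
    """
    total = 0.0
    for i, tok in enumerate(response_tokens):
        prefix = response_tokens[:i]
        total += logp_with_doc(tok, prefix) - logp_without_doc(tok, prefix)
    return total
```

If every response token is twice as likely under the document-conditioned prompt, the score is $T \log 2$ for a $T$-token response.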

3. C-PMI in Evaluation Metrics and Faithfulness Measurement

C-PMI has become foundational for designing faithful automatic evaluation metrics in dialogue and generation:

  • PMI-Faith: Defined as $\mathrm{C\text{-}PMI}(r; d \mid h)$ and used to automate faithfulness assessment of generated responses against their grounding documents (Nandwani et al., 2023). Binary classification is obtained by normalizing C-PMI scores, selecting a threshold on a dev set, and applying it to test data, outperforming prior metrics (accuracy: 0.834, F1: 0.697).
  • Turn-level Dialogue Evaluation: Embedding C-PMI in turn-level metrics by conditioning on conversational hypotheses (e.g., "That makes sense.") yields strong correlation with human judgments (relative 62.6% gain in average Spearman correlation over NLL for FED) (Ren et al., 2023).
  • Bias Metrics: C-PMI underpins corpus-level lexical bias analyses by providing transparent, statistically interpretable measures for quantifying conditional word associations and their statistical significance (e.g., female context/nurse bias of $1.32$ in Wikipedia) (Valentini et al., 2021).
  • Conditional Informative Metrics: In dialogue, pointwise conditional mutual information (e.g., $\mathrm{PCMI}_h$) disambiguates genericity from specificity by isolating the unique signal that additional context provides (Paranjape et al., 2021).
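The dev-set thresholding step described for PMI-Faith can be sketched as a simple accuracy-maximizing sweep over candidate cutoffs. The function below is an illustrative stand-in, not the paper's exact recipe:

```python
def pick_threshold(dev_scores, dev_labels):
    """Choose a C-PMI cutoff maximizing dev-set accuracy.

    dev_scores: per-example C-PMI values.
    dev_labels: 1 = faithful, 0 = unfaithful.
    Sweeps every observed score as a candidate threshold and
    keeps the one classifying (score >= t) most accurately.
    """
    best_t, best_acc = None, -1.0
    for t in sorted(set(dev_scores)):
        acc = sum((s >= t) == bool(y)
                  for s, y in zip(dev_scores, dev_labels)) / len(dev_labels)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t
```

The chosen threshold is then frozen and applied unchanged to the test split, as in the evaluation protocol described above.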

4. Decoding and Generation with C-PMI-Augmented Objectives

Modern decoding strategies integrate C-PMI directly into generation objectives to bias output towards contextually grounded content:

  • PMI-D Decoding: The generation objective optimizes

$$\hat{r} = \arg\max_r \left\{ \log P(r \mid d, h) + \alpha \cdot \mathrm{C\text{-}PMI}(r; d \mid h) \right\}$$

with $\alpha$ a tunable hyperparameter. The corresponding stepwise score for a candidate token $v$ is

$$\underbrace{\log P(v \mid d, h, r_{<t})}_{\text{Likelihood}} + \alpha \cdot \left[ \log P(v \mid d, h, r_{<t}) - \log P(v \mid h, r_{<t}) \right].$$

This method increases faithfulness without notable degradation in fluency or relevance (Nandwani et al., 2023).

  • Vision-language models: Methods such as CMI-VLD for LVLMs alternate text-token selection (maximizing C-PMI) with dynamic image-token refinement to maintain cross-modal dependency at every generation step. Token-purification mechanisms employ Gumbel-Softmax for differentiable token masking and joint maximization (Fang et al., 26 May 2025).
  • Dialogue Response Selection: Fused-PCMI selects candidates by maximizing a trade-off between informativeness (PMI) and acknowledgement ($\mathrm{PCMI}_h$), operationalized as $g^* = \arg\max_i \left[ \mathrm{PMI}_i + \lambda\, \mathrm{PCMI}_{h,i} \right]$ with empirical thresholds (Paranjape et al., 2021).
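The stepwise PMI-D objective above can be sketched as a re-ranking of the top-$p$ nucleus at each decoding step. In the sketch below, `cond_logprobs` and `uncond_logprobs` are hypothetical per-token log-probability tables from the two prompt conditions (with and without the grounding document); this is a minimal illustration, not the authors' released implementation.

```python
import math

def pmi_d_step(cond_logprobs, uncond_logprobs, alpha=1.0, top_p=0.9):
    """One PMI-D decoding step: pick the nucleus token maximizing
    log P(v|d,h,r_<t) + alpha * [log P(v|d,h,r_<t) - log P(v|h,r_<t)].

    cond_logprobs / uncond_logprobs: dicts token -> log-prob under
    the document-conditioned and history-only prompts.
    """
    # Restrict to the top-p nucleus under the conditioned distribution,
    # so the PMI term only re-ranks already-plausible tokens.
    ranked = sorted(cond_logprobs.items(), key=lambda kv: kv[1], reverse=True)
    nucleus, mass = [], 0.0
    for tok, lp in ranked:
        nucleus.append(tok)
        mass += math.exp(lp)
        if mass >= top_p:
            break

    def score(tok):
        lp = cond_logprobs[tok]
        return lp + alpha * (lp - uncond_logprobs[tok])

    return max(nucleus, key=score)
```

A token that is likely only because of the document (large gap between the two log-probs) can overtake a generically likely token, which is exactly the faithfulness bias the objective is designed to induce.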

5. Comparisons, Interpretability, and Statistical Properties

C-PMI offers several interpretability and methodological advantages:

  • Transparency and Statistical Inference: C-PMI metrics are rooted in observable first-order co-occurrence (or likelihood) statistics. They enable direct connection to 2×2 contingency tables and thus classical confidence intervals, standard error estimation, and p-values for hypothesis testing (Valentini et al., 2021). For example, log-odds ratios and associated confidence intervals can be computed analytically, whereas embedding-based bias metrics lack such end-to-end transparency.
  • Contrast with Black-Box Methods: Embedding-based measures (e.g., cosine distances in SGNS or GloVe space) capture higher-order dependencies but do not yield explicit, interpretable statistics for the association between variables. C-PMI maintains a direct connection to counts or model-derived probabilities.
  • Limitations: C-PMI requires high-quality conditional probability estimates; in practice, this is contingent on the calibration of underlying language or vision-LLMs. Computational cost scales with the need for multiple probability queries.
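The connection to 2×2 contingency tables can be made concrete: under the small-probability approximation mentioned above, the log-odds ratio serves as a C-PMI estimate with an analytic standard error and confidence interval. A minimal sketch, assuming all four cell counts are nonzero:

```python
import math

def log_odds_ratio_ci(n11, n10, n01, n00, z=1.96):
    """Log-odds ratio and Wald confidence interval from a 2x2 table.

    n11: word and context co-occur   n10: word without context
    n01: context without word        n00: neither
    Under small probabilities this approximates count-based C-PMI
    and, unlike embedding-based bias scores, comes with a closed-form
    standard error: sqrt(1/n11 + 1/n10 + 1/n01 + 1/n00).
    """
    lor = math.log((n11 * n00) / (n10 * n01))
    se = math.sqrt(1 / n11 + 1 / n10 + 1 / n01 + 1 / n00)
    return lor, (lor - z * se, lor + z * se)
```

A perfectly balanced table gives a log-odds ratio of 0 with a symmetric interval around it, so "no association" is directly testable rather than eyeballed.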

6. Empirical Outcomes and Applications

C-PMI has demonstrated empirical advantages:

  • In document-grounded dialogue, PMI-Faith significantly exceeds baselines and prior art in faithfulness classification and aligns more closely with human evaluators (Nandwani et al., 2023).
  • For dialogue turn-level evaluation, C-PMI metrics raise automated–human judgment correlation across multiple conversational quality dimensions (Ren et al., 2023).
  • In vision-language generation, C-PMI-calibrated decoding (CMI-VLD) reduces hallucination by up to 7% relative on metrics such as CHAIR and increases accuracy/F1 on adversarial POPE splits, with detailed hyperparameter studies (Fang et al., 26 May 2025).
  • For bias measurement, C-PMI provides robust, statistically analyzable quantification of textual bias that correlates well with extrinsic occupation/gender statistics (Valentini et al., 2021).
  • In dialogue augmentation, Fused-PCMI shifts response selection towards improved human-perceived acknowledgement quality (74% raters preferred) without sacrificing informativeness (Paranjape et al., 2021).
  • For language modeling, C-PMI underlies negative sampling approaches that, with softmax recalibration at inference, match or surpass the perplexity of noise-contrastive estimation methods (Melamud et al., 2017).

7. Extensions, Limitations, and Prospects

Notable strengths of C-PMI include:

  • Model-agnosticism and lack of reliance on references or supervised data (in dialogue evaluation).
  • Compatibility with both generation and evaluation modules.
  • Linearity and ease of interpretability at both global (corpora) and local (token or turn-level) scales.

Limitations include:

  • Computational overhead from repeated probability queries (mitigated via batching or caching).
  • Dependence on LM calibration and coverage—model biases propagate into C-PMI estimates.
  • The need for well-chosen dimensions or hypotheses for conditional evaluation; for dialogue, coverage is tied to prewritten templates.

Current research directions propose integrating C-PMI directly into training objectives (not just post hoc evaluation or decoding), extending to multi-turn and dialogue-level aggregations, cross-modal alignment in LVLMs, and algorithmic optimizations for inference speed.

C-PMI thus serves as a conceptually rigorous and practically effective tool at the intersection of natural language understanding, text generation, interpretability, and algorithmic alignment, systematically quantifying the conditional dependencies that underwrite faithfulness, specificity, and bias in both unimodal and multimodal AI systems (Nandwani et al., 2023, Valentini et al., 2021, Ren et al., 2023, Fang et al., 26 May 2025, Paranjape et al., 2021, Melamud et al., 2017).
