
Probabilistic Context-Sensitive Grammars

Updated 3 March 2026
  • Probabilistic Context-Sensitive Grammars (PCSGs) are defined by integrating context into production rules, overcoming the independence assumptions of PCFGs.
  • PCSGs combine context-free and context-sensitive rule probabilities using a parameter q to balance traditional and contextual influences, with measurable effects via mutual information.
  • The PC-LCFRS subclass enables efficient parsing of discontinuous structures, demonstrating practical relevance in modeling complex language phenomena.

Probabilistic Context-Sensitive Grammars (PCSGs) generalize the probabilistic context-free grammar (PCFG) framework by modeling distributions over trees in which the distribution of a subtree can depend on the context in which its root appears. This extension overcomes the fundamental limitation of PCFGs—that the expansion of a nonterminal is independent of its neighbors—and allows PCSGs to capture a broader class of structural dependencies relevant to natural language and other structured phenomena. Formal analyses demonstrate that while the marginal distributions over symbols in PCSGs change continuously with the degree of context-sensitivity, context-induced correlations and independence-breaking effects arise, measurable via mutual information and novel tree-structured metrics not present in PCFGs (Nakaishi et al., 2024). Within the PCSG hierarchy, mildly context-sensitive systems such as probabilistic linear context-free rewriting systems (PC-LCFRS) are of particular practical interest, enabling efficient parsing and parameter learning while extending expressiveness beyond PCFGs (Yang et al., 2022).

1. Formal Definition and Parameterization

A PCSG is defined by a grammar $G=(N,\Sigma,R,S)$, where:

  • $N$ is a finite nonterminal set (e.g., $N=\{0,1\}$ in (Nakaishi et al., 2024))
  • $\Sigma$ is the set of terminal symbols (possibly $\emptyset$ in simplified models)
  • $S\in N$ is the start symbol
  • $R$ contains context-free (CF) and context-sensitive (CS) production rules:
    • CF-rules: $A \to B\,C$, with $A,B,C\in N$
    • CS-rules: $L\,A\,R \to L\,B\,C\,R$, where $L,R \in N\cup\{\lambda\}$ and $B,C\in N$

The probability of applying a rule is governed by two families of nonnegative weights, $M^\text{CF}_{A\to BC}$ and $M^\text{CS}_{LAR\to LBCR}$, and a context-sensitivity parameter $q\in[0,1]$. For a rewriting of symbol $A$ in context $(L,R)$, the probabilities are:

  • $P_\text{CF}(A\to BC) = (1-q)\,M^\text{CF}_{A\to BC}$
  • $P_\text{CS}(L\,A\,R\to L\,B\,C\,R) = q\,M^\text{CS}_{LAR\to LBCR}$

where $\sum_{B,C} M^\text{CF}_{A\to BC} = 1$ for each $A$ and $\sum_{B,C} M^\text{CS}_{LAR\to LBCR} = 1$ for each $(L,A,R)$ configuration.

Setting $q = 0$ recovers a standard PCFG; with $q > 0$, genuinely context-sensitive interactions are introduced (Nakaishi et al., 2024).
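As a minimal sketch, this parameterization can be implemented with two normalized weight tables. Only the $(1-q)/q$ mixture structure follows the text; the specific tables below are illustrative assumptions.

```python
# Illustrative weight tables; the (1-q)/q mixture follows the text, but the
# numeric entries are assumptions for the sake of a runnable example.
M_CF = {
    0: {(0, 1): 0.5, (1, 0): 0.5},   # rows sum to 1 for each A
    1: {(1, 0): 0.5, (0, 1): 0.5},
}
M_CS = {
    # context (L=0, A=1, R=0): the rewrite 0 1 0 -> 0 0 0 0 gets all CS mass;
    # remaining contexts would be filled in analogously (partial table here)
    (0, 1, 0): {(0, 0): 1.0},
}

def rewrite_prob(q, A, BC, L=None, R=None):
    """Total probability of rewriting A -> B C in context (L, R):
    (1 - q) * M_CF[A -> BC]  +  q * M_CS[L A R -> L B C R]."""
    p = (1.0 - q) * M_CF[A].get(BC, 0.0)
    p += q * M_CS.get((L, A, R), {}).get(BC, 0.0)
    return p

assert rewrite_prob(0.0, 0, (0, 1)) == 0.5           # q = 0: pure PCFG
assert rewrite_prob(1.0, 1, (0, 0), L=0, R=0) == 1.0 # q = 1: CS rule dominates
```

At intermediate $q$ the two contributions simply add, so the same rewrite can receive mass from both the CF table and the context-matched CS table.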

2. Generative Process and Derivation Probabilities

The PCSG generative process proceeds as follows:

  • Initialize with a root node, $A_0 \sim \text{Uniform}(N)$.
  • For $D$ levels, maintain a current frontier of nonterminals. At each level, randomly permute the frontier and, for each position $i$, select a rewriting rule for $A_i$ (a CF-rule or a CS-rule), conditioned on its left and right neighbors $(L, R)$.
  • The frontier doubles at each level, yielding a binary tree of depth $D$.
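The level-wise process can be sketched as follows. The symbol set, the CF weights, and the single hard CS rule ($0\,1\,0 \to 0\,0\,0\,0$, echoing the example in Section 6) are illustrative assumptions, and the per-level random permutation of the update order is omitted for brevity.

```python
import random

# Toy PCSG sampler: binary nonterminals, uniform CF weights, one CS rule.
N = [0, 1]
M_CF = {0: {(0, 1): 0.5, (1, 0): 0.5}, 1: {(1, 0): 0.5, (0, 1): 0.5}}

def generate_frontier(D, q, seed=0):
    """Grow a depth-D binary tree level by level; return the final frontier."""
    rng = random.Random(seed)
    frontier = [rng.choice(N)]                 # root drawn uniformly from N
    for _ in range(D):
        padded = [None] + frontier + [None]    # lambda contexts at the edges
        nxt = []
        for i, A in enumerate(frontier):
            L, R = padded[i], padded[i + 2]
            if rng.random() < q and (L, A, R) == (0, 1, 0):
                nxt.extend([0, 0])             # CS rule 0 1 0 -> 0 0 0 0 fires
            else:                              # otherwise sample a CF rule
                pairs, w = zip(*M_CF[A].items())
                nxt.extend(rng.choices(pairs, weights=w)[0])
        frontier = nxt
    return frontier

frontier = generate_frontier(D=6, q=0.3)
assert len(frontier) == 2 ** 6                 # frontier doubles each level
```

Because every rewrite produces exactly two children, the frontier has size $2^D$ after $D$ levels, matching the binary-tree picture above.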

For a derivation tree $T$ in which rule $r_t$ is applied at step $t$ to node $A_t$ in context $(L_t, R_t)$, the generative probability is:

  • For a PCFG: $P_\text{PCFG}(T) = \prod_{t=1}^{|T|} P(r_t \mid A_t)$
  • For a PCSG: $P_\text{PCSG}(T) = \prod_{t=1}^{|T|} P(r_t \mid A_t, L_t, R_t)$

This dependency on local context is the principal distinction from PCFGs (Nakaishi et al., 2024).
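The product formula can be sketched directly; the probability lookup below is a hypothetical toy table standing in for a trained rule table, not the paper's parameters.

```python
# Derivation probability as a product over rewriting steps. Each step is a
# tuple (A, L, R, (B, C)); TOY_PROBS is an illustrative assumption.
TOY_PROBS = {
    # (L, A, R, (B, C)) -> probability; anything missing falls back to a
    # uniform CF default of 0.5
    (0, 1, 0, (0, 0)): 0.9,
}

def rule_prob(A, L, R, BC, cf_default=0.5):
    """P(r | A, L, R): context-sensitive if the context is in the table."""
    return TOY_PROBS.get((L, A, R, BC), cf_default)

def derivation_prob(steps):
    """P_PCSG(T) = product over t of P(r_t | A_t, L_t, R_t)."""
    p = 1.0
    for A, L, R, BC in steps:
        p *= rule_prob(A, L, R, BC)
    return p

steps = [(0, None, None, (0, 1)),   # root: 0 -> 0 1 (lambda contexts)
         (1, 0, None, (1, 0)),      # right child in context (0, lambda)
         (0, None, 1, (0, 1))]      # left child in context (lambda, 1)
assert abs(derivation_prob(steps) - 0.125) < 1e-12   # 0.5 ** 3
```

Setting every context-sensitive entry aside (so all steps use the default) recovers the PCFG product $\prod_t P(r_t \mid A_t)$.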

3. Marginal and Correlational Properties

The single-node marginal distribution for node $i$ and symbol $A$ is $\pi_{A,i}(q,M) = \mathbb{E}_{q,M}[\mathbf{1}_{\sigma_i = A}]$. In PCFGs ($q=0$), $\pi_{A,i}$ converges exponentially with depth to a unique fixed point $\pi^*$ due to the Markov property. In PCSGs ($q>0$), $\pi_{A,i}(q,M)$ remains an analytic function of $q$ for any finite tree, showing smooth dependence on context-sensitivity without qualitative phase transitions in the marginals.

A novel finding is that simple marginal statistics do not capture the qualitative effects of context-sensitivity. Instead, mutual information and independence-breaking metrics between nodes exhibit distinct behaviors in PCSGs that do not arise in PCFGs (Nakaishi et al., 2024).

4. Context-Induced Mutual Information and Independence Breaking

To characterize long-range dependencies, analyses of PCSGs employ measures such as:

  • Mutual Information: For two frontier nodes $i, j$:

$$I_{i,j}(q,M) = \sum_{x,y\in N} P(\sigma_i=x,\, \sigma_j=y)\, \log \frac{P(\sigma_i=x,\, \sigma_j=y)}{P(\sigma_i=x)\,P(\sigma_j=y)}$$

In PCFGs ($q=0$), $I_{i,j}$ decays exponentially in the tree-structural distance $d_{\text{struct}}(i,j)$. In PCSGs ($q>0$), due to context-sharing rules, a new effective distance $d_\text{eff}(i,j)$ that allows lateral (neighbor-to-neighbor) steps governs the decay:

$$I_{i,j}(q>0, M) \approx C \exp\!\left( -\frac{d_\text{eff}(i,j)}{\xi} \right)$$

  • Parent-Fixed Mutual Information ($J$ metric): Quantifies the breaking of context-free independence via:

$$J_{i,j;A,B}(q, M) = \sum_{k,l,m,n} P(\sigma_k, \sigma_l, \sigma_m, \sigma_n \mid \sigma_i=A,\, \sigma_j=B)\, \ln\!\left[\frac{P(\sigma_k, \sigma_l, \sigma_m, \sigma_n \mid A, B)}{P(\sigma_k,\sigma_l \mid A,B)\, P(\sigma_m,\sigma_n \mid A,B)}\right]$$

where $k,l$ are the children of $i$ and $m,n$ are the children of $j$. In the PCFG case, $J \equiv 0$; for $q>0$, $J > 0$ and decays exponentially in $d_\text{eff}$ (Nakaishi et al., 2024).

These metrics directly quantify the extent to which context sensitivity introduces interdependence between disparate regions of the derivation tree.
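Both quantities can be estimated from samples. A plug-in Monte Carlo estimator for $I_{i,j}$ might look as follows; the toy sampler is an assumption standing in for a full PCSG sampler, and any function returning a list of frontier symbols would work in its place.

```python
import math
import random
from collections import Counter

def sample_frontier(rng, D=5, q=0.5):
    """Toy stand-in for a PCSG sampler (assumption): binary symbols, uniform
    CF rules, and one hard CS rule 0 1 0 -> 0 0 0 0 firing with weight q."""
    frontier = [rng.choice([0, 1])]
    for _ in range(D):
        padded = [None] + frontier + [None]
        nxt = []
        for k, A in enumerate(frontier):
            L, R = padded[k], padded[k + 2]
            if rng.random() < q and (L, A, R) == (0, 1, 0):
                nxt.extend([0, 0])
            else:
                nxt.extend(rng.choice([(0, 1), (1, 0)]))
        frontier = nxt
    return frontier

def estimate_mi(i, j, n_samples=20000, seed=0):
    """Plug-in estimate of I_{i,j} = sum_{x,y} p(x,y) log[p(x,y)/(p(x)p(y))]."""
    rng = random.Random(seed)
    joint, pi, pj = Counter(), Counter(), Counter()
    for _ in range(n_samples):
        f = sample_frontier(rng)
        joint[(f[i], f[j])] += 1
        pi[f[i]] += 1
        pj[f[j]] += 1
    n = n_samples
    return sum((c / n) * math.log((c * n) / (pi[x] * pj[y]))
               for (x, y), c in joint.items())

I_near, I_far = estimate_mi(3, 4), estimate_mi(0, 31)
assert I_near >= 0.0 and I_far >= 0.0   # plug-in MI is nonnegative
```

The same sampling machinery extends to $J$: condition on the symbols at $i$ and $j$, collect the joint counts over their children, and take the analogous log-ratio sum.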

5. Comparison to Mildly Context-Sensitive Systems: PC-LCFRS

A key subclass of PCSGs is given by probabilistic linear context-free rewriting systems (PC-LCFRS). An LCFRS is a tuple $G=(\Sigma, N, V, S, R)$, where each nonterminal $A$ rewrites into $f(A)$ spans with rules of the form:

$$A(x_1, \ldots, x_f) \to (\alpha_1, \ldots, \alpha_f)$$

With an associated probability distribution $\{\theta_r : r \in R(A)\}$, a probabilistic LCFRS induces a distribution over derivation trees and their yields, generalizing PCFGs to discontinuous structures (Yang et al., 2022).
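To illustrate fan-out-2 rewriting, the classic non-context-free language $a^n b^n c^n d^n$ can be derived with a single fan-out-2 nonterminal. The encoding below is a minimal sketch, not the paper's notation.

```python
# LCFRS-2 sketch: nonterminal A has fan-out 2, i.e. it derives a *pair* of
# discontinuous spans (x1, x2). Hypothetical rule encoding:
#   A(x1, x2) -> (a x1 b, c x2 d)   recursive rule, wraps both spans
#   A         -> (eps, eps)         base case
#   S -> x1 x2                      start symbol concatenates A's two spans
def derive(n):
    """Apply the recursive rule n times, then concatenate the two spans."""
    x1, x2 = "", ""                       # base case: A -> (eps, eps)
    for _ in range(n):
        x1, x2 = "a" + x1 + "b", "c" + x2 + "d"
    return x1 + x2                        # start rule glues the spans

assert derive(2) == "aabbccdd"            # beyond context-free power
```

Each span stays contiguous while the pair is discontinuous in the final yield, which is exactly the mechanism that lets PC-LCFRS model discontinuous constituents.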

For binary, fan-out-2 PC-LCFRS (LCFRS-2):

  • Parsing complexity is $O(n^5)$ after discarding the $O(n^6)$-time rules, with minimal empirical loss in coverage,
  • Parameterization via tensor decomposition and neural embeddings enables scaling to large nonterminal sets,
  • Maximum-likelihood training is conducted via inside–outside algorithms adapted to rank-space implementations for efficiency,
  • In empirical applications, LCFRS-2 attains $\approx 87\%$ coverage of discontinuous constituents in German treebanks (Yang et al., 2022).

PC-LCFRS thus exemplifies a tractable, practical, mildly context-sensitive instantiation of the PCSG paradigm.

6. Concrete Examples and Practical Relevance

Concrete constructions elucidate the independence-breaking effect of context-sensitive rules. In the PCFG scenario ($q=0$), e.g., with $M^\text{CF}_{0\to 01} = M^\text{CF}_{1\to 10} = 1/2$, all mutual dependencies decay rapidly and $J = 0$. In a PCSG ($q=1$) with a high-weight CS rule (e.g., $0\,1\,0 \to 0\,0\,0\,0$), horizontal "channels" of dependence arise: distant nodes exchange information not only through their ancestral chain but also through their horizontal neighbors, so the decay rates of both the mutual information and $J$ are governed by $d_\text{eff}$ rather than pure graph distance (Nakaishi et al., 2024).

This property allows PCSGs to model phenomena such as discontinuous syntactic constructions in natural language, where context-free models are inadequate.

7. Expressivity, Complexity, and Implications

PCSGs strictly subsume PCFGs in expressivity by allowing the local context to determine production-rule probabilities, thereby breaking the factorization properties that hold in context-free models. While single-node marginals do not exhibit qualitative transitions, context-sensitive correlations measurable by $I$ and $J$ introduce a rich spectrum of behaviors, including new correlation lengths. PCSGs remain amenable to simulation and, in subclasses such as LCFRS-2, allow polynomial-time inference and learning.

A plausible implication is that metrics like parent-fixed mutual information serve as operational order parameters for the degree of context sensitivity present in tree-generating processes, providing avenues for both formal investigation and practical linguistic annotation (Nakaishi et al., 2024, Yang et al., 2022).
