
Positional Grammars: Formulations & Applications

Updated 15 January 2026
  • Positional grammars are formal systems that encode both the order and relative positions of elements in strings, trees, or graphs to capture complex structural relationships.
  • They extend classical context-free grammars by incorporating span variables, binary connectors, and forbidden factors to enhance expressiveness in information extraction and graph parsing.
  • Applications include document extraction, structured data parsing, and evaluation of large language models, underscoring their significance in computational linguistics and formal language theory.

A positional grammar is a formal system in which the generative process for strings, trees, or graphs encodes not only the sequence or structure of objects but also their positions or relationships relative to other elements. These grammars arise in multiple research traditions: as context-free extraction grammars for document spanners, as context-free positional grammars for hypergraph and graph parsing, as surface-linear (regular) grammars capturing non-hierarchical rules in LLMs, and as subregular models formulated over enriched string structures with partial orders. This article surveys formal definitions, expressiveness, parsing algorithms, learnability, and applications across these frameworks.

1. Extraction Grammars and Span-Variable Semantics

Extraction grammars extend ordinary context-free grammars (CFGs) by introducing explicit span variables that mark the start and end positions of substrings within documents. Formally, an extraction grammar is a tuple $G=(X, V, \Sigma, P, S)$ where $X$ is a finite set of span variables, $V$ a nonterminal set, $\Sigma$ the terminal alphabet, $P$ the set of productions (possibly containing special "variable operation" symbols $\#<x, \#>x$ marking the opening/closing of variables), and $S$ the start symbol. A derivation over the extended alphabet produces ref-words, i.e., strings in $(\Sigma\cup\Gamma_X)^*$, where $\Gamma_X = \{\#<x, \#>x \mid x\in X\}$.

A ref-word $r$ is valid if, for each variable $x\in X$, $\#<x$ and $\#>x$ each occur exactly once and in the correct order ($\#<x$ before $\#>x$). Erasing all variable operations yields the underlying document string $d$. The semantics is that each valid ref-word $r$ induces an assignment of document spans $\mu^r(x) = [i, j)$, where $\#<x$ occurs just before position $i$ and $\#>x$ just after position $j-1$ in $d$. The grammar thus defines a relation $\llbracket G \rrbracket(d)$, returning the possible span assignments for $d$ (Peterfreund, 2020).

This mechanism captures salient substrings during generation and enables declarative information extraction far beyond standard regex-based approaches. For example, grammars can specify nested, interleaved, or equal-length captured spans not representable in regular spanner formalisms.
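As a concrete illustration of the ref-word semantics above, the following Python sketch checks validity and computes the induced span assignment. The token encoding (`("open", x)` / `("close", x)` tuples for $\#<x$ / $\#>x$) and the function name are our own assumptions, and spans are 0-indexed for simplicity:

```python
# Hedged sketch of ref-word validity and span semantics; the tuple encoding
# of variable operations is an illustrative assumption, not the paper's.

def spans_of_ref_word(ref_word, variables):
    """Return (document string, {var: (i, j)}) for a valid ref-word,
    or None if the ref-word is invalid."""
    opened, spans = {}, {}
    doc = []
    pos = 0  # current position in the underlying document (0-indexed)
    for tok in ref_word:
        if isinstance(tok, tuple):
            op, x = tok
            if op == "open":
                if x in opened:                # #<x may occur only once
                    return None
                opened[x] = pos
            else:                              # "close"
                if x not in opened or x in spans:
                    return None                # #>x before #<x, or twice
                spans[x] = (opened[x], pos)    # span [i, j) of the document
        else:
            doc.append(tok)
            pos += 1
    if set(spans) != set(variables):           # every variable must be bound
        return None
    return "".join(doc), spans

# Example: the ref-word  a #<x b c #>x d  assigns x the span [1, 3) of "abcd".
```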

2. Positional Grammars for Hypergraphs and Graph Parsing

Context-free positional grammars (PGs) generalize the idea of positional annotation from strings to arbitrary combinatorial structures such as hypergraphs. A PG is formally $G = (N, \Sigma, P, S)$, where productions have the form $p: A \to \prod_{i=1}^n (\#_2\langle k_i, z_i, l_i\rangle\, X_i)$, with $X_i$ a nonterminal or terminal and $\#_2\langle k_i, z_i, l_i\rangle$ a binary connector carrying interface and offset information. Binary connectors encode how the interfaces (attachment points) of each symbol on the production's right-hand side are joined to those already present.

Well-formedness constraints ensure that connectors define unique, symmetric, and transitive interface relations, and that the entry points of a nonterminal are correctly mapped to those on the leftmost RHS symbol—a requirement critical for deterministic LR parsing (Costagliola et al., 7 Jan 2026).
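The symmetry and transitivity requirement means that the connector-induced relation on attachment points is an equivalence relation; the following minimal union-find sketch (our own illustration, not the authors' parser; all names are assumptions) shows how transitive identification of interfaces falls out of recording binary connections:

```python
# Illustrative sketch: connectors that join interface points pairwise induce
# an equivalence over attachment points, which union-find maintains directly.

class InterfaceUnion:
    def __init__(self):
        self.parent = {}

    def find(self, p):
        """Representative of p's equivalence class (with path halving)."""
        self.parent.setdefault(p, p)
        while self.parent[p] != p:
            self.parent[p] = self.parent[self.parent[p]]
            p = self.parent[p]
        return p

    def connect(self, p, q):
        """Record that a binary connector joins interface points p and q."""
        self.parent[self.find(p)] = self.find(q)

    def joined(self, p, q):
        """Symmetric, transitive closure of the recorded connections."""
        return self.find(p) == self.find(q)

u = InterfaceUnion()
u.connect(("X1", 1), ("X2", 0))   # join X1's interface 1 to X2's interface 0
u.connect(("X2", 0), ("X3", 2))
# Transitivity: X1's interface 1 and X3's interface 2 are now identified.
```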

A key advance is the reduction of hyperedge replacement grammar (HRG) productions to PG productions, possibly with traversals reordered by permutations. Duplicate or permuted productions combat ambiguity in graph recognition (distinct parses of the same hypergraph structure). Parsing proceeds via an LR(0)-style item-based algorithm that maintains both production progress and the mapping of interfaces; the time complexity is $O(m + |P'|^2)$ for a graph of $m$ hyperedges and an augmented production set $P'$.

3. Linear and Positional Grammars in Language Modeling

Positional grammars are also studied in the context of LLMs as a class of linear or surface positional rules. These grammars specify regular (finite-state) constraints on token sequences, such as fixed-position insertion, inversion (word order reversal), or language-specific positional inflection (e.g., suffixation at a non-head position).

These can be written as regular grammars with productions like $S\to w_1\,w_2\,w_3\,\text{doesn't}\,w_4$ (fixed negation in English), $S\to w_4\,w_3\,w_2\,w_1$ (full inversion), or language-specific variants. Experiments reveal that LLMs are systematically less accurate at recognizing and generalizing such linear/positional patterns than hierarchical (context-free) ones, and that the implementation of these rules draws on causally disjoint model subnetworks, with less than 6% overlap between linear- and hierarchy-selective circuits (Sankaranarayanan et al., 15 Jan 2025). These subnetworks generalize even to nonce (meaningless) vocabularies, indicating that the learned routines are genuinely structural.
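The two linear rules quoted above can be sketched as token-level transformations (a minimal illustration; the function names and list-of-tokens representation are our own assumptions, not the paper's experimental setup):

```python
# Surface-positional rules operate on linear order only, never on syntax.

def full_inversion(tokens):
    """S -> w4 w3 w2 w1: reverse the surface word order."""
    return tokens[::-1]

def fixed_negation(tokens, position=3, marker="doesn't"):
    """S -> w1 w2 w3 doesn't w4: insert a fixed token at a fixed linear
    position, regardless of the sentence's syntactic structure."""
    return tokens[:position] + [marker] + tokens[position:]
```

Because both rules refer only to linear positions, they apply unchanged to nonce vocabularies such as `["blick", "dax", "wug"]`, which mirrors the generalization tests described above.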

4. Model-Theoretic and Subregular Positional Grammars

The subregular tradition formalizes positional grammars via forbidden factors in enriched string models. Given a finite alphabet $\Sigma$ and unary properties $U=\{P_1, \dots, P_n\}$, an enriched string model $M(w)$ is a structure over the domain $D^{(w)} = \{1, \dots, |w|\}$, with the usual order and unary property relations $P_k^{(w)}$ recording which positions have property $P_k$. A "positional grammar" in this setting is a finite set $G$ of forbidden substructures ("factors") of size at most $k$: $L(G) = \{w \in \Sigma^* \mid \forall g\in G,\; g \not\sqsubseteq M(w)\}$, where $g \sqsubseteq M(w)$ means that $g$ appears as an induced factor of $M(w)$.

This framework generalizes Strictly $k$-Local (SL$_k$) and Strictly $k$-Piecewise (SP$_k$) languages by supporting multiple, cross-cutting properties at each position (e.g., both letter class and capitalization), and defines classes of regular languages via finite forbidden sets in the factor poset (Chandlee et al., 2019).

A bottom-up algorithm for learning such grammars from positive data exploits the partial order of the factor lattice to prune the hypothesis space, ensuring efficient inference provided the data is sparse relative to the space of possible factors.

5. Expressiveness, Evaluation, and Ambiguity

Extraction grammars (context-free with span variables) strictly extend the regular spanner languages: all regular spanners are captured by right-linear grammars (productions of the form $A\to\sigma B$ or $A\to\sigma$, with $\sigma\in\Sigma\cup\Gamma_X$), but context-free extraction grammars can additionally express nested, balanced, and equal-length spans that are unattainable with regular spanners. The class of context-free spanners is incomparable with the "core spanner" classes based on regular patterns plus string equality.

In graph parsing, ambiguity bifurcates into generation ambiguity (distinct derivations yield isomorphic outputs) and recognition ambiguity (multiple parses of the same input structure). The adoption of permutations and positional anchoring in PGs enables the elimination of recognition ambiguity—at the cost of grammar size expansion—while generation ambiguity, inherent to the base formalism, persists (Costagliola et al., 7 Jan 2026).

For extraction grammars, parsing can be carried out in polynomial time in document length (degree polynomial in the number of variables), and for unambiguous grammars, all extracted assignments can be enumerated after quintic preprocessing with constant delay per output (Peterfreund, 2020). For positional string grammars as forbidden-factor systems, language membership remains regular and efficiently testable; the complexity of learning forbidden sets depends on the size and density of the factor space.

6. Applications and Theoretical Significance

Positional grammars find applications across information extraction (typed spans in documents), parsing of structured and semi-structured data (source code, XML/JSON with nested tags), hypergraph and semantic graph parsing, phonological/biological string patterning, and formal study of neural representations in artificial and biological networks. Their ability to encode structural properties that are surface-local, precedence-based, or hierarchically nested allows them to bridge expressive gaps left by classical regular grammars and context-free grammars alone.

Experimental evidence from LLMs demonstrates that positional (linear) rules are neurally dissociable from hierarchical rules, suggesting that both kinds of structural processing can coexist and be learned from data, with distinct causal mechanisms (Sankaranarayanan et al., 15 Jan 2025). The subregular, forbidden-factor perspective further provides a learnable and generalizable formalism for diverse feature-rich symbolic domains (Chandlee et al., 2019).

7. Summary Table: Main Positional Grammar Formalisms

| Formalism | Positional Mechanism | Expressive Power/Class |
|---|---|---|
| Extraction grammars (Peterfreund, 2020) | Span-variable operations ($\#<x$, $\#>x$) | Context-free spanners; strictly extends regular spanners |
| Context-free positional grammars (Costagliola et al., 7 Jan 2026) | Binary connectors in hypergraph/graph parsing | Context-free; enables LR parsing over graphs/hypergraphs |
| Regular/linear grammars in LLMs (Sankaranarayanan et al., 15 Jan 2025) | Surface-position constraints on token order | Regular languages; non-hierarchical surface constraints |
| Partially ordered subregular grammars (Chandlee et al., 2019) | Forbidden factors in enriched string models | Subregular classes SL$_k$, SP$_k$; regular languages with cross-cutting features |

These approaches collectively define the current theoretical and practical landscape of positional grammars as deployed in formal language theory, natural language processing, and computational learning.
