
Constituency Parse Extraction from PLMs

Updated 1 February 2026
  • CPE-PLM is a method that extracts full constituency parse trees directly from frozen language models without explicit parser training.
  • It utilizes sequence labeling and chart-based extraction with self-attention, embeddings, and perturbation signals to accurately reconstruct binary trees.
  • Empirical evaluations show that CPE-PLM achieves competitive performance with unsupervised PCFGs and transfers effectively across multiple languages and low-resource settings.

Constituency Parse Extraction from Pre-trained Language Models (CPE-PLM) refers to the set of methodologies that induce full constituency parse trees directly from the internal representations of frozen pre-trained language models, without explicit supervised parser training. These methods are typically parameter-free at extraction time or rely on minimal probing layers atop contextual encoders. CPE-PLM leverages features such as self-attention heads, contextual embeddings, and representation-perturbation signals to construct binary constituency trees, often achieving performance competitive with unsupervised probabilistic parsers and robust enough for application across multiple languages and grammars.

1. Formal Definitions and Core Encodings

The CPE-PLM paradigm encompasses two main families: sequence-labeling probes and chart-based extraction from span or distance scores.

  • Sequence Labeling Encodings: A bijective mapping is defined between a constituency tree T and a sequence of local labels ℓ_i = (n_i, C_i, u_i) for each position i, where n_i is the change in shared-ancestor depth across word boundaries, C_i is the lowest-common-ancestor nonterminal, and u_i is a unary-chain encoding (Vilares et al., 2020, Muñoz-Ortiz et al., 2023). Such encodings are lossless and reconstruct the tree exactly in O(n) time.
  • Span/distance-based chart extraction: For a sentence of n tokens, all O(n²) contiguous spans (i, j) are assigned scalar scores s(i, j) measuring the likelihood of constituenthood. These scores derive from pairwise attention-head distances (using metrics like Jensen–Shannon or Hellinger divergence between attention distributions), average token-level distortions under syntactic perturbations, or cross-span compositionality measures (Kim et al., 2020, Li et al., 2023). The resulting score matrix is used to extract the binary parse tree minimizing total span cost via CKY or top-down recursive algorithms.
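The attention-distance signal described above can be sketched in a few lines. This is a minimal, self-contained illustration with hand-made toy numbers, not a real extraction pipeline: `attn` stands in for the attention rows of one head over a 4-token sentence, and a large Jensen–Shannon divergence between the rows of adjacent tokens is read as a likely constituent boundary.

```python
import math

def js_divergence(p, q):
    """Jensen-Shannon divergence between two discrete distributions."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    def kl(a, b):
        return sum(ai * math.log(ai / bi) for ai, bi in zip(a, b) if ai > 0)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def syntactic_distances(attn):
    """d[i] = JSD between the attention rows of tokens i and i+1.

    attn: n x n list of attention distributions (one row per token)
    from a single head; a larger d[i] suggests a constituent boundary
    between positions i and i+1.
    """
    return [js_divergence(attn[i], attn[i + 1]) for i in range(len(attn) - 1)]

# Toy head: tokens 0-1 attend similarly, as do tokens 2-3, so the
# largest adjacent-token divergence falls between positions 1 and 2.
attn = [
    [0.60, 0.30, 0.05, 0.05],
    [0.55, 0.35, 0.05, 0.05],
    [0.05, 0.05, 0.50, 0.40],
    [0.05, 0.05, 0.45, 0.45],
]
d = syntactic_distances(attn)
print(max(range(3), key=lambda i: d[i]))  # → 1 (boundary between tokens 1 and 2)
```

In the chart-based methods, these pairwise distances are aggregated into span scores s(i, j) before decoding; the Hellinger variant simply swaps the divergence function.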

2. Extraction Algorithms: Chart, Greedy, and Pointing

  • Chart-based CKY decoding: The dominant algorithm is bottom-up dynamic programming: for each span (i, j),

s_span(i, j) = s_comp(i, j) + min_{i ≤ k < j} [ s_span(i, k) + s_span(k+1, j) ]

The full tree T* is the one minimizing Σ_{(i,j) ∈ T} s_span(i, j) (Kim et al., 2020, Kim, 2022, Li et al., 2023). Fixed-length normalization and multi-head ensemble averaging are used to stabilize extraction.
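A minimal sketch of this recurrence, under the assumption that compositional span scores have already been computed (the `comp` dictionary below is a hypothetical stand-in for scores derived from attention distances or distortions):

```python
from functools import lru_cache

def cky_extract(comp):
    """Extract the binary tree minimizing total span cost.

    comp[(i, j)] -> compositional score s_comp(i, j) for the span of
    tokens i..j (inclusive). Implements
        s_span(i, j) = s_comp(i, j) + min_k [s_span(i, k) + s_span(k+1, j)]
    and returns (best_cost, tree), where the tree is a nesting of
    ((i, j), (left_subtree, right_subtree)) with bare ints as leaves.
    """
    n_last = max(j for _, j in comp)  # index of the last token

    @lru_cache(maxsize=None)
    def best(i, j):
        if i == j:
            return 0.0, i  # single token: leaf, zero cost
        candidates = []
        for k in range(i, j):
            lc, lt = best(i, k)
            rc, rt = best(k + 1, j)
            candidates.append((lc + rc, (lt, rt)))
        cost, split = min(candidates, key=lambda c: c[0])
        return comp[(i, j)] + cost, ((i, j), split)

    return best(0, n_last)

# Toy 4-token sentence: spans (0,1) and (2,3) are cheap (constituent-like),
# crossing spans are expensive, so the parser should bracket [[0 1] [2 3]].
comp = {(0, 1): 0.1, (2, 3): 0.1, (1, 2): 0.9,
        (0, 2): 0.8, (1, 3): 0.8, (0, 3): 0.2}
cost, tree = cky_extract(comp)
```

The exhaustive split search is O(n³) overall; the greedy top-down variants cited above trade this exactness for speed by committing to the best split at each level.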

  • Top-K and ensemble selection: Instead of relying on a single attention head, ensemble extraction averages the span scores from the best K heads chosen by dev-set F₁ or by unsupervised ranking criteria ("Heads-up!" dynamic K) (Kim, 2022, Li et al., 2020). Greedy and beam-search algorithms select maximally consistent sets of heads to form robust ensemble parses.
  • Pointing-based and fast decoders: Some architectures reduce parsing to a sequence of pointing or boundary-selection tasks using local cross-entropy losses, yielding competitive O(n²) greedy decoders (Nguyen et al., 2020).
  • Contextual distortion extraction: For masked LMs, syntactic constituency can be detected as spans that minimally distort contextual representations under perturbations (substitution, decontextualization, movement), and a CKY variant parses the normalized distortion matrix (Li et al., 2023).
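The perturb-and-measure idea can be made concrete with a schematic sketch. The `encode` and `decontextualize` functions below are deliberately trivial stand-ins (a "contextual" vector is just the token id plus its immediate neighbors), NOT a real masked LM; the point is only the scoring logic, in which a span whose tokens change little under perturbation scores as more constituent-like.

```python
import math

def l2(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def span_distortion(words, i, j, encode, perturb):
    """Average per-token representation change inside span [i, j] when
    the sentence is perturbed; low distortion suggests a constituent."""
    orig, pert = encode(words), encode(perturb(words, i, j))
    return sum(l2(orig[t], pert[t]) for t in range(i, j + 1)) / (j - i + 1)

# Toy stand-ins for a contextual encoder and a perturbation operator.
def encode(words):
    n = len(words)
    return [[words[t],
             words[t - 1] if t > 0 else 0,
             words[t + 1] if t < n - 1 else 0] for t in range(n)]

def decontextualize(words, i, j):
    # Zero out every token outside the span, keeping the span itself.
    return [w if i <= t <= j else 0 for t, w in enumerate(words)]

sent = [5, 6, 7, 8]
d01 = span_distortion(sent, 0, 1, encode, decontextualize)
d12 = span_distortion(sent, 1, 2, encode, decontextualize)
```

In the actual method, `encode` is a frozen masked LM, several perturbations (substitution, decontextualization, movement) are averaged, and the normalized distortion matrix feeds the CKY variant described above.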

3. Empirical Evaluation and Performance

English Constituency Parsing (PTB, F₁)

Method                      | Model(s)             | Dev-tuned? | F₁ (%) | Reference
Sequence labeling (ff-ft)   | BERT-base            | Yes        | 93.5   | (Vilares et al., 2020)
Chart-based Top-20 ensemble | RoBERTa/XLNet        | Yes        | 46.4   | (Kim et al., 2020)
Chart-based greedy beam     | 16 PLMs              | Yes        | 55.7   | (Kim, 2022)
Contextual distortion       | BERT-large, RoBERTa  | No         | 48.8   | (Li et al., 2023)
Pointing (fine-tuned)       | BERT-large           | Yes        | 95.48  | (Nguyen et al., 2020)
Heads-up!                   | XLNet-base, RoBERTa  | No         | 42.7   | (Li et al., 2020)

Unsupervised CPE-PLM chart ensembles match the F₁ of neural PCFGs and in some configurations outperform tuned approaches on small categories and adverbial phrases (Kim, 2022, Kim et al., 2020). The best parameter-free methods reach ~46–56 F₁; probe/fine-tuned architectures approach supervised SOTA (≥95 F₁).

Multilingual Performance and Transfer

CPE-PLM methods generalize robustly to morphologically rich languages (SPMRL datasets) and low-resource settings, matching or exceeding traditional PCFGs in 8/9 languages (Kim et al., 2020, Kim, 2022). Zero-shot ensembles use English-tuned heads for cross-lingual parsing with a <2 F₁ drop. Multilingual PLMs (mBERT, XLM-R) sustain high bracket recall even with small probes if the language is present in the pretraining corpus (Muñoz-Ortiz et al., 2023, Tran et al., 2020).

4. Probing, Interpretability, and Implied Syntactic Biases

  • Probing analyses demonstrate that a single linear classifier on last-layer LM vectors suffices to recover tree brackets and constituent categories (depth offsets, LCA labels, chunk tag boundaries) (Arps et al., 2022, Vilares et al., 2020). Linear separability is high (82–95% bracket F₁), and syntactic signals are distributed across middle layers.
  • Attention head structure: Certain heads cluster words into constituents (horizontal blocks in heatmaps), and middle-to-upper Transformer layers recurrently host universal phrase detectors (Kim et al., 2020). Attention-distance-based scores are more informative than hidden vector metrics for zero-shot parsing (Kim et al., 2020).
  • Contextual distortion: Low average distortion under perturbation identifies true constituents, with movement-based perturbations most critical for SBAR/PP/ADVP recall (Li et al., 2023).
  • Sequence labeling vs. chart methods: Both paradigms are viable; no intrinsic formalism bias is observed in LM representations (constituency and dependency structures are recoverable to similar degrees) (Muñoz-Ortiz et al., 2023).
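The "single linear classifier suffices" finding can be illustrated with a from-scratch logistic-regression probe. The feature vectors below are toy stand-ins for frozen last-layer LM states, and the 0/1 labels stand in for a local syntactic property such as "this token opens a bracket"; the probe itself is just one linear layer trained by SGD, with the encoder never updated.

```python
import math

def train_probe(X, y, lr=0.5, epochs=300):
    """Logistic-regression probe: a single linear layer + sigmoid
    trained with SGD on frozen feature vectors. X: feature vectors,
    y: 0/1 labels. Returns learned weights and bias."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            z = sum(wj * xj for wj, xj in zip(w, xi)) + b
            p = 1 / (1 + math.exp(-z))
            g = p - yi  # gradient of the log-loss w.r.t. the logit
            w = [wj - lr * g * xj for wj, xj in zip(w, xi)]
            b -= lr * g
    return w, b

def predict(w, b, x):
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b > 0 else 0

# Toy "frozen LM vectors": dimension 0 is active for bracket-opening tokens.
X = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.9]]
y = [1, 1, 0, 0]
w, b = train_probe(X, y)
acc = sum(predict(w, b, xi) == yi for xi, yi in zip(X, y)) / len(y)
```

High accuracy for such a probe is evidence that the property is linearly separable in the frozen representations, which is the sense in which the probing results above are interpreted.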

5. Biases, Limitations, and Control Experiments

Branching bias, a systematic preference for certain tree shapes (right-branching, head-initial), is a recurring concern. Margin-based chart parsers ("Mart") and prefix-only attention feature definitions induce strong right-branching bias, significantly inflating F₁ on English and suppressing scores in reversed-language controls (Li et al., 2020). Distance-based (Dist) parsers and symmetric feature scores (full attention or hidden-vector distances) remain unbiased, and the reversed-corpus gap Δ_bias provides an algorithm-agnostic metric of structural directionality. Explicit right-branching bias injection boosts SBAR/VP recall by up to 10 points when tuned, but should be set to zero for interpretable probing (Kim et al., 2020).
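The reversed-corpus gap can be illustrated with the simplest biased "parser" imaginable: a baseline that always outputs a fully right-branching tree. This is a toy sketch (the 4-token gold spans are invented for illustration), but it shows why running the same parser on reversed sentences exposes a directional preference that the forward F₁ alone hides.

```python
def right_branching_spans(n):
    """Internal spans (inclusive) of a fully right-branching binary tree."""
    return {(i, n - 1) for i in range(n - 1)}

def bracket_f1(pred, gold):
    """Unlabeled bracketing F1 between two span sets."""
    tp = len(pred & gold)
    if not pred or not gold or tp == 0:
        return 0.0
    p, r = tp / len(pred), tp / len(gold)
    return 2 * p * r / (p + r)

n = 4
gold = {(0, 3), (1, 3), (2, 3)}       # right-branching gold (English-like)
pred_fwd = right_branching_spans(n)   # baseline output on the normal corpus
# The same baseline run on reversed sentences, mapped back to original
# token indices, yields purely left-branching spans:
pred_rev = {(0, n - 1 - i) for i in range(n - 1)}
delta_bias = bracket_f1(pred_fwd, gold) - bracket_f1(pred_rev, gold)
```

Here the baseline scores a perfect forward F₁ purely because English-like gold trees lean right, while the reversed control collapses; a large Δ_bias therefore flags shape preference rather than genuine syntactic signal.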

Mitigation strategies include symmetric chart decoders, feature selection ablations, and universal reversed-language benchmarking (Li et al., 2020, Kim, 2022).

6. Practical Applications and Use Cases

CPE-PLM trees serve as high-quality pseudo-annotations for downstream models:

  • Bootstrapping unsupervised RNNGs: Trees induced by CPE-PLM allow training RNNGs/URNNGs to achieve strong parsing and lower LLM perplexity (Kim, 2022).
  • Tree-LSTM text classification: Classifiers using CPE-PLM trees nearly reach supervised parse accuracy (Kim, 2022).
  • Silver parser distillation: Fast supervised parsers trained on CPE-PLM outputs match or exceed their unlabeled F₁ in under 1% of the inference time.
  • Few-shot regimes: CPE-PLM needs only 1% of dev data for ensemble tuning to outperform vanilla few-shot supervised parsers by >30 F₁ (Kim, 2022).

7. Methodological Variants and Future Directions

  • LLM-based parsing: Modern LLMs (ChatGPT, GPT-4, LLaMA, OPT) can predict linearized bracketed trees using prompt-based or fine-tuned strategies. Zero-shot and 5-shot in-context learning yield limited valid-parse accuracy; fine-tuning brings LLM parses to SOTA in-domain F₁, but hallucination errors, domain shift, and invalid-tree rates remain challenging (Bai et al., 2023). Constrained decoding and grammar-aware beam search are recommended for robust deployment.
  • Language, tokenization, and resource effects: Language presence in LM pretraining is a stronger determinant of recoverable syntax than labeled treebank size (Muñoz-Ortiz et al., 2023). Subword tokenization is preferable to character-based models for reliable structure extraction.
  • Open problems: Extensions to dependency parsing via perturbation, autoregressive LM adaption, and hybrid self-training approaches are promising future avenues (Li et al., 2023, Kim et al., 2020). Full theoretical characterizations of why MLM objectives encode Inside-Outside marginal probabilities in PCFGs remain active research (Zhao et al., 2023).
