
Composition-Attention Grammars (CAG)

Updated 27 April 2026
  • Composition-Attention Grammars are transition-based syntactic models that jointly generate terminal strings and their corresponding constituency trees.
  • They employ a bidirectional LSTM-based composition function to synthesize subtree embeddings and propagate syntactic attributes.
  • Empirical results show CAGs achieving up to 83.8% accuracy on SyntaxGym benchmarks, outperforming several baseline models in syntactic generalization.

A Composition-Attention Grammar (CAG) is a transition-based, top-down syntactic language model that jointly generates a string of terminals and its corresponding constituency tree. CAGs are a specific form of attribute grammar in which each subtree is associated with a synthesized attribute vector computed by a neural composition function, and the prediction of subsequent actions is conditioned on a self-attention mechanism over partially constructed subtrees. CAGs combine the strengths of structured composition and non-local attention to induce human-like syntactic generalization in language modeling (Yoshida & Oseki, 2022).

1. Formal Structure and Action Dynamics

CAGs maintain an explicit stack of embeddings during parse generation. The vocabulary of allowed actions consists of:

  • NT(X): Open a new nonterminal labeled X.
  • GEN(x): Generate a terminal symbol x.
  • REDUCE: Close the most recent open nonterminal.

The model defines a joint probability over the generated terminal string $X$ and its constituency tree $Y$, parameterized as

$$p(X, Y) = \prod_{t=1}^{n} p(a_t \mid a_{<t})$$

where $a_t$ is chosen from $\{\mathrm{NT}(\cdot), \mathrm{GEN}(\cdot), \mathrm{REDUCE}\}$. At each timestep, the model uses the self-attention summary of current stack vectors to score all possible next actions. Recurrence is only via the stack states and their learned embeddings.
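The action vocabulary and the factorized joint probability can be illustrated with a small sketch. The nested-tuple tree encoding and the helper names `tree_to_actions` and `joint_log_prob` are assumptions made for this example, not part of the original model code; the per-step probabilities would come from the trained model.

```python
import math

def tree_to_actions(tree):
    """Convert a nested (label, child, ...) tuple into a top-down action list."""
    if isinstance(tree, str):              # a terminal word
        return [("GEN", tree)]
    label, *children = tree
    actions = [("NT", label)]              # open the nonterminal
    for child in children:
        actions.extend(tree_to_actions(child))
    actions.append(("REDUCE", None))       # close the most recent open nonterminal
    return actions

def joint_log_prob(step_probs):
    """log p(X, Y) = sum over t of log p(a_t | a_<t)."""
    return sum(math.log(p) for p in step_probs)

# Hypothetical toy tree: (S (NP the dog) (VP barks))
tree = ("S", ("NP", "the", "dog"), ("VP", "barks"))
actions = tree_to_actions(tree)
terminals = [arg for kind, arg in actions if kind == "GEN"]
```

Here the single action sequence determines both the terminal string (the GEN arguments, in order) and the tree (the NT/REDUCE bracketing), which is why one product over actions gives the joint probability of both.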

2. Neural Composition Function

The core of CAGs' attribute propagation is a bidirectional LSTM-based composition function applied at every REDUCE operation. When a REDUCE closes a nonterminal spanning stack positions $l$ through $m$, the child embeddings $[\mathbf{e}_l, \ldots, \mathbf{e}_m]$ are passed through a BiLSTM:

  • Forward LSTM: $\overrightarrow{h}_i = \mathrm{LSTM_{fwd}}(\mathbf{e}_i, \overrightarrow{h}_{i-1})$ for $i = l$ to $m$
  • Backward LSTM: $\overleftarrow{h}_i = \mathrm{LSTM_{bwd}}(\mathbf{e}_i, \overleftarrow{h}_{i+1})$ for $i = m$ down to $l$

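The output subtree vector is then computed as:

$$\mathbf{e}_{\mathrm{sub}} = W_c [\overrightarrow{h}_m ; \overleftarrow{h}_l] + b_c$$

where $W_c$ and $b_c$ are learned projection parameters.

This function can be succinctly denoted as $\mathbf{e}_{\mathrm{sub}} = \mathrm{Comp}(\mathbf{e}_l, \ldots, \mathbf{e}_m)$.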
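A minimal pure-Python sketch of such a composition step follows. It assumes an affine projection of the concatenated final forward and backward states with no nonlinearity, uses randomly initialised (untrained) weights, and the names `TinyLSTMCell`, `run`, and `compose` are hypothetical; a real implementation would use a deep-learning library and trained parameters.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

class TinyLSTMCell:
    """Minimal LSTM cell with random, untrained weights (illustration only)."""
    def __init__(self, dim, rng):
        self.dim = dim
        # One weight row per gate unit over the concatenated [input; hidden] vector.
        self.W = [[rng.uniform(-0.1, 0.1) for _ in range(2 * dim)]
                  for _ in range(4 * dim)]
        self.b = [0.0] * (4 * dim)

    def step(self, x, h, c):
        xh = x + h                                   # list concatenation
        z = [sum(w * v for w, v in zip(row, xh)) + b
             for row, b in zip(self.W, self.b)]
        d = self.dim
        i = [sigmoid(v) for v in z[:d]]              # input gate
        f = [sigmoid(v) for v in z[d:2 * d]]         # forget gate
        o = [sigmoid(v) for v in z[2 * d:3 * d]]     # output gate
        g = [math.tanh(v) for v in z[3 * d:]]        # candidate cell state
        c_new = [fj * cj + ij * gj for fj, cj, ij, gj in zip(f, c, i, g)]
        h_new = [oj * math.tanh(cj) for oj, cj in zip(o, c_new)]
        return h_new, c_new

def run(cell, inputs):
    """Run the cell over a sequence and return the final hidden state."""
    h, c = [0.0] * cell.dim, [0.0] * cell.dim
    for x in inputs:
        h, c = cell.step(x, h, c)
    return h

def compose(children, fwd, bwd, W_c, b_c):
    """Map child embeddings e_l..e_m to a single subtree vector."""
    h_fwd = run(fwd, children)           # final forward state, at position m
    h_bwd = run(bwd, children[::-1])     # final backward state, at position l
    concat = h_fwd + h_bwd
    # Assumed affine projection back to the embedding dimension.
    return [sum(w * v for w, v in zip(row, concat)) + b
            for row, b in zip(W_c, b_c)]

rng = random.Random(0)
dim = 4
fwd, bwd = TinyLSTMCell(dim, rng), TinyLSTMCell(dim, rng)
W_c = [[rng.uniform(-0.1, 0.1) for _ in range(2 * dim)] for _ in range(dim)]
b_c = [0.0] * dim
children = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(3)]
subtree_vec = compose(children, fwd, bwd, W_c, b_c)
```

The point of the sketch is the shape contract: any number of child vectors goes in, and one vector of the same dimension comes out, so the composed subtree can sit on the stack like any other element.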

3. Self-Attention Over Partial Parses

At each action, the stack contains embeddings $[\mathbf{e}_1, \ldots, \mathbf{e}_k]$, stacked row-wise into a matrix $E$. Self-attention is computed according to the standard scaled dot-product paradigm:

  • Queries, keys, values: $Q = E W_Q$, $K = E W_K$, $V = E W_V$ for learned $W_Q, W_K, W_V$
  • Attention matrix: $A = \mathrm{softmax}(Q K^\top / \sqrt{d_k})$, where $d_k$ is the key dimension
  • Attended outputs: $H = A V$

A summary vector $\mathbf{s}_t$ is extracted from $H$ (e.g., by last-row selection or pooling). The multi-head variant repeats this process per head and concatenates, and is notated as:

$$\mathbf{s}_t = \mathrm{MultiHeadAttn}(\mathbf{e}_1, \ldots, \mathbf{e}_k)$$
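A single-head version of scaled dot-product attention over the stack can be sketched in plain Python. The function names and the choice of last-row selection for the summary are assumptions for illustration; the identity projection matrices in the example stand in for learned weights.

```python
import math

def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def softmax(row):
    m = max(row)                                  # subtract max for stability
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def stack_attention(E, W_q, W_k, W_v):
    """Return the attention matrix and a summary vector for stack matrix E."""
    Q, K, V = matmul(E, W_q), matmul(E, W_k), matmul(E, W_v)
    d_k = len(K[0])
    K_T = [list(col) for col in zip(*K)]
    scores = matmul(Q, K_T)
    A = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    H = matmul(A, V)
    return A, H[-1]                               # last-row selection (assumed)

# Toy stack of three 2-dimensional embeddings; identity projections.
E = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
I2 = [[1.0, 0.0], [0.0, 1.0]]
A, summary = stack_attention(E, I2, I2, I2)
```

A multi-head variant would run `stack_attention` once per head with separate projections and concatenate the resulting summaries.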

4. Parsing, Scoring, and Training Objective

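The complete parsing and scoring workflow is as follows:

  1. Initialize an empty stack.
  2. At each timestep $t$, compute the self-attention summary over the current stack vectors and use it to score all possible next actions, yielding $p(a_t \mid a_{<t})$.
  3. Apply the chosen action: NT(X) and GEN(x) push a new embedding onto the stack, while REDUCE replaces the children of the most recent open nonterminal with their composed subtree vector.
  4. Repeat until the tree is complete, accumulating $\log p(a_t \mid a_{<t})$ to obtain $\log p(X, Y)$.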
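The stack dynamics behind this workflow can be sketched as a small scoring loop. Everything here is a stand-in: `embed`, `compose`, and `action_prob` are hypothetical callables that a real CAG would replace with learned embeddings, the BiLSTM composition function, and the attention-based action scorer. Passing the open nonterminal's vector into `compose` is also an assumption of this sketch.

```python
import math

def score_derivation(actions, embed, compose, action_prob):
    """Walk an action sequence, updating the stack and summing log-probs."""
    stack = []                       # entries are (vector, is_open_nonterminal)
    log_prob = 0.0
    for action in actions:
        kind, arg = action
        # Score the action given the current stack contents.
        log_prob += math.log(action_prob([v for v, _ in stack], action))
        if kind == "NT":
            stack.append((embed("(" + arg), True))
        elif kind == "GEN":
            stack.append((embed(arg), False))
        elif kind == "REDUCE":
            children = []
            while not stack[-1][1]:                  # pop back to the open NT
                children.append(stack.pop()[0])
            label_vec, _ = stack.pop()
            children.reverse()                       # restore left-to-right order
            stack.append((compose(label_vec, children), False))
    return log_prob, stack

# Toy stand-ins so the loop can run without a trained model.
embed = lambda symbol: [float(len(symbol))]
compose = lambda label_vec, kids: [label_vec[0] + sum(k[0] for k in kids)]
action_prob = lambda stack_vectors, action: 0.5

actions = [("NT", "S"), ("NT", "NP"), ("GEN", "the"), ("GEN", "dog"),
           ("REDUCE", None), ("NT", "VP"), ("GEN", "barks"),
           ("REDUCE", None), ("REDUCE", None)]
log_prob, stack = score_derivation(actions, embed, compose, action_prob)
```

After the final REDUCE the stack holds exactly one vector, the composed representation of the whole tree, and `log_prob` is the sum of the per-action log-probabilities.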

The model is trained by maximizing the log-likelihood of the gold action sequence, i.e. by minimizing

$$\mathcal{L} = -\sum_{t=1}^{n} \log p(a_t \mid a_{<t})$$

where $a_1, \ldots, a_n$ is the gold action sequence. Only the negative log-likelihood loss is used; no auxiliary objectives are introduced.

5. Experimental Setup

CAGs were evaluated by Yoshida & Oseki (2022) on broad-coverage data against a range of baseline architectures, both with and without explicit syntax:

  • Training corpus: BLLIP-lg (Brown 1987–89), re-parsed with the Kitaev & Klein (2018) constituency parser; […]M sentences ([…]M tokens).
  • Model classes (all […]M params):
    • LSTM
    • ActionLSTM
    • RNNG
    • Transformer
    • PLM
    • PLM-mask
    • CAG

Key architectural hyperparameters:

  • LSTM: 2 layers, 301 hidden units
  • Transformer: 3 layers, 272 hidden, 4 heads
  • CAG: 3 layers, 256 hidden, 4 heads
  • Dropout: 0.1; learning rate […]; batch size 256; 15 epochs

Evaluation metrics: Syntactic generalization on the SyntaxGym benchmark, using six test circuits:

  1. Agreement
  2. Licensing
  3. Garden-Path Effects
  4. Gross Syntactic State
  5. Center Embedding
  6. Long-Distance Dependencies

Performance is measured as the percentage of test suites passed, comparing model probabilities for minimal-pair grammatical/ungrammatical stimuli, with syntactic models decoded using word-synchronous beam search (action beam 100, word beam 10, fast-track 5).
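The pass/fail logic behind a minimal-pair comparison can be sketched as below. This is a simplification under stated assumptions: each item is reduced to a pair of whole-sentence log-probabilities, whereas actual SyntaxGym suites state their predictions over surprisal in specific sentence regions. The function name and the toy numbers are hypothetical, not results from the paper.

```python
def minimal_pair_accuracy(pairs):
    """pairs: list of (logp_grammatical, logp_ungrammatical) tuples.

    An item passes when the model assigns the grammatical variant a
    strictly higher log-probability than the ungrammatical one.
    """
    passed = sum(1 for good, bad in pairs if good > bad)
    return 100.0 * passed / len(pairs)

# Made-up log-probabilities for four items (illustration only).
toy_pairs = [(-12.1, -15.4), (-9.8, -9.1), (-20.3, -22.0), (-7.5, -11.2)]
accuracy = minimal_pair_accuracy(toy_pairs)
```

In this toy set three of the four items pass, so `accuracy` is 75.0.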

6. Empirical Results and Findings

Overall SyntaxGym Accuracy

Model | Accuracy (%)
--- | ---
LSTM | 56.6 ± 3.3
Transformer | 48.1 ± 1.5
ActionLSTM | 72.5 ± 1.8
PLM | 75.4 ± 0.2
RNNG | 81.1 ± 2.8
PLM-mask | 69.6 ± 0.9
CAG | 83.8 ± 1.4

Adding explicit syntax yields increases of 15.9–27.3 points over syntax-absent baselines (LSTM → ActionLSTM, Transformer → PLM); composition gives an 8.4–8.6 point boost (PLM → CAG, ActionLSTM → RNNG); adding self-attention in syntax-aware models yields an additional 2.7–2.9 points (RNNG → CAG, ActionLSTM → PLM).
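These gains can be recomputed from the mean accuracies in the overall table. The pairing of each model with its baseline (for example, treating ActionLSTM as the syntax-aware counterpart of LSTM) is an assumption of this sketch, inferred from the three factors named in the text.

```python
# Mean overall SyntaxGym accuracies, copied from the table above.
overall = {
    "LSTM": 56.6, "Transformer": 48.1, "ActionLSTM": 72.5, "PLM": 75.4,
    "RNNG": 81.1, "PLM-mask": 69.6, "CAG": 83.8,
}

def gain(base, improved):
    """Accuracy difference in points, rounded to one decimal place."""
    return round(overall[improved] - overall[base], 1)

# Explicit syntax: sequential model vs. its action-generating counterpart.
syntax_gains = [gain("LSTM", "ActionLSTM"), gain("Transformer", "PLM")]
# Composition: same backbone with and without a composition function.
composition_gains = [gain("ActionLSTM", "RNNG"), gain("PLM", "CAG")]
# Self-attention: LSTM-based vs. attention-based syntax-aware models.
attention_gains = [gain("ActionLSTM", "PLM"), gain("RNNG", "CAG")]
```

Under this pairing the three factors contribute 15.9 and 27.3 points (syntax), 8.6 and 8.4 points (composition), and 2.9 and 2.7 points (self-attention).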

Per-Circuit Accuracies

Circuit | LSTM | ActionLSTM | RNNG | Transformer | PLM | PLM-mask | CAG
--- | --- | --- | --- | --- | --- | --- | ---
Agreement | 43.9 | 81.9 | 77.8 | 21.1 | 81.3 | 80.7 | 79.5
Licensing | 26.9 | 60.0 | 83.0 | 3.7 | 61.1 | 42.7 | 87.0
Garden-Path Effects | 69.6 | 80.1 | 83.1 | 67.9 | 82.2 | 82.0 | 84.6
Gross Syntactic State | 97.8 | 90.6 | 99.3 | 89.9 | 96.4 | 91.3 | 99.6
Center Embedding | 70.2 | 78.0 | 73.2 | 72.6 | 81.0 | 77.4 | 79.2
Long-Distance Dependencies | 64.7 | 68.4 | 71.5 | 71.9 | 73.9 | 76.9 | 73.9
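To see where CAG leads and where it does not, the per-circuit table can be scanned for the top-scoring model in each row. The snippet below only re-reads the numbers above; the variable names are illustrative.

```python
models = ["LSTM", "ActionLSTM", "RNNG", "Transformer", "PLM", "PLM-mask", "CAG"]

# Per-circuit accuracies, in the same column order as `models`.
per_circuit = {
    "Agreement":                  [43.9, 81.9, 77.8, 21.1, 81.3, 80.7, 79.5],
    "Licensing":                  [26.9, 60.0, 83.0, 3.7, 61.1, 42.7, 87.0],
    "Garden-Path Effects":        [69.6, 80.1, 83.1, 67.9, 82.2, 82.0, 84.6],
    "Gross Syntactic State":      [97.8, 90.6, 99.3, 89.9, 96.4, 91.3, 99.6],
    "Center Embedding":           [70.2, 78.0, 73.2, 72.6, 81.0, 77.4, 79.2],
    "Long-Distance Dependencies": [64.7, 68.4, 71.5, 71.9, 73.9, 76.9, 73.9],
}

# Name of the highest-scoring model for each circuit.
best = {
    circuit: models[max(range(len(models)), key=lambda i: scores[i])]
    for circuit, scores in per_circuit.items()
}
cag_wins = sorted(c for c, m in best.items() if m == "CAG")
```

CAG has the highest score on three of the six circuits (Licensing, Garden-Path Effects, Gross Syntactic State), while ActionLSTM, PLM, and PLM-mask lead on Agreement, Center Embedding, and Long-Distance Dependencies respectively.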

7. Interpretation and Theoretical Significance

Ablation and error analyses indicate that the neural composition function in CAGs improves performance on linguistic phenomena that rely on syntactic feature percolation, such as Licensing, Garden-Path Effects, and Gross Syntactic State. For instance, reflexive dependencies, which require number agreement with the subject NP, are handled successfully because the entire NP is embedded into a single vector that encodes this syntactic information.

However, composition can hinder tasks requiring semantic or lexical plausibility (notably Center Embedding and some Long-Distance Dependency tests), as embedding-based summarization may obscure head-noun semantics necessary for animacy-based plausibility computations.

The composition function in CAGs thus specializes in propagating syntactic attributes (e.g., number, hierarchical structure), but does not robustly encode lower-level semantic or lexical features. The introduction of stack self-attention complements composition by enabling both local and non-local structural integration, yielding the highest overall syntactic generalization among the evaluated models and approximating human-like syntactic competence more closely than the alternatives in the experimental comparison (Yoshida & Oseki, 2022).
