Composition Attention Grammars (CAG)
- Composition Attention Grammars are transition-based syntactic language models that jointly generate terminal strings and their corresponding constituency trees.
- They employ a bidirectional LSTM-based composition function to synthesize subtree embeddings and propagate syntactic attributes.
- Empirical results show CAGs achieving up to 83.8% accuracy on SyntaxGym benchmarks, outperforming several baseline models in syntactic generalization.
A Composition Attention Grammar (CAG) is a transition-based, top-down syntactic language model that jointly generates a string of terminals and its constituency tree. CAGs can be viewed as a form of attribute grammar in which each subtree carries a synthesized attribute vector computed by a neural composition function, and in which the prediction of each subsequent action is conditioned on self-attention over the partially constructed subtrees. By combining structured composition with non-local attention, CAGs aim to induce human-like syntactic generalization in language modeling (Yoshida & Oseki, 2022).
1. Formal Structure and Action Dynamics
CAGs maintain an explicit stack of embeddings during parse generation. The vocabulary of allowed actions consists of:
- NT(X): Open a new nonterminal labeled X.
- GEN(x): Generate a terminal symbol x.
- REDUCE: Close the most recent open nonterminal.
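The model defines a joint probability over the generated terminal string $\mathbf{x}$ and its constituency tree $\mathbf{y}$, parameterized as

$$p(\mathbf{x}, \mathbf{y}) = \prod_{t=1}^{T} p(a_t \mid a_{<t}),$$

where $a_1, \dots, a_T$ is the action sequence that derives $(\mathbf{x}, \mathbf{y})$ and each $a_t$ is chosen from $\{\mathrm{NT}(X), \mathrm{GEN}(x), \mathrm{REDUCE}\}$. At each timestep, the model uses the self-attention summary of the current stack vectors to score all possible next actions. Recurrence is only via the stack states and their learned embeddings.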
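The three actions can be illustrated with a small symbolic replay that builds a bracketed tree instead of embeddings. This is a minimal sketch, not the authors' implementation: the function name `replay`, the tuple encoding of actions, and the `"OPEN"` marker are all assumptions made for illustration.

```python
from typing import List, Tuple, Union

Action = Tuple[str, ...]   # ("NT", "S"), ("GEN", "dog"), or ("REDUCE",)  (hypothetical encoding)
OpenNT = Tuple[str, str]   # marker for a nonterminal that is still open


def replay(actions: List[Action]) -> Tuple[str, List[str]]:
    """Replay an action sequence; return (bracketed tree, terminal string)."""
    stack: List[Union[str, OpenNT]] = []   # finished subtrees and terminals are plain strings
    terminals: List[str] = []
    for action in actions:
        kind = action[0]
        if kind == "NT":
            stack.append(("OPEN", action[1]))
        elif kind == "GEN":
            stack.append(action[1])
            terminals.append(action[1])
        elif kind == "REDUCE":
            children: List[str] = []
            # Pop completed children until the most recent open nonterminal.
            while stack and isinstance(stack[-1], str):
                children.append(stack.pop())
            if not stack or not children:
                raise ValueError("REDUCE needs an open nonterminal with children")
            label = stack.pop()[1]
            stack.append("(" + label + " " + " ".join(reversed(children)) + ")")
        else:
            raise ValueError(f"unknown action: {kind}")
    if len(stack) != 1 or not isinstance(stack[0], str):
        raise ValueError("action sequence does not yield a single closed tree")
    return stack[0], terminals


actions = [("NT", "S"), ("NT", "NP"), ("GEN", "the"), ("GEN", "dog"), ("REDUCE",),
           ("NT", "VP"), ("GEN", "barks"), ("REDUCE",), ("REDUCE",)]
tree, words = replay(actions)
print(tree)   # (S (NP the dog) (VP barks))
print(words)  # ['the', 'dog', 'barks']
```

In the neural model the stack would hold vectors rather than strings, and the string concatenation at REDUCE would be replaced by the learned composition function.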
2. Neural Composition Function
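The core of CAGs' attribute propagation is a bidirectional LSTM-based composition function applied at every REDUCE operation. When a REDUCE closes a nonterminal spanning stack positions $i$ through $j$, the child embeddings $e_i, \dots, e_j$ are passed through a BiLSTM:
- Forward LSTM: $\overrightarrow{h}_k = \mathrm{LSTM}_{\mathrm{fwd}}(e_k, \overrightarrow{h}_{k-1})$ for $k = i$ to $j$
- Backward LSTM: $\overleftarrow{h}_k = \mathrm{LSTM}_{\mathrm{bwd}}(e_k, \overleftarrow{h}_{k+1})$ for $k = j$ to $i$
The output subtree vector is then computed as:
$$e_{\mathrm{sub}} = W_c\,[\overrightarrow{h}_j ; \overleftarrow{h}_i] + b_c$$
where $W_c$ and $b_c$ are learned projection parameters.
This function can be succinctly denoted as $e_{\mathrm{sub}} = \mathrm{Comp}(e_i, \dots, e_j)$.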
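The shape of this computation can be sketched in plain Python. To keep the example short, a simple tanh recurrent cell is swapped in for the LSTM cell, so this shows only the bidirectional read-and-project pattern, not the actual architecture. The names `compose`, `init_params`, and the parameter keys are hypothetical.

```python
import math
import random
from typing import Dict, List

Vector = List[float]
Matrix = List[List[float]]


def matvec(m: Matrix, v: Vector) -> Vector:
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def rnn_step(w_in: Matrix, w_rec: Matrix, x: Vector, h: Vector) -> Vector:
    """One tanh recurrent step (a simplified stand-in for an LSTM cell)."""
    return [math.tanh(a + b) for a, b in zip(matvec(w_in, x), matvec(w_rec, h))]


def init_params(d: int, seed: int = 0) -> Dict[str, list]:
    """Random parameters for illustration only; a real model would learn them."""
    rng = random.Random(seed)

    def mat(rows: int, cols: int) -> Matrix:
        return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

    return {"w_in_f": mat(d, d), "w_rec_f": mat(d, d),
            "w_in_b": mat(d, d), "w_rec_b": mat(d, d),
            "w_out": mat(d, 2 * d), "b_out": [0.0] * d}


def compose(children: List[Vector], params: Dict[str, list]) -> Vector:
    """Read the children left-to-right and right-to-left, then project."""
    d = len(children[0])
    h_fwd = [0.0] * d
    for e in children:                      # forward pass over the children
        h_fwd = rnn_step(params["w_in_f"], params["w_rec_f"], e, h_fwd)
    h_bwd = [0.0] * d
    for e in reversed(children):            # backward pass over the children
        h_bwd = rnn_step(params["w_in_b"], params["w_rec_b"], e, h_bwd)
    projected = matvec(params["w_out"], h_fwd + h_bwd)   # concatenate, then affine map
    return [p + b for p, b in zip(projected, params["b_out"])]


params = init_params(4)
the, dog = [1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]
np_vec = compose([the, dog], params)
print(len(np_vec))
```

Whatever the number of children, the result is a single fixed-size vector, which is what lets a closed constituent occupy one stack slot.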
3. Self-Attention Over Partial Parses
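At each action, the stack contains embeddings $e_1, \dots, e_n$, stacked row-wise into a matrix $E \in \mathbb{R}^{n \times d}$. Self-attention is computed according to the standard scaled dot-product paradigm:
- Queries, keys, values: $Q = E W^Q$, $K = E W^K$, $V = E W^V$ for learned $W^Q, W^K, W^V$
- Attention matrix: $A = \mathrm{softmax}\left(Q K^\top / \sqrt{d_k}\right)$
- Attended outputs: $O = A V$
A summary vector $h$ is extracted from $O$ (e.g., by last-row selection or pooling). The multi-head variant repeats this process per head and concatenates, and is notated as:
$$h = \mathrm{SelfAttn}(e_1, \dots, e_n)$$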
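A single-head version of the scaled dot-product step over a stack can be written in plain Python as follows. It is a sketch under stated assumptions: the function names are hypothetical, the summary is taken by last-row selection (one of the options mentioned above), and the identity projection matrices in the usage lines are placeholders for learned weights.

```python
import math
from typing import List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row: Vector) -> Vector:
    m = max(row)                                  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(stack: Matrix, w_q: Matrix, w_k: Matrix, w_v: Matrix) -> Tuple[Matrix, Vector]:
    """Return (attention matrix, summary vector) for one head."""
    q, k, v = matmul(stack, w_q), matmul(stack, w_k), matmul(stack, w_v)
    d_k = len(k[0])
    k_t = [list(col) for col in zip(*k)]          # transpose of K
    scores = matmul(q, k_t)
    attn = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    out = matmul(attn, v)
    return attn, out[-1]                          # last row = view from the top of the stack


identity = [[1.0, 0.0], [0.0, 1.0]]               # placeholder for learned projections
stack = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]      # three stack embeddings, bottom to top
attn, summary = self_attention(stack, identity, identity, identity)
print([round(w, 3) for w in attn[-1]])
print([round(x, 3) for x in summary])
```

A multi-head variant would call `self_attention` once per head with separate projections and concatenate the resulting summaries.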
4. Parsing, Scoring, and Training Objective
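The complete parsing and scoring workflow is as follows:
- Initialize an empty stack and a running log-probability of zero.
- At each timestep, compute the self-attention summary over the current stack embeddings and use it to score the possible next actions.
- Apply the chosen action: NT(X) pushes an open-nonterminal embedding, GEN(x) pushes a terminal embedding, and REDUCE pops the children of the most recent open nonterminal and pushes the single subtree vector produced by the composition function.
- Add the log-probability of the chosen action to the running total, and repeat until the tree is complete.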
The model is trained by maximizing the log-likelihood of the gold action sequence:
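$$\mathcal{L}(\theta) = \sum_{t=1}^{T} \log p_\theta(a_t \mid a_{<t})$$
where $a_1, \dots, a_T$ is the gold action sequence and $\theta$ denotes the model parameters. Only the negative log-likelihood loss is used; no auxiliary objectives are introduced.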
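The objective reduces to summing the log-probabilities of the gold actions. The sketch below assumes the model exposes one vector of unnormalized scores (logits) per timestep over a fixed action inventory. That interface, and the names `log_softmax` and `sequence_nll`, are hypothetical.

```python
import math
from typing import List, Sequence


def log_softmax(logits: Sequence[float]) -> List[float]:
    m = max(logits)                                        # stabilize the exponentials
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]


def sequence_nll(step_logits: Sequence[Sequence[float]], gold: Sequence[int]) -> float:
    """Negative log-likelihood of the gold action indices, summed over timesteps."""
    if len(step_logits) != len(gold):
        raise ValueError("need exactly one gold action per timestep")
    return -sum(log_softmax(logits)[g] for logits, g in zip(step_logits, gold))


# Two timesteps, three candidate actions each, with uniform scores:
# every gold action has probability 1/3, so the loss is 2 * log(3).
nll = sequence_nll([[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]], [0, 2])
print(nll)
```

Minimizing this quantity over the training set is equivalent to maximizing the log-likelihood stated above.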
5. Experimental Setup
Yoshida & Oseki (2022) trained CAGs on broad-coverage data and compared them against a range of baseline architectures, both with and without explicit syntax:
- Training corpus: BLLIP-lg (Brown 1987–89), re-parsed with the Kitaev & Klein (2018) constituency parser; approximately 1.8M sentences (42M tokens).
- Model classes (all with matched parameter counts):
  - LSTM
  - ActionLSTM
  - RNNG
  - Transformer
  - PLM
  - PLM-mask
  - CAG
Key architectural hyperparameters:
- LSTM: 2 layers, 301 hidden units
- Transformer: 3 layers, 272 hidden, 4 heads
- CAG: 3 layers, 256 hidden, 4 heads
- Dropout: 0.1; learning rate 1; batch size 256; 15 epochs
Evaluation metrics: Syntactic generalization on the SyntaxGym benchmark, using six test circuits:
- Agreement
- Licensing
- Garden-Path Effects
- Gross Syntactic State
- Center Embedding
- Long-Distance Dependencies
Performance is measured as the percentage of test suites passed, where each test compares the probabilities a model assigns to minimal-pair grammatical and ungrammatical stimuli. Syntactic models are decoded using word-synchronous beam search (action beam 100, word beam 10, fast-track 5).
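The minimal-pair comparison can be illustrated with a deliberately simplified scorer. It assumes each item supplies two surprisal values (negative log-probabilities) at the critical region, one for the grammatical variant and one for the ungrammatical variant, and counts an item as passed when the ungrammatical variant is more surprising. Actual SyntaxGym success criteria can combine several such inequalities, and the function name and data layout here are hypothetical.

```python
from typing import List, Tuple


def pass_rate(items: List[Tuple[float, float]]) -> float:
    """Percentage of items where the ungrammatical variant has higher surprisal.

    Each item is (surprisal_grammatical, surprisal_ungrammatical).
    """
    if not items:
        raise ValueError("need at least one item")
    passed = sum(1 for good, bad in items if bad > good)
    return 100.0 * passed / len(items)


# Made-up surprisal values, for illustration only.
toy_items = [(2.0, 5.0), (3.0, 2.5), (1.0, 4.0), (2.0, 2.0)]
print(pass_rate(toy_items))  # 50.0
```

Ties count as failures in this sketch, since a model that cannot separate the two variants has not shown the expected preference.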
6. Empirical Results and Findings
Overall SyntaxGym Accuracy
| Model | Accuracy (%) |
|---|---|
| LSTM | 56.6 ± 3.3 |
| Transformer | 48.1 ± 1.5 |
| ActionLSTM | 72.5 ± 1.8 |
| PLM | 75.4 ± 0.2 |
| RNNG | 81.1 ± 2.8 |
| PLM-mask | 69.6 ± 0.9 |
| CAG | 83.8 ± 1.4 |
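Relative to the syntax-absent baselines, adding explicit syntax raises accuracy by 15.9 points in the LSTM family (LSTM to ActionLSTM) and by 27.3 points in the Transformer family (Transformer to PLM). Adding composition gives a further 8.6 points (ActionLSTM to RNNG) and 8.4 points (PLM to CAG). Among syntax-aware models, moving from an LSTM to self-attention adds 2.9 points without composition (ActionLSTM to PLM) and 2.7 points with it (RNNG to CAG).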
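The pairwise differences can be recomputed directly from the overall accuracy table. The grouping of models into comparison pairs below is an interpretation of the experimental design (each pair differs in one factor) rather than something stated in the table itself.

```python
# Mean overall SyntaxGym accuracy (%), copied from the table above.
accuracy = {
    "LSTM": 56.6, "Transformer": 48.1, "ActionLSTM": 72.5, "PLM": 75.4,
    "RNNG": 81.1, "PLM-mask": 69.6, "CAG": 83.8,
}

# Assumed pairings: (model without the factor, model with the factor).
comparisons = {
    "explicit syntax": [("LSTM", "ActionLSTM"), ("Transformer", "PLM")],
    "composition": [("ActionLSTM", "RNNG"), ("PLM", "CAG")],
    "self-attention": [("ActionLSTM", "PLM"), ("RNNG", "CAG")],
}

gain_table = {
    factor: [round(accuracy[after] - accuracy[before], 1) for before, after in pairs]
    for factor, pairs in comparisons.items()
}

for factor, gains in gain_table.items():
    print(f"{factor}: {gains}")
# explicit syntax: [15.9, 27.3]
# composition: [8.6, 8.4]
# self-attention: [2.9, 2.7]
```

These are differences between reported means, so they do not account for the run-to-run variation given in the table's ± columns.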
Per-Circuit Accuracies
| Circuit | LSTM | ActionLSTM | RNNG | Transformer | PLM | PLM-mask | CAG |
|---|---|---|---|---|---|---|---|
| Agreement | 43.9 | 81.9 | 77.8 | 21.1 | 81.3 | 80.7 | 79.5 |
| Licensing | 26.9 | 60.0 | 83.0 | 3.7 | 61.1 | 42.7 | 87.0 |
| Garden-Path Effects | 69.6 | 80.1 | 83.1 | 67.9 | 82.2 | 82.0 | 84.6 |
| Gross Syntactic State | 97.8 | 90.6 | 99.3 | 89.9 | 96.4 | 91.3 | 99.6 |
| Center Embedding | 70.2 | 78.0 | 73.2 | 72.6 | 81.0 | 77.4 | 79.2 |
| Long-Distance Dependencies | 64.7 | 68.4 | 71.5 | 71.9 | 73.9 | 76.9 | 73.9 |
7. Interpretation and Theoretical Significance
Ablation and error analyses indicate that the neural composition function in CAGs significantly enhances performance on linguistic phenomena that rely on syntactic feature percolation, such as Licensing, Garden-Path Effects, and Gross Syntactic State. For instance, reflexive dependencies that require the number of the subject NP are handled successfully because the entire NP is embedded into a single vector that encodes this syntactic information.
However, composition can hinder tasks requiring semantic or lexical plausibility (notably Center Embedding and some Long-Distance Dependency tests), as embedding-based summarization may obscure head-noun semantics necessary for animacy-based plausibility computations.
The composition function in CAGs thus specializes in propagating syntactic attributes (e.g., number, hierarchical structure), but does not robustly encode lower-level semantic or lexical features. Stack self-attention complements composition by enabling both local and non-local structural integration. Together they yield the highest overall SyntaxGym accuracy among the evaluated models (83.8%), although CAG is not the top scorer on every circuit: Agreement, Center Embedding, and Long-Distance Dependencies are each led by another model. This overall result is taken as a closer approximation of human-like syntactic competence than any alternative in the experimental comparison (Yoshida & Oseki, 2022).