Compositional Attribute Grammars (CAG)

Updated 27 April 2026

Compositional Attribute Grammars are transition-based syntactic models that jointly generate terminal strings and their corresponding constituency trees.
They employ a bidirectional LSTM-based composition function to synthesize subtree embeddings and propagate syntactic attributes.
Empirical results show CAGs achieving 83.8% overall accuracy on the SyntaxGym benchmark, the highest syntactic generalization score among the compared models.

A Composition-Attention Grammar (CAG) is a transition-based, top-down syntactic language model that jointly generates a string of terminals and its corresponding constituency tree. CAGs are a specific form of attribute grammar in which each subtree is associated with a synthesized attribute vector computed by a neural composition function, and the prediction of subsequent actions is conditioned on a self-attention mechanism over partially constructed subtrees. CAGs combine the strengths of structured composition and non-local attention to induce human-like syntactic generalization in language modeling (Yoshida & Oseki, 2022).

1. Formal Structure and Action Dynamics

CAGs maintain an explicit stack of embeddings during parse generation. The vocabulary of allowed actions consists of:

NT(X): Open a new nonterminal labeled X.
GEN(x): Generate a terminal symbol x.
REDUCE: Close the most recent open nonterminal.

The model defines a joint probability over the generated terminal string $X$ and its constituency tree $Y$, parameterized as

$p(X, Y) = \prod_{t=1}^{n} p(a_t \mid a_{<t})$

where $a_t$ is chosen from $\{\mathrm{NT}(\cdot), \mathrm{GEN}(\cdot), \mathrm{REDUCE}\}$ and $n$ is the number of actions. At each timestep, the model scores all possible next actions from a self-attention summary of the current stack vectors. The only recurrence is through the stack states and their learned embeddings.
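
The action dynamics can be illustrated with a purely symbolic sketch that replays an NT/GEN/REDUCE sequence and recovers the terminal string and bracketed tree. The function name `replay`, the tuple encoding of actions, and the example sentence are illustrative assumptions; a real CAG keeps embedding vectors on the stack rather than strings.

```python
from typing import List, Tuple

Action = Tuple[str, str]  # (kind, argument); REDUCE carries an empty argument


def replay(actions: List[Action]) -> Tuple[List[str], str]:
    """Replay NT/GEN/REDUCE actions; return (terminals, bracketed tree)."""
    stack: List[Tuple[str, bool]] = []  # (text, is_open_nonterminal)
    terminals: List[str] = []
    for kind, arg in actions:
        if kind == "NT":
            stack.append((arg, True))       # open nonterminal marker
        elif kind == "GEN":
            stack.append((arg, False))      # completed terminal
            terminals.append(arg)
        elif kind == "REDUCE":
            children: List[str] = []
            # Pop completed items down to the most recent open nonterminal.
            while stack and not stack[-1][1]:
                children.append(stack.pop()[0])
            if not stack or not children:
                raise ValueError("REDUCE without an open nonterminal or children")
            label, _ = stack.pop()
            subtree = "(" + label + " " + " ".join(reversed(children)) + ")"
            stack.append((subtree, False))  # closed subtree acts as one item
        else:
            raise ValueError(f"unknown action kind: {kind}")
    if len(stack) != 1 or stack[0][1]:
        raise ValueError("action sequence does not form a single complete tree")
    return terminals, stack[0][0]


# Hypothetical example sentence, not taken from the source.
actions = [("NT", "S"), ("NT", "NP"), ("GEN", "the"), ("GEN", "dog"),
           ("REDUCE", ""), ("NT", "VP"), ("GEN", "barks"), ("REDUCE", ""),
           ("REDUCE", "")]
words, tree = replay(actions)
print(words)  # ['the', 'dog', 'barks']
print(tree)   # (S (NP the dog) (VP barks))
```

Each REDUCE collapses several stack items into one, which is the point at which a CAG would replace the children with a single composed vector.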

2. Neural Composition Function

The core of CAGs' attribute propagation is a bidirectional LSTM-based composition function applied at every REDUCE operation. When a REDUCE closes a nonterminal spanning stack positions $l$ through $m$, the child embeddings $[\mathbf{e}_l, \ldots, \mathbf{e}_m]$ are passed through a BiLSTM:

Forward LSTM: $\overrightarrow{h}_i = \mathrm{LSTM_{fwd}}(\mathbf{e}_i, \overrightarrow{h}_{i-1})$ for $i=l$ to $m$
Backward LSTM: $\overleftarrow{h}_i = \mathrm{LSTM_{bwd}}(\mathbf{e}_i, \overleftarrow{h}_{i+1})$ for $i=m$ to $l$

The output subtree vector is then computed as:

$\mathbf{e}_s = W_c\,[\overrightarrow{h}_m; \overleftarrow{h}_l] + b_c$

where $W_c$ and $b_c$ are learned projection parameters.

This function can be succinctly denoted as $\mathbf{e}_s = \mathrm{Composition}([\mathbf{e}_l, \ldots, \mathbf{e}_m])$.
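
A minimal pure-Python sketch of such a composition function follows, under several assumptions: the weights are random placeholders rather than trained values, the projection is a plain affine map over the concatenated final forward and backward states, and only the child embeddings are fed in. The names `LSTMCell`, `run`, `compose`, `W_c`, and `b_c` are illustrative and do not come from the authors' implementation.

```python
import math
import random
from typing import List, Tuple

Vec = List[float]


def matvec(W: List[Vec], x: Vec) -> Vec:
    """Matrix-vector product for a row-major matrix."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


class LSTMCell:
    """Minimal LSTM cell with random placeholder weights (not trained)."""

    def __init__(self, dim: int, rng: random.Random) -> None:
        self.dim = dim
        # One weight matrix over [x; h] and one bias per gate (i, f, o, g).
        self.W = [[[rng.uniform(-0.1, 0.1) for _ in range(2 * dim)]
                   for _ in range(dim)] for _ in range(4)]
        self.b = [[0.0] * dim for _ in range(4)]

    def step(self, x: Vec, h: Vec, c: Vec) -> Tuple[Vec, Vec]:
        xh = x + h  # list concatenation: [x; h]
        pre = [[p + q for p, q in zip(matvec(W, xh), b)]
               for W, b in zip(self.W, self.b)]
        i = [sigmoid(z) for z in pre[0]]
        f = [sigmoid(z) for z in pre[1]]
        o = [sigmoid(z) for z in pre[2]]
        g = [math.tanh(z) for z in pre[3]]
        c_new = [fj * cj + ij * gj for fj, cj, ij, gj in zip(f, c, i, g)]
        h_new = [oj * math.tanh(cj) for oj, cj in zip(o, c_new)]
        return h_new, c_new


def run(cell: LSTMCell, seq: List[Vec]) -> Vec:
    """Return the final hidden state after reading seq in the given order."""
    h, c = [0.0] * cell.dim, [0.0] * cell.dim
    for x in seq:
        h, c = cell.step(x, h, c)
    return h


def compose(children: List[Vec], fwd: LSTMCell, bwd: LSTMCell,
            W_c: List[Vec], b_c: Vec) -> Vec:
    """Project [forward final state; backward final state] to one vector."""
    h_fwd = run(fwd, children)                  # reads e_l ... e_m
    h_bwd = run(bwd, list(reversed(children)))  # reads e_m ... e_l
    return [p + q for p, q in zip(matvec(W_c, h_fwd + h_bwd), b_c)]


rng = random.Random(0)
dim = 4
fwd, bwd = LSTMCell(dim, rng), LSTMCell(dim, rng)
W_c = [[rng.uniform(-0.1, 0.1) for _ in range(2 * dim)] for _ in range(dim)]
b_c = [0.0] * dim
children = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(3)]
subtree = compose(children, fwd, bwd, W_c, b_c)
print(len(subtree))  # 4
```

Whatever the number of children, the result has the same dimensionality as a single stack embedding, so a closed constituent occupies one stack slot.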

3. Self-Attention Over Partial Parses

At each action, the stack contains embeddings $E = [\mathbf{e}_1, \ldots, \mathbf{e}_k]$. Self-attention is computed according to the standard scaled dot-product paradigm:

Queries, keys, values: $Q = E W_Q,\ K = E W_K,\ V = E W_V$ for learned $W_Q, W_K, W_V$
Attention matrix: $A = \mathrm{softmax}\left(Q K^\top / \sqrt{d_k}\right)$, where $d_k$ is the key dimension
Attended outputs: $H = A V$

A summary vector $\mathbf{h}_t$ is extracted from $H$ (e.g., by last-row selection or pooling). The multi-head variant repeats this process per head and concatenates, and is notated as:

$\mathbf{h}_t = \mathrm{MultiHead}(E)$
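
The single-head case can be sketched in a few lines of dependency-free Python. This is an illustration of standard scaled dot-product attention over a stack, not the authors' code: the helper names (`matmul`, `self_attention`, `stack_summary`), the identity projection matrices, and the toy embeddings are assumptions, and the summary uses last-row selection, one of the options mentioned above.

```python
import math
from typing import List

Vec = List[float]
Mat = List[Vec]


def matmul(A: Mat, B: Mat) -> Mat:
    """Plain matrix product of two row-major matrices."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]


def softmax(row: Vec) -> Vec:
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(E: Mat, W_q: Mat, W_k: Mat, W_v: Mat) -> Mat:
    """Scaled dot-product self-attention over the rows of E."""
    Q, K, V = matmul(E, W_q), matmul(E, W_k), matmul(E, W_v)
    d_k = len(K[0])
    K_T = [list(col) for col in zip(*K)]
    scores = matmul(Q, K_T)
    A = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(A, V)


def stack_summary(E: Mat, W_q: Mat, W_k: Mat, W_v: Mat) -> Vec:
    """Summary by last-row selection: the output at the top of the stack."""
    return self_attention(E, W_q, W_k, W_v)[-1]


identity = [[1.0, 0.0], [0.0, 1.0]]           # placeholder projections
stack = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # three toy stack embeddings
summary = stack_summary(stack, identity, identity, identity)
print(summary)
```

Because each attention row sums to one, the summary is a convex combination of the value vectors, so every stack element can contribute regardless of how deep it sits.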

4. Parsing, Scoring, and Training Objective

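The complete parsing and scoring workflow, following the components described in the preceding sections, is as follows:

Initialize an empty stack and set the running log-probability to zero.
At each step $t$, apply self-attention to the current stack embeddings to obtain a summary vector, and score all candidate actions from it.
Add $\log p(a_t \mid a_{<t})$ for the selected action to the running total.
Apply the action: NT(X) pushes an embedding for the open nonterminal X; GEN(x) pushes the embedding of terminal x; REDUCE pops the children of the most recent open nonterminal, composes them with the BiLSTM, and pushes the resulting subtree vector.
Repeat until the tree is complete; the accumulated sum is $\log p(X, Y)$.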
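
A compact sketch of this scoring loop is given below. It assumes a symbolic stack (strings instead of embeddings) and a toy `uniform_scorer` that stands in for the attention-plus-softmax action scorer; `apply_action`, `score`, `CANDIDATES`, and the example actions are hypothetical names and data introduced only for illustration.

```python
import math
from typing import Callable, Dict, List, Tuple

Action = Tuple[str, str]  # (kind, argument); REDUCE uses ""
Item = Tuple[str, bool]   # (symbol, is_open_nonterminal)
Scorer = Callable[[List[Item]], Dict[Action, float]]


def apply_action(stack: List[Item], action: Action) -> None:
    """Update the stack in place for one NT / GEN / REDUCE action."""
    kind, arg = action
    if kind == "NT":
        stack.append((arg, True))
    elif kind == "GEN":
        stack.append((arg, False))
    else:  # REDUCE: pop children, then the open nonterminal, push the subtree
        children: List[str] = []
        while not stack[-1][1]:
            children.append(stack.pop()[0])
        label = stack.pop()[0]
        # A real CAG would push a composed vector here; we push a bracket string.
        bracket = "(" + label + " " + " ".join(reversed(children)) + ")"
        stack.append((bracket, False))


def score(actions: List[Action], scorer: Scorer) -> float:
    """Return the sum over t of log p(a_t | a_<t) under the given scorer."""
    stack: List[Item] = []
    total = 0.0
    for action in actions:
        probs = scorer(stack)  # stands in for self-attention + softmax
        total += math.log(probs[action])
        apply_action(stack, action)
    return total


CANDIDATES: List[Action] = [("NT", "S"), ("NT", "NP"), ("GEN", "dogs"),
                            ("GEN", "bark"), ("REDUCE", "")]


def uniform_scorer(stack: List[Item]) -> Dict[Action, float]:
    """Toy scorer: ignores the stack and spreads mass evenly."""
    return {a: 1.0 / len(CANDIDATES) for a in CANDIDATES}


gold = [("NT", "S"), ("NT", "NP"), ("GEN", "dogs"), ("REDUCE", ""),
        ("GEN", "bark"), ("REDUCE", "")]
log_p = score(gold, uniform_scorer)
print(round(log_p, 4))
```

Swapping `uniform_scorer` for a function that reads the stack is the only change needed to make the score depend on the partial parse, which is where the attention summary would enter.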

The model is trained by maximizing the log-likelihood of the gold action sequence:

$\mathcal{L} = \sum_{t=1}^{n} \log p(a_t \mid a_{<t})$

where $a_{<t} = (a_1, \ldots, a_{t-1})$ is the action history. Only the negative log-likelihood loss is used; no auxiliary objectives are introduced.
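
As a small worked illustration with made-up numbers (not results from the paper), suppose a three-action gold sequence receives step probabilities $0.5$, $0.25$, and $0.5$. The log-likelihood is then

$\log 0.5 + \log 0.25 + \log 0.5 = -4 \log 2 \approx -2.77$

so the negative log-likelihood that training minimizes is about $2.77$ nats, and the joint probability of the sequence is $0.5 \times 0.25 \times 0.5 = 0.0625$.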

5. Experimental Setup

CAGs were evaluated by Yoshida & Oseki (2022) on broad-coverage data against a range of baseline architectures, both with and without explicit syntax:

Training corpus: BLLIP-lg (Brown 1987–89), re-parsed with the Kitaev & Klein (2018) constituency parser; $\approx 1.8$M sentences ($\approx 42$M tokens).
Model classes (all …M params):
- LSTM
- ActionLSTM
- RNNG
- Transformer
- PLM
- PLM-mask
- CAG

Key architectural hyperparameters:

LSTM: 2 layers, 301 hidden units
Transformer: 3 layers, 272 hidden, 4 heads
CAG: 3 layers, 256 hidden, 4 heads
Dropout: 0.1; learning rate …; batch size 256; 15 epochs

Evaluation metrics: Syntactic generalization on the SyntaxGym benchmark, using six test circuits:

Agreement
Licensing
Garden-Path Effects
Gross Syntactic State
Center Embedding
Long-Distance Dependencies

Performance is measured as the percentage of test suites passed, comparing model probabilities for minimal-pair grammatical/ungrammatical stimuli, with syntactic models decoded using word-synchronous beam search (action beam 100, word beam 10, fast-track 5).
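
The minimal-pair comparison can be sketched as follows. This is a simplification: real SyntaxGym suites may specify inequalities over several conditions and regions, whereas the sketch assumes one grammatical/ungrammatical pair per item and counts an item as passed when the ungrammatical variant is more surprising. The function `pass_rate` and the surprisal values are hypothetical.

```python
from typing import List, Tuple


def pass_rate(pairs: List[Tuple[float, float]]) -> float:
    """Percentage of items where the grammatical variant has lower surprisal.

    Each pair is (surprisal_grammatical, surprisal_ungrammatical), measured
    at the critical region of the sentence.
    """
    passed = sum(1 for good, bad in pairs if good < bad)
    return 100.0 * passed / len(pairs)


# Made-up surprisal values for four items; the second item fails.
pairs = [(3.1, 7.4), (5.0, 4.2), (2.2, 6.0), (4.8, 9.1)]
print(pass_rate(pairs))  # 75.0
```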

6. Empirical Results and Findings

Overall SyntaxGym Accuracy

Model	Accuracy (%)
LSTM	56.6 ± 3.3
Transformer	48.1 ± 1.5
ActionLSTM	72.5 ± 1.8
PLM	75.4 ± 0.2
RNNG	81.1 ± 2.8
PLM-mask	69.6 ± 0.9
CAG	83.8 ± 1.4

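Taking differences of the overall accuracies above: adding explicit syntax yields increases of $15.9$–$27.3$ points over syntax-absent baselines (ActionLSTM vs. LSTM, PLM vs. Transformer); composition gives an $8.4$–$8.6$ point boost (CAG vs. PLM, RNNG vs. ActionLSTM); adding self-attention in syntax-aware models yields an additional $2.7$–$2.9$ points (CAG vs. RNNG, PLM vs. ActionLSTM).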
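
The size of each effect can be recomputed directly from the overall accuracy table. The pairing of models used to isolate each factor (syntax, composition, attention) is an assumption made for this sketch, and the helper `gain` is an illustrative name.

```python
# Mean overall SyntaxGym accuracies copied from the table above.
overall = {
    "LSTM": 56.6, "Transformer": 48.1, "ActionLSTM": 72.5, "PLM": 75.4,
    "RNNG": 81.1, "PLM-mask": 69.6, "CAG": 83.8,
}


def gain(better: str, base: str) -> float:
    """Difference in overall accuracy, rounded to one decimal place."""
    return round(overall[better] - overall[base], 1)


syntax = [gain("ActionLSTM", "LSTM"), gain("PLM", "Transformer")]
composition = [gain("RNNG", "ActionLSTM"), gain("CAG", "PLM")]
attention = [gain("PLM", "ActionLSTM"), gain("CAG", "RNNG")]

print(syntax)       # [15.9, 27.3]
print(composition)  # [8.6, 8.4]
print(attention)    # [2.9, 2.7]
```

These are differences of means only; the reported standard deviations (up to 3.3 points) are not propagated here.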

Per-Circuit Accuracies

Circuit	LSTM	ActionLSTM	RNNG	Transformer	PLM	PLM-mask	CAG
Agreement	43.9	81.9	77.8	21.1	81.3	80.7	79.5
Licensing	26.9	60.0	83.0	3.7	61.1	42.7	87.0
Garden-Path Effects	69.6	80.1	83.1	67.9	82.2	82.0	84.6
Gross Syntactic State	97.8	90.6	99.3	89.9	96.4	91.3	99.6
Center Embedding	70.2	78.0	73.2	72.6	81.0	77.4	79.2
Long-Distance Dependencies	64.7	68.4	71.5	71.9	73.9	76.9	73.9
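
To read the per-circuit table at a glance, the following sketch picks the best-scoring model for each circuit. The data are copied from the table above; the variable names are illustrative.

```python
models = ["LSTM", "ActionLSTM", "RNNG", "Transformer", "PLM", "PLM-mask", "CAG"]
circuits = {
    "Agreement": [43.9, 81.9, 77.8, 21.1, 81.3, 80.7, 79.5],
    "Licensing": [26.9, 60.0, 83.0, 3.7, 61.1, 42.7, 87.0],
    "Garden-Path Effects": [69.6, 80.1, 83.1, 67.9, 82.2, 82.0, 84.6],
    "Gross Syntactic State": [97.8, 90.6, 99.3, 89.9, 96.4, 91.3, 99.6],
    "Center Embedding": [70.2, 78.0, 73.2, 72.6, 81.0, 77.4, 79.2],
    "Long-Distance Dependencies": [64.7, 68.4, 71.5, 71.9, 73.9, 76.9, 73.9],
}

# Map each circuit to the model with the highest accuracy on it.
best = {
    name: models[max(range(len(models)), key=lambda i: scores[i])]
    for name, scores in circuits.items()
}
for name, model in best.items():
    print(f"{name}: {model}")

cag_wins = sum(1 for model in best.values() if model == "CAG")
print(cag_wins)  # 3
```

CAG leads on Licensing, Garden-Path Effects, and Gross Syntactic State, while ActionLSTM, PLM, and PLM-mask lead on Agreement, Center Embedding, and Long-Distance Dependencies respectively.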

7. Interpretation and Theoretical Significance

Ablation and error analyses indicate that the neural composition function in CAGs enhances performance on linguistic phenomena that rely on syntactic feature percolation, such as Licensing, Garden-Path Effects, and Gross Syntactic State. For instance, reflexive dependencies, such as those requiring agreement with the number of the subject NP, are handled successfully because the entire NP is embedded into a single vector that encodes this syntactic information.

However, composition can hinder tasks requiring semantic or lexical plausibility (notably Center Embedding and some Long-Distance Dependency tests), as embedding-based summarization may obscure head-noun semantics necessary for animacy-based plausibility computations.

The composition function in CAGs thus specializes in propagating syntactic attributes (e.g., number, hierarchical structure), but does not robustly encode lower-level semantic or lexical features. The introduction of stack self-attention complements composition by enabling both local and non-local structural integration, yielding the highest syntactic generalization among evaluated models and approximating human-like syntactic competence more closely than all alternatives in the experimental comparison (Yoshida & Oseki, 2022).

References

Yoshida & Oseki (2022). Composition, Attention, or Both?
