- The paper presents GiLT, a model that augments Transformer language models with dynamically constructed dependency graphs to enhance syntactic generalization.
- The methodology integrates dependency scoring and graph population with attention modulation, achieving competitive perplexity and improved downstream task performance.
- Empirical results demonstrate enhanced syntactic generalization, superior performance on GLUE tasks, and improved inference efficiency relative to baseline models.
Motivation and Context
The integration of linguistic structure into neural language models is a longstanding topic in NLP, aiming to enhance syntactic generalization, interpretability, and downstream task performance. Traditional Transformer LMs, despite their empirical success, remain largely disconnected from explicit linguistic structure such as syntactic or semantic parses. Prior efforts primarily focus on synchronizing constituency trees with language modeling, often requiring additional tokens in the sequence to encode structural information. These approaches increase sequence length, latency, and computational cost, and they complicate the adaptation of pretrained LMs. GiLT (Graph-Infused Layers Transformer) addresses these limitations by leveraging dependency graphs (a broader formalism encompassing both syntactic trees and semantic relations) to modulate Transformer self-attention without modifying the token sequence space.
Methodology
GiLT incrementally builds dependency graphs alongside token generation. Structural information is injected directly into the attention computation via feature tapes extracted from the partial dependency graph constructed thus far. The core components include:
- Dependency Scoring: Each generated word undergoes biaffine scoring to determine potential dependencies with preceding words, leveraging multi-layer representations and positional/graph-based embeddings.
- Graph Population: A two-step process predicts the number of dependencies and selects the highest-scoring candidates, ensuring computational tractability within beam search decoding frameworks.
- Feature Extraction: Degree, distance, and depth for each word are computed from the graph, weighted by directionality, and mapped to embeddings for attention modulation.
- Attention Fusion: The feature tape representation is fused into the attention module, augmenting key vectors and relative positional encodings per layer following Transformer-XL conventions.
- Training & Inference: GiLT is trained on syntactically annotated corpora using teacher forcing, with dependency and token prediction losses. During inference, beam search of dependency graphs approximates marginalized probabilities, obviating the need for output space augmentation with parsing actions.
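The scoring and population steps in the list above can be illustrated with a small sketch. It is a minimal, dependency-free illustration under stated assumptions, not GiLT's implementation: the function names `biaffine_scores` and `populate`, the parameter names (`U`, `w_prev`, `w_new`, `b`), the use of a single vector per word, and the argmax over count logits are all hypothetical choices made for this example.

```python
def biaffine_scores(prev_reps, new_rep, U, w_prev, w_new, b):
    """Score a potential dependency between the new word and each preceding word.

    Assumed form (illustrative): score_i = h_i^T U d + w_prev^T h_i + w_new^T d + b,
    where h_i is a preceding word's vector and d is the new word's vector.
    """
    dim = len(new_rep)
    # U @ d is shared by every candidate, so compute it once.
    U_new = [sum(U[r][c] * new_rep[c] for c in range(dim)) for r in range(dim)]
    new_term = sum(w * x for w, x in zip(w_new, new_rep)) + b
    scores = []
    for h in prev_reps:
        bilinear = sum(hr * ur for hr, ur in zip(h, U_new))
        linear = sum(w * x for w, x in zip(w_prev, h))
        scores.append(bilinear + linear + new_term)
    return scores


def populate(scores, count_logits):
    """Two-step selection (assumed): predict a count, then keep the top-scoring candidates.

    count_logits[k] is the (hypothetical) logit for "the new word has k dependencies";
    it must be non-empty. Returns the selected preceding-word indices in ascending order.
    """
    # Step 1: predicted number of dependencies, capped by the number of candidates.
    k = max(range(len(count_logits)), key=count_logits.__getitem__)
    k = min(k, len(scores))
    # Step 2: indices of the k highest-scoring candidates.
    ranked = sorted(range(len(scores)), key=lambda i: -scores[i])
    return sorted(ranked[:k])
```

Fixing the count before selecting arcs keeps each step to a single top-k choice instead of a search over all subsets of preceding words, which is one plausible reading of why the two-step process stays tractable inside beam search.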
Critically, GiLT is compatible with any pretrained Transformer backbone, allowing seamless finetuning on dependency-annotated datasets for downstream task adaptation.
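The feature extraction component listed above can be sketched as follows. The summary does not give exact definitions, so the ones used here are assumptions: degree is a direction-weighted arc count, distance is the undirected hop count from the word currently attending, and depth is the directed hop count from the nearest word with no incoming arc. The function name `graph_features` and its parameters are hypothetical.

```python
from collections import deque


def graph_features(n, arcs, query, w_in=1.0, w_out=1.0):
    """Return (degree, distance, depth) lists for a partial dependency graph.

    n:     number of words generated so far
    arcs:  list of (head, dependent) index pairs
    query: index of the word whose attention is being computed
    w_in, w_out: assumed directional weights applied to incoming/outgoing arcs
    """
    incoming = [[] for _ in range(n)]
    outgoing = [[] for _ in range(n)]
    for head, dep in arcs:
        outgoing[head].append(dep)
        incoming[dep].append(head)

    # Degree: incoming and outgoing arcs counted with separate weights.
    degree = [w_in * len(incoming[i]) + w_out * len(outgoing[i]) for i in range(n)]

    def bfs(starts, neighbours):
        # Breadth-first search; unreachable words keep the value None.
        dist = [None] * n
        queue = deque()
        for s in starts:
            dist[s] = 0
            queue.append(s)
        while queue:
            u = queue.popleft()
            for v in neighbours(u):
                if dist[v] is None:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        return dist

    # Distance: undirected hop count from the query word.
    distance = bfs([query], lambda u: incoming[u] + outgoing[u])
    # Depth: directed hop count from the nearest root (a word with no incoming arc).
    roots = [i for i in range(n) if not incoming[i]]
    depth = bfs(roots, lambda u: outgoing[u])
    return degree, distance, depth
```

For the three-word input "the cat sleeps" with arcs sleeps→cat and cat→the, calling `graph_features(3, [(2, 1), (1, 0)], query=2)` yields one value per word for each feature. In the full model these values would then be mapped to embeddings and fused into attention, as the component list describes; that step is omitted here.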
Empirical Evaluation
Language Modeling
GiLT variants (trained on PSD, DM, PAS, and DP dependency datasets) are evaluated on the BLLIP-LG corpus, benchmarked against TXL baselines and syntactic LMs such as Pushdown-LM, PLM, TG, and DTG. GiLT achieves perplexity (PPL) commensurate with TXL and Pushdown-LM, unlike several syntactic LMs that sacrifice PPL for inductive bias. Notably, dependency graph-based models (PSD, DM, PAS) yield consistently superior PPL relative to tree-based GiLT-DP, underscoring the flexibility of graph-structured augmentation.
Syntactic Generalization
On the BLiMP and SG test suites, GiLT-PSD achieves higher syntactic generalization scores than the TXL baseline, by 0.6% (BLiMP) and 7.6% (SG). Parameter scaling alone (TXL-Large) provides only marginal improvements, which suggests that structural inductive bias contributes more than brute-force scaling.
Downstream Task Finetuning
GiLT finetuned from pretrained GPT2 outperforms Post-GPT2 (finetuned vanilla GPT2) on all four GLUE tasks evaluated (RTE, SST2, MRPC, STS-B). The gains in both language understanding and syntactic generalization persist after finetuning, as reflected in maintained BLiMP and SG scores.
Efficiency and Ablation
GiLT delivers faster inference and reduced memory consumption compared to DTG, particularly as beam size increases. Ablation studies affirm the necessity of all three features (degree, depth, distance) and their weighted computation; the removal of any component degrades syntactic generalization or macro SG scores but leaves PPL largely unperturbed.
Attention Analysis
Qualitative analysis shows GiLT attention better tracks the syntactic subject and modifier relations in complex sentences compared to TXL, attributing attention to relevant governing words as encoded by the dependency graph.
Theoretical and Practical Implications
GiLT demonstrates that augmenting attention with dependency graph features enables Transformers to generalize better on syntactic phenomena, improve downstream task performance, and maintain competitive perplexity. The architectural compatibility with pretrained models is a critical practical feature for deployment and adaptation. From a theoretical perspective, the shift from tree-centric to graph-centric structural induction broadens the scope for modeling semantic relations, multi-headed dependencies, and non-projective structures. The efficiency and robustness gains further strengthen the case for graph-based structural guidance in large-scale language modeling.
Future directions include unsupervised graph induction, joint modeling of multiple dependency graph formalisms, and further optimization of feature extraction and integration mechanisms. Limitations arise from reliance on beam search for graph marginalization, prompting research on scalable inference and approximations.
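The beam approximation referred to above can be written out as follows. The notation is assumed for illustration: $x_{1:T}$ is the token sequence, $G$ ranges over dependency graphs, and $\mathcal{B}$ is the set of graphs retained in the beam.

```latex
p(x_{1:T}) \;=\; \sum_{G} p(x_{1:T}, G) \;\approx\; \sum_{G \in \mathcal{B}} p(x_{1:T}, G)
```

Because every term is non-negative, the truncated sum is a lower bound on the true marginal. Retaining more graphs can tighten the bound, but each additional graph must be scored at every step, which is the cost that motivates research on scalable approximations.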
Conclusion
GiLT offers a principled framework for enriching Transformer LMs with dependency graph features, providing significant syntactic generalization and downstream performance gains with efficiency advantages. The model's capacity to incrementally construct and exploit structural information without altering input/output token space is particularly valuable for leveraging pretrained models and achieving syntactically aware language modeling (2605.15562).