Evaluating Syntactic Generalization in Neural Language Models
The paper "A Systematic Assessment of Syntactic Generalization in Neural LLMs" by Jennifer Hu et al. provides a comprehensive analysis of the syntactic capabilities of neural LLMs (NLMs). Given the rapid advancements in NLMs and their ability to achieve lower perplexity scores, this research seeks to determine if these models encapsulate human-like syntactic knowledge. It emphasizes the need to evaluate models using both information-theoretic metrics, such as perplexity, and targeted syntactic evaluations.
Study Design and Methods
The authors conduct a systematic evaluation of a wide range of model architectures using 20 combinations of model types and data sizes, ranging from roughly 1 million to 40 million words, across 34 English-language syntactic test suites. These test suites cover syntactic phenomena such as subject-verb agreement, filler-gap dependencies, and garden-path effects.
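The sketch below scores a hypothetical subject-verb agreement item in the spirit of these test suites: a model passes if it assigns lower surprisal (higher probability) to the grammatical verb than to the ungrammatical one in the same context. The sentences, the helper function, and the pass criterion are simplified stand-ins, not the paper's exact materials.

```python
# Minimal sketch of a targeted syntactic evaluation (subject-verb agreement).
# Illustrative only: the test item and pass criterion are simplified stand-ins
# for the paper's controlled test suites.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def continuation_surprisal(prefix: str, continuation: str) -> float:
    """Surprisal (in bits) of `continuation` given `prefix`."""
    prefix_ids = tokenizer(prefix, return_tensors="pt")["input_ids"]
    cont_ids = tokenizer(continuation, return_tensors="pt")["input_ids"]
    input_ids = torch.cat([prefix_ids, cont_ids], dim=1)
    with torch.no_grad():
        logits = model(input_ids).logits
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    targets = input_ids[0, 1:]
    # Sum surprisal over the continuation tokens only.
    start = prefix_ids.shape[1] - 1
    nats = -log_probs[range(len(targets)), targets][start:].sum()
    return (nats / torch.log(torch.tensor(2.0))).item()

# A hypothetical agreement item with an intervening prepositional phrase.
prefix = "The author next to the senators"
s_gram = continuation_surprisal(prefix, " is")
s_ungram = continuation_surprisal(prefix, " are")
print(f"grammatical: {s_gram:.2f} bits, ungrammatical: {s_ungram:.2f} bits")
print("PASS" if s_gram < s_ungram else "FAIL")
```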
The models investigated include Long Short-Term Memory networks (LSTM), Ordered-Neurons LSTM (ON-LSTM), Recurrent Neural Network Grammars (RNNG), and GPT-2 as a Transformer-based architecture. The evaluation is complemented by off-the-shelf models trained on larger datasets of up to 2 billion tokens.
Key Findings
- Dissociation between Perplexity and Syntactic Generalization: The results show a notable dissociation between perplexity and syntactic generalization performance. A model's ability to reduce perplexity does not necessarily translate into better syntactic generalization, indicating that perplexity alone is an inadequate evaluation metric.
- Impact of Model Architecture over Data Size: Variability in syntactic generalization performance is driven more by model architecture than by the size of the training dataset. For example, models with explicit structural supervision outperform the others and achieve robust syntactic generalization scores even with smaller training sets.
- Model-specific Strengths and Weaknesses: Different architectures exhibit distinct strengths across syntactic test types. For instance, the RNNG and Transformer models handle a broad range of syntactic challenges effectively, with RNNG benefiting from its explicit supervision over hierarchical structure.
- Robustness to Intervening Content: The paper also assesses model stability in the presence of syntactically irrelevant intervening content, shedding light on whether models maintain their syntactic generalizations across variations in sentence construction (a minimal illustration follows this list).
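One way to approximate this robustness check is to re-run the same agreement comparison with and without syntactically irrelevant intervening material and see whether the grammatical/ungrammatical surprisal difference survives. The sketch below assumes the `continuation_surprisal` helper from the earlier example; the sentences and the added relative clause are hypothetical illustrations, not the paper's materials.

```python
# Sketch: does the agreement preference survive syntactically irrelevant
# intervening content? Assumes continuation_surprisal() from the earlier sketch.
# The prefixes and the added relative clause are hypothetical illustrations.
items = {
    "no modifier":   "The author",
    "with modifier": "The author who met the senators yesterday",
}

for label, prefix in items.items():
    s_gram = continuation_surprisal(prefix, " is")
    s_ungram = continuation_surprisal(prefix, " are")
    effect = s_ungram - s_gram  # positive = grammatical form preferred
    print(f"{label:>13s}: agreement effect = {effect:+.2f} bits")
```

A model that is robust in this sense should show a positive agreement effect in both conditions, not just in the short, unmodified sentence.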
Implications and Future Directions
The dissociation between perplexity and syntactic generalization underscores the necessity of integrating fine-grained linguistic assessments into model evaluation pipelines. The research provides a framework for probing the syntactic knowledge NLMs acquire, using controlled test suites inspired by psycholinguistic experiments. Furthermore, the findings suggest pathways for optimizing NLM architectures for specific syntactic tasks, enhancing their utility in natural language processing applications.
Overall, the paper contributes to a deeper understanding of the syntactic knowledge captured by NLMs and lays the groundwork for linguistically informed evaluation of future models. It also raises important questions about whether training on strings alone is sufficient for acquiring comprehensive syntactic knowledge, encouraging further exploration of architectures that more closely mirror human language processing.