- The paper shows that RNNs can acquire hierarchical generalizations in English question formation without built-in hierarchical bias.
- The study compares various RNN architectures, revealing that only the GRU with attention consistently focuses on hierarchical cues over linear ones.
- The research finds that input properties like subject-verb agreement significantly enhance hierarchical rule learning in neural networks.
This paper investigates the necessity of innate hierarchical constraints in language learners, specifically addressing the poverty of the stimulus (POS) argument. The authors simulate the acquisition of English subject-auxiliary inversion using recurrent neural networks (RNNs) to determine if these networks can learn hierarchical rules without explicit pre-existing hierarchical constraints.
The core argument revolves around the idea that children acquiring language consistently prefer hierarchical syntactic rules over linear ones, even when exposed to simple examples compatible with both. The authors explore whether RNNs, which are not inherently constrained to hierarchical processing, can learn hierarchical generalizations through exposure to linguistic input.
The paper focuses on the transformation of declarative sentences into questions, as in "My walrus can giggle" becoming "Can my walrus giggle?". Two potential rules for this transformation are considered: a hierarchical rule (moving the main verb's auxiliary) and a linear rule (moving the linearly first auxiliary). While both rules work for simple sentences, they diverge in complex sentences with relative clauses, such as "My walrus that will eat can giggle." The hierarchical rule correctly predicts "Can my walrus that will eat giggle?", while the linear rule incorrectly predicts "*Will my walrus that eat can giggle?".
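The contrast between the two rules can be made concrete with a small sketch. The following Python snippet is not from the paper; the auxiliary inventory and the hand-supplied index of the main-clause auxiliary are simplifying assumptions for illustration:

```python
# Minimal illustration of the two candidate rules for question formation.
# The auxiliary set and the main-auxiliary index are hard-coded assumptions
# for this toy example; a learner must infer them from the input.
AUXILIARIES = {"can", "will"}

def linear_rule(tokens):
    """Front the linearly first auxiliary in the sentence."""
    i = next(i for i, w in enumerate(tokens) if w in AUXILIARIES)
    return [tokens[i].capitalize()] + tokens[:i] + tokens[i + 1:]

def hierarchical_rule(tokens, main_aux_index):
    """Front the auxiliary of the main clause (its position is supplied here,
    rather than derived from hierarchical structure)."""
    return ([tokens[main_aux_index].capitalize()]
            + tokens[:main_aux_index] + tokens[main_aux_index + 1:])

sentence = "my walrus that will eat can giggle".split()
print(" ".join(linear_rule(sentence)) + "?")           # *Will my walrus that eat can giggle?
print(" ".join(hierarchical_rule(sentence, 5)) + "?")  # Can my walrus that will eat giggle?
```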
The authors trained RNNs on two fragments of English: a "no-agreement" language and an "agreement" language. The no-agreement language allows any noun to appear with any auxiliary. The agreement language enforces subject-verb agreement using "do," "don't," "does," and "doesn't" as auxiliaries, and thus provides an additional cue to hierarchy: in the example "the walruses that the newt does confuse do high_five your peacocks," the auxiliary "do" agrees with its hierarchically determined plural subject "walruses" even though the singular noun "newt" is linearly closer to it. The networks were trained on identity and question-formation tasks, with question formation withheld during training for sentences containing a relative clause on the subject.
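To make the training setup concrete, here is an illustrative sketch of the kinds of input-output pairs involved. The task markers ("DECL", "QUEST") and the exact formatting are assumptions; only the example sentences themselves come from the text above:

```python
# Illustrative training pairs: the input carries a task marker and the target
# is either the same sentence (identity) or its question form. Marker names
# and tokenization details are assumptions, not the paper's exact format.
training_pairs = [
    # identity task (includes sentences with relative clauses on the subject)
    ("my walrus that will eat can giggle DECL", "my walrus that will eat can giggle"),
    # question formation for simple sentences (compatible with both rules)
    ("my walrus can giggle QUEST", "can my walrus giggle"),
    # agreement language: "do" agrees with its hierarchical subject "walruses"
    ("the walruses that the newt does confuse do high_five your peacocks QUEST",
     "do the walruses that the newt does confuse high_five your peacocks"),
]

# Withheld from training and used only to test generalization:
generalization_pair = ("my walrus that will eat can giggle QUEST",
                       "can my walrus that will eat giggle")
```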
The following RNN architectures were explored:
- Simple Recurrent Network (SRN)
- SRN with attention
- Gated Recurrent Unit (GRU)
- GRU with attention
- Long Short-Term Memory (LSTM)
- LSTM with attention
All models use a sequence-to-sequence framework with an encoder and a decoder: the encoder processes the input sentence into a vector representation, and the decoder generates the output sentence from that encoding. In the attention variants, the decoder additionally has access to the intermediate steps of the encoding process.
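A minimal PyTorch sketch of this kind of architecture (an illustration under assumed hyperparameters and a dot-product attention variant, not the paper's implementation; the teacher-forced decoding loop and training code are omitted):

```python
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    def __init__(self, vocab_size, hidden_size=256, use_attention=True):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_size)
        self.encoder = nn.GRU(hidden_size, hidden_size, batch_first=True)
        self.decoder = nn.GRU(hidden_size, hidden_size, batch_first=True)
        self.use_attention = use_attention
        out_dim = 2 * hidden_size if use_attention else hidden_size
        self.out = nn.Linear(out_dim, vocab_size)

    def forward(self, src, tgt):
        # Encode: enc_states has one vector per input token; h is the final
        # hidden state (the "encoding of the sentence" probed later).
        enc_states, h = self.encoder(self.embed(src))
        dec_states, _ = self.decoder(self.embed(tgt), h)
        if self.use_attention:
            # Dot-product attention: each decoder state attends over all encoder states.
            scores = torch.bmm(dec_states, enc_states.transpose(1, 2))
            weights = torch.softmax(scores, dim=-1)
            context = torch.bmm(weights, enc_states)
            dec_states = torch.cat([dec_states, context], dim=-1)
        return self.out(dec_states)  # per-step logits over the vocabulary

# Toy usage with a batch of two padded sequences of length 6
model = Seq2Seq(vocab_size=50)
src = torch.randint(0, 50, (2, 6))
tgt = torch.randint(0, 50, (2, 6))
logits = model(src, tgt)  # shape: (2, 6, 50)
```

Swapping `nn.GRU` for `nn.RNN` or `nn.LSTM` and toggling `use_attention` yields the six architecture variants compared in the paper.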
Key findings include:
- Agreement leads to more robust hierarchical generalization: All six architectures were significantly more likely (p < 0.01) to choose the main auxiliary when trained on the agreement language than when trained on the no-agreement language.
- Initialization matters: Accuracy often varied considerably across random initializations.
- Different architectures perform qualitatively differently: Only the GRU with attention showed a strong preference for choosing the main auxiliary instead of the linearly first auxiliary. The vanilla GRU chose the first auxiliary nearly 100% of the time.
To understand these differences, the authors analyzed the final hidden state of the encoder, referred to as the encoding of the sentence, focusing on how much information these encodings contained about the main auxiliary, the fourth word, and the subject noun. Linear classifiers were trained to predict these properties from the sentence encodings. The results indicated that while most architectures could identify the main auxiliary, the GRU with attention was unique in performing poorly on the fourth word and subject noun tasks. This suggests that the GRU with attention focused on hierarchical information, ignoring linear cues.
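The probing analysis can be sketched as follows, assuming the sentence encodings and per-property labels are already available as arrays; the function name and placeholder data are illustrative, not the paper's code:

```python
# Diagnostic (probing) classifiers: predict properties of the sentence from
# its encoding. Placeholder data stands in for real encodings and labels.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def probe(encodings, labels, probe_name):
    """Train a linear classifier to predict a property (main auxiliary,
    fourth word, or subject noun) from sentence encodings."""
    X_train, X_test, y_train, y_test = train_test_split(
        encodings, labels, test_size=0.2, random_state=0)
    clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    print(f"{probe_name}: test accuracy = {clf.score(X_test, y_test):.3f}")

encodings = np.random.randn(1000, 256)            # (n_sentences, hidden_size) placeholder
main_aux = np.random.randint(0, 4, 1000)          # placeholder labels
fourth_word = np.random.randint(0, 30, 1000)
subject_noun = np.random.randint(0, 20, 1000)

for name, labels in [("main auxiliary", main_aux),
                     ("fourth word", fourth_word),
                     ("subject noun", subject_noun)]:
    probe(encodings, labels, name)
```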
Further analysis of the full questions produced by the networks revealed that the GRU with attention's errors sometimes aligned with errors made by children learning English: for example, the network sometimes preposed the second (main) auxiliary without deleting either auxiliary, an error type that is common among English-learning children and is compatible with a hierarchical generalization. However, the networks also made mistakes that humans never make: in another frequent error type, the network deleted the first auxiliary while preposing the second, a pattern never observed in humans and incompatible with a hierarchical generalization.
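The error categories described above could be detected along these lines; this is a rough sketch under simplifying assumptions about tokenization and the auxiliary inventory, not the authors' analysis code:

```python
# Categorize a produced question relative to its source declarative.
AUXILIARIES = {"can", "will", "do", "don't", "does", "doesn't"}

def classify_output(declarative, produced):
    decl, out = declarative.split(), produced.split()
    aux_positions = [i for i, w in enumerate(decl) if w in AUXILIARIES]
    first_i, main_i = aux_positions[0], aux_positions[1]  # relative-clause aux, main aux
    target = [decl[main_i]] + decl[:main_i] + decl[main_i + 1:]  # hierarchical question
    if out == target:
        return "correct (hierarchical)"
    fronted = out[0] if out and out[0] in AUXILIARIES else None
    remaining = [w for w in out[1:] if w in AUXILIARIES]
    if fronted == decl[main_i] and remaining == [decl[i] for i in aux_positions]:
        return "prepose main auxiliary, delete neither (attested in children)"
    if fronted == decl[main_i] and decl[first_i] not in remaining:
        return "prepose main auxiliary, delete first auxiliary (unattested in humans)"
    return "other error"

decl = "my walrus that will eat can giggle"
print(classify_output(decl, "can my walrus that will eat giggle"))      # correct (hierarchical)
print(classify_output(decl, "can my walrus that will eat can giggle"))  # attested in children
print(classify_output(decl, "can my walrus that eat can giggle"))       # unattested in humans
```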
The authors conclude that a learner's preference for hierarchy may arise from hierarchical properties of the input, combined with biases inherent in the network's architecture and learning procedure. The paper suggests that a hierarchical constraint may not be necessary for language learners to acquire hierarchical generalizations. They propose that the GRU with attention architecture was able to overcome a linear bias due to the presence of hierarchical cues in the input language.