- The paper shows that RNNs can acquire hierarchical generalizations in English question formation without built-in hierarchical bias.
- The study compares various RNN architectures, revealing that only the GRU with attention consistently focuses on hierarchical cues over linear ones.
- The research finds that input properties like subject-verb agreement significantly enhance hierarchical rule learning in neural networks.
This paper investigates the necessity of innate hierarchical constraints in language learners, specifically addressing the poverty of the stimulus (POS) argument. The authors simulate the acquisition of English subject-auxiliary inversion using recurrent neural networks (RNNs) to determine if these networks can learn hierarchical rules without explicit pre-existing hierarchical constraints.
The core argument revolves around the idea that children acquiring language consistently prefer hierarchical syntactic rules over linear ones, even when exposed to simple examples compatible with both. The authors explore whether RNNs, which are not inherently constrained to hierarchical processing, can learn hierarchical generalizations through exposure to linguistic input.
The paper focuses on the transformation of declarative sentences into questions, as in "My walrus can giggle" becoming "Can my walrus giggle?". Two potential rules for this transformation are considered: a hierarchical rule (moving the main verb's auxiliary) and a linear rule (moving the linearly first auxiliary). While both rules work for simple sentences, they diverge in complex sentences with relative clauses, such as "My walrus that will eat can giggle." The hierarchical rule correctly predicts "Can my walrus that will eat giggle?", while the linear rule incorrectly predicts "*Will my walrus that eat can giggle?".
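The contrast between the two rules can be made concrete with a small sketch. The following Python snippet is not from the paper; the auxiliary inventory and the hand-supplied index of the main-clause auxiliary are simplifying assumptions for illustration:

```python
# Minimal illustration of the two candidate rules for question formation.
# The auxiliary set and the main-auxiliary index are hard-coded assumptions
# for this toy example; a learner must infer them from the input.
AUXILIARIES = {"can", "will"}

def linear_rule(tokens):
    """Front the linearly first auxiliary in the sentence."""
    i = next(i for i, w in enumerate(tokens) if w in AUXILIARIES)
    return [tokens[i].capitalize()] + tokens[:i] + tokens[i + 1:]

def hierarchical_rule(tokens, main_aux_index):
    """Front the auxiliary of the main clause (its position is supplied here,
    rather than derived from hierarchical structure)."""
    return ([tokens[main_aux_index].capitalize()]
            + tokens[:main_aux_index] + tokens[main_aux_index + 1:])

sentence = "my walrus that will eat can giggle".split()
print(" ".join(linear_rule(sentence)) + "?")           # *Will my walrus that eat can giggle?
print(" ".join(hierarchical_rule(sentence, 5)) + "?")  # Can my walrus that will eat giggle?
```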
The authors trained RNNs on two fragments of English: a "no-agreement" language and an "agreement" language. The no-agreement language allows any noun to appear with any auxiliary. The agreement language enforces subject-verb agreement using "do," "don't," "does," and "doesn't" as auxiliaries, and thus provides an additional cue to hierarchy: in the example "the walruses that the newt does confuse do high_five your peacocks," the auxiliary "do" agrees with its hierarchically determined plural subject "walruses" even though the singular noun "newt" is linearly closer to it. The networks were trained on identity and question-formation tasks, with question formation withheld during training for sentences containing a relative clause on the subject.
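To make the training setup concrete, here is an illustrative sketch of the kinds of input-output pairs involved. The task markers ("DECL", "QUEST") and the exact formatting are assumptions; only the example sentences themselves come from the text above:

```python
# Illustrative training pairs: the input carries a task marker and the target
# is either the same sentence (identity) or its question form. Marker names
# and tokenization details are assumptions, not the paper's exact format.
training_pairs = [
    # identity task (includes sentences with relative clauses on the subject)
    ("my walrus that will eat can giggle DECL", "my walrus that will eat can giggle"),
    # question formation for simple sentences (compatible with both rules)
    ("my walrus can giggle QUEST", "can my walrus giggle"),
    # agreement language: "do" agrees with its hierarchical subject "walruses"
    ("the walruses that the newt does confuse do high_five your peacocks QUEST",
     "do the walruses that the newt does confuse high_five your peacocks"),
]

# Withheld from training and used only to test generalization:
generalization_pair = ("my walrus that will eat can giggle QUEST",
                       "can my walrus that will eat giggle")
```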
The following RNN architectures were explored:
- Simple Recurrent Network (SRN)
- SRN with attention
- Gated Recurrent Unit (GRU)
- GRU with attention
- Long Short-Term Memory (LSTM)
- LSTM with attention
All models use a sequence-to-sequence framework with an encoder and a decoder: the encoder processes the input sentence into a vector representation, and the decoder generates the output sentence from that encoding. In the attention variants, the decoder additionally has access to the intermediate steps of the encoding process.
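A minimal PyTorch sketch of this kind of architecture (an illustration under assumed hyperparameters and a dot-product attention variant, not the paper's implementation; the teacher-forced decoding loop and training code are omitted):

```python
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    def __init__(self, vocab_size, hidden_size=256, use_attention=True):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_size)
        self.encoder = nn.GRU(hidden_size, hidden_size, batch_first=True)
        self.decoder = nn.GRU(hidden_size, hidden_size, batch_first=True)
        self.use_attention = use_attention
        out_dim = 2 * hidden_size if use_attention else hidden_size
        self.out = nn.Linear(out_dim, vocab_size)

    def forward(self, src, tgt):
        # Encode: enc_states has one vector per input token; h is the final
        # hidden state (the "encoding of the sentence" probed later).
        enc_states, h = self.encoder(self.embed(src))
        dec_states, _ = self.decoder(self.embed(tgt), h)
        if self.use_attention:
            # Dot-product attention: each decoder state attends over all encoder states.
            scores = torch.bmm(dec_states, enc_states.transpose(1, 2))
            weights = torch.softmax(scores, dim=-1)
            context = torch.bmm(weights, enc_states)
            dec_states = torch.cat([dec_states, context], dim=-1)
        return self.out(dec_states)  # per-step logits over the vocabulary

# Toy usage with a batch of two padded sequences of length 6
model = Seq2Seq(vocab_size=50)
src = torch.randint(0, 50, (2, 6))
tgt = torch.randint(0, 50, (2, 6))
logits = model(src, tgt)  # shape: (2, 6, 50)
```

Swapping `nn.GRU` for `nn.RNN` or `nn.LSTM` and toggling `use_attention` yields the six architecture variants compared in the paper.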
Key findings include:
- Agreement leads to more robust hierarchical generalization: All six architectures were significantly more likely (p < 0.01) to choose the main auxiliary when trained on the agreement language than when trained on the no-agreement language.
- Initialization matters: Accuracy often varied considerably across random initializations.
- Different architectures perform qualitatively differently: Only the GRU with attention showed a strong preference for choosing the main auxiliary instead of the linearly first auxiliary. The vanilla GRU chose the first auxiliary nearly 100% of the time.
To understand these differences, the authors analyzed the final hidden state of the encoder, referred to as the encoding of the sentence, focusing on how much information these encodings contained about the main auxiliary, the fourth word, and the subject noun. Linear classifiers were trained to predict these properties from the sentence encodings. The results indicated that while most architectures could identify the main auxiliary, the GRU with attention was unique in performing poorly on the fourth word and subject noun tasks. This suggests that the GRU with attention focused on hierarchical information, ignoring linear cues.
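The probing analysis can be sketched as follows, assuming the sentence encodings and per-property labels are already available as arrays; the function name and placeholder data are illustrative, not the paper's code:

```python
# Diagnostic (probing) classifiers: predict properties of the sentence from
# its encoding. Placeholder data stands in for real encodings and labels.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def probe(encodings, labels, probe_name):
    """Train a linear classifier to predict a property (main auxiliary,
    fourth word, or subject noun) from sentence encodings."""
    X_train, X_test, y_train, y_test = train_test_split(
        encodings, labels, test_size=0.2, random_state=0)
    clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    print(f"{probe_name}: test accuracy = {clf.score(X_test, y_test):.3f}")

encodings = np.random.randn(1000, 256)            # (n_sentences, hidden_size) placeholder
main_aux = np.random.randint(0, 4, 1000)          # placeholder labels
fourth_word = np.random.randint(0, 30, 1000)
subject_noun = np.random.randint(0, 20, 1000)

for name, labels in [("main auxiliary", main_aux),
                     ("fourth word", fourth_word),
                     ("subject noun", subject_noun)]:
    probe(encodings, labels, name)
```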
Further analysis of the full questions produced by the networks revealed that the GRU with attention's errors sometimes aligned with errors made by children learning English: for example, the network sometimes preposed the second (main) auxiliary without deleting either auxiliary, an error type that is common among English-learning children and is compatible with a hierarchical generalization. However, the networks also made mistakes that humans never make: in another frequent error type, the network deleted the first auxiliary while preposing the second, a pattern never observed in humans and incompatible with a hierarchical generalization.
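The error categories described above could be detected along these lines; this is a rough sketch under simplifying assumptions about tokenization and the auxiliary inventory, not the authors' analysis code:

```python
# Categorize a produced question relative to its source declarative.
AUXILIARIES = {"can", "will", "do", "don't", "does", "doesn't"}

def classify_output(declarative, produced):
    decl, out = declarative.split(), produced.split()
    aux_positions = [i for i, w in enumerate(decl) if w in AUXILIARIES]
    first_i, main_i = aux_positions[0], aux_positions[1]  # relative-clause aux, main aux
    target = [decl[main_i]] + decl[:main_i] + decl[main_i + 1:]  # hierarchical question
    if out == target:
        return "correct (hierarchical)"
    fronted = out[0] if out and out[0] in AUXILIARIES else None
    remaining = [w for w in out[1:] if w in AUXILIARIES]
    if fronted == decl[main_i] and remaining == [decl[i] for i in aux_positions]:
        return "prepose main auxiliary, delete neither (attested in children)"
    if fronted == decl[main_i] and decl[first_i] not in remaining:
        return "prepose main auxiliary, delete first auxiliary (unattested in humans)"
    return "other error"

decl = "my walrus that will eat can giggle"
print(classify_output(decl, "can my walrus that will eat giggle"))      # correct (hierarchical)
print(classify_output(decl, "can my walrus that will eat can giggle"))  # attested in children
print(classify_output(decl, "can my walrus that eat can giggle"))       # unattested in humans
```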
The authors conclude that a learner's preference for hierarchy may arise from hierarchical properties of the input, combined with biases inherent in the network's architecture and learning procedure. The paper suggests that a hierarchical constraint may not be necessary for language learners to acquire hierarchical generalizations. They propose that the GRU with attention architecture was able to overcome a linear bias due to the presence of hierarchical cues in the input language.