Linguistic Knowledge and Transferability of Contextual Representations: An Overview
The paper "Linguistic Knowledge and Transferability of Contextual Representations" by Liu et al. explores how well contextual word representations (CWRs), such as those produced by ELMo, the OpenAI transformer, and BERT, capture and transfer linguistic knowledge across NLP tasks. The authors use a suite of seventeen probing tasks to analyze the linguistic properties encoded in these representations, examining both their strengths and limitations.
Key Findings
The paper presents several significant findings regarding the capabilities of various pretrained contextualizers:
- Transferability Across Tasks:
- General Observations: Linear models trained on top of frozen contextual representations are often competitive with state-of-the-art task-specific models. However, these linear models fall short on tasks that require fine-grained linguistic knowledge.
- Task-Specific Failures: On tasks like conjunct identification, where fine-grained syntactic distinctions are crucial, pretrained CWRs perform poorly. This suggests that while pretrained contextualizers capture broad linguistic features, task-specific training may be necessary to capture more nuanced information.
- Layerwise Transferability:
- RNN vs. Transformer Layers: The paper confirms that the higher layers of recurrent models such as ELMo are more task-specific and less generalizable, whereas transformer-based models like BERT are most transferable in their middle layers.
- Impact of Pretraining Task: Language model pretraining yields the most transferable representations on average, but for a given target task, pretraining on a closely related task can perform better. Nonetheless, pretraining on larger datasets, as is standard for language modeling, generally gives the best results.
- Effectiveness of Probing Models:
- Probing Model Variants: Probing models with additional parameters (e.g., MLPs or task-trained LSTMs) outperform linear probes, especially on tasks that demand more complex syntactic information. This supports the notion that some linguistic features require the probe itself to learn task-specific contextual features.
- Comparison to State-of-the-Art: Probing models trained on CWRs often rival or surpass state-of-the-art models on a range of tasks, indicating the richness of the information encoded in these representations, despite certain limitations.
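The contrast between linear and MLP probes can be illustrated with a small, self-contained sketch. The data here is synthetic (an XOR pattern standing in for frozen features whose task signal is not linearly decodable), and the probes are minimal NumPy implementations, not the paper's actual setup:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for frozen representations: an XOR-style label
# that no linear probe can decode from the raw features.
X = rng.uniform(-1.0, 1.0, size=(400, 2))
y = ((X[:, 0] > 0) ^ (X[:, 1] > 0)).astype(float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-np.clip(z, -30.0, 30.0)))

def train_linear_probe(X, y, lr=0.5, steps=1000):
    """Logistic-regression probe trained by full-batch gradient descent."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(steps):
        g = sigmoid(X @ w + b) - y
        w -= lr * X.T @ g / len(y)
        b -= lr * g.mean()
    return lambda X: (sigmoid(X @ w + b) > 0.5).astype(float)

def train_mlp_probe(X, y, hidden=32, lr=1.0, steps=3000):
    """One-hidden-layer MLP probe: the extra capacity lets it learn
    task-specific features the linear probe cannot represent."""
    W1 = rng.normal(0.0, 1.0, (X.shape[1], hidden))
    b1 = np.zeros(hidden)
    W2 = rng.normal(0.0, 1.0, hidden)
    b2 = 0.0
    for _ in range(steps):
        H = np.tanh(X @ W1 + b1)
        g = (sigmoid(H @ W2 + b2) - y) / len(y)   # d(loss)/d(logits)
        gH = np.outer(g, W2) * (1.0 - H ** 2)     # backprop through tanh
        W2 -= lr * H.T @ g
        b2 -= lr * g.sum()
        W1 -= lr * X.T @ gH
        b1 -= lr * gH.sum(axis=0)
    return lambda X: (sigmoid(np.tanh(X @ W1 + b1) @ W2 + b2) > 0.5).astype(float)

linear_acc = (train_linear_probe(X, y)(X) == y).mean()
mlp_acc = (train_mlp_probe(X, y)(X) == y).mean()
print(f"linear probe: {linear_acc:.2f}, MLP probe: {mlp_acc:.2f}")
```

On this data the linear probe stays near chance while the MLP probe decodes the label, mirroring the paper's observation that some tasks need a probe with its own task-trained parameters.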
Methodological Insights
Liu et al. employ a comprehensive methodology to assess the linguistic knowledge and transferability of CWRs. They compare the performance of different layers within each contextualizer across a range of probing tasks, such as part-of-speech tagging, syntactic dependency arc classification, and coreference arc prediction. These tasks span different aspects of linguistic knowledge, including syntax, semantics, and coreference.
The use of ELMo, OpenAI transformer, and BERT allows the authors to examine how differences in architecture and pretraining objectives affect performance. Their controlled experiments demonstrate the advantages of transformer-based models, especially for tasks requiring higher-level semantic information.
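The layerwise comparison can be sketched in miniature. The per-layer features below are synthetic stand-ins for real ELMo or BERT activations, and the noise schedule that makes an intermediate layer most informative is invented purely for illustration, not a measured result:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical per-layer "frozen representations" for one probing task.
# Each layer mixes the same task signal with a different amount of noise;
# the schedule is an assumption chosen so a middle layer is cleanest.
n_tokens, dim = 300, 8
labels = rng.integers(0, 2, n_tokens).astype(float)
signal = np.outer(2.0 * labels - 1.0, np.ones(dim))
noise_per_layer = [1.5, 0.3, 0.8, 2.0]
layer_reprs = [signal + rng.normal(0.0, s, (n_tokens, dim))
               for s in noise_per_layer]

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-np.clip(z, -30.0, 30.0)))

def probe_accuracy(X, y, lr=1.0, steps=500):
    """Fit a linear probe on frozen features and return its accuracy."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(steps):
        p = sigmoid(X @ w + b)
        w -= lr * X.T @ (p - y) / len(y)
        b -= lr * (p - y).mean()
    return ((sigmoid(X @ w + b) > 0.5) == y.astype(bool)).mean()

layer_accs = [probe_accuracy(X, labels) for X in layer_reprs]
for i, acc in enumerate(layer_accs):
    print(f"layer {i}: probe accuracy {acc:.2f}")
```

Running one probe per layer and comparing the resulting accuracies is the same harness shape the paper uses to locate where in the network a given kind of linguistic information is most accessible.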
Practical and Theoretical Implications
The paper has both practical and theoretical implications:
- Model Architecture Design: The findings suggest that hybrid approaches combining RNNs and transformers might leverage the strengths of both architectures. Moreover, understanding the trade-offs between generality and task-specificity in different layers can guide the design of better encoder architectures.
- Pretraining Tasks: The results highlight the importance of pretraining on large corpora with appropriate self-supervised objectives such as language modeling. Future research can explore more sophisticated pretraining tasks that capture aspects of language not adequately addressed by current methods.
- Probing Models: The superiority of probing models with additional task-specific contextualization suggests future directions for fine-tuning and adapting pretrained models to specific NLP tasks.
Future Directions
The paper opens several avenues for future research:
- Robustness to Noise and Data Augmentation: Investigating the robustness of CWRs to noisy input, and the effectiveness of data augmentation methods, could further improve the transferability of these representations.
- Multilingual and Cross-lingual Transfer: Studying how CWRs perform across languages, and how well they transfer cross-lingually, can shed light on the universality of the linguistic features these models capture.
- Dynamic Entity Representations: Addressing the current limitations in capturing entity and coreference information, future work could integrate explicit entity representations to improve performance on tasks involving named entities and coreference resolution.
In conclusion, the paper by Liu et al. provides a thorough analysis of the capabilities and limitations of contemporary CWRs, offering valuable insights for improving NLP models. By systematically evaluating several pretrained models across a diverse set of tasks, the paper contributes significantly to our understanding of how these representations capture and transfer linguistic knowledge.