Multilingual Constituency Parsing with Self-Attention and Pre-Training

Published 31 Dec 2018 in cs.CL (arXiv:1812.11760v2)

Abstract: We show that constituency parsing benefits from unsupervised pre-training across a variety of languages and a range of pre-training conditions. We first compare the benefits of no pre-training, fastText, ELMo, and BERT for English and find that BERT outperforms ELMo, in large part due to increased model capacity, whereas ELMo in turn outperforms the non-contextual fastText embeddings. We also find that pre-training is beneficial across all 11 languages tested; however, large model sizes (more than 100 million parameters) make it computationally expensive to train separate models for each language. To address this shortcoming, we show that joint multilingual pre-training and fine-tuning allows sharing all but a small number of parameters between ten languages in the final model. The 10x reduction in model size compared to fine-tuning one model per language causes only a 3.2% relative error increase in aggregate. We further explore the idea of joint fine-tuning and show that it gives low-resource languages a way to benefit from the larger datasets of other languages. Finally, we demonstrate new state-of-the-art results for 11 languages, including English (95.8 F1) and Chinese (91.8 F1).

Citations (239)

Summary

  • The paper shows that integrating BERT with self-attention markedly improves multilingual constituency parsing performance.
  • It compares pre-training methods, with BERT achieving an English F1 score of 95.70 and ensemble scores near 95.87.
  • Joint multilingual training enables effective cross-lingual transfer and reduces model size by sharing parameters.


Overview

The paper, authored by Nikita Kitaev, Steven Cao, and Dan Klein, studies how unsupervised pre-training improves constituency parsing across languages with varying amounts of linguistic resources. It compares several pre-training conditions (no pre-training, fastText, ELMo, and BERT) and shows that BERT delivers the highest parsing accuracy, in large part because of its increased model capacity.

Model and Architecture

The parsing model builds on the self-attentive chart parser of Kitaev and Klein (2018). It produces a constituency tree by assigning a score to every labeled span with a neural network and then combining those span scores. BERT is integrated into this architecture by feeding the token representations from its final layer into the parser's encoder.
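
The span-scoring idea can be illustrated with a short sketch. The feature construction and layer sizes below are illustrative assumptions rather than the paper's exact configuration; the point is only that every span receives a vector of per-label scores from an MLP over encoder outputs.

```python
import torch
import torch.nn as nn

class SpanScorer(nn.Module):
    """Toy span-scoring head for a chart parser.

    Each span (i, j) gets a vector built from encoder outputs at its
    endpoints and is scored against every constituent label. Feature
    construction and layer sizes are illustrative, not the paper's.
    """

    def __init__(self, d_model: int, num_labels: int, d_hidden: int = 250):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * d_model, d_hidden),
            nn.LayerNorm(d_hidden),
            nn.ReLU(),
            nn.Linear(d_hidden, num_labels),
        )

    def forward(self, encodings: torch.Tensor) -> torch.Tensor:
        # encodings: (n_posts, d_model) fence-post representations derived
        # from the self-attention encoder over BERT's final-layer tokens.
        n, d = encodings.shape
        starts = encodings.unsqueeze(1).expand(n, n, d)  # post i
        ends = encodings.unsqueeze(0).expand(n, n, d)    # post j
        span_repr = torch.cat([ends - starts, starts + ends], dim=-1)
        return self.mlp(span_repr)  # (n, n, num_labels): score per span and label
```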

Concretely, the input is passed through additional self-attention layers and an MLP span classifier, and the highest-scoring valid tree is then recovered with a variant of the CKY algorithm. The paper emphasizes that extending BERT with these self-attentive layers and the MLP classifier significantly boosts parsing accuracy.
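
The decoding step can be sketched as a standard CKY-style dynamic program over span scores. The function below is a minimal illustration assuming a precomputed matrix of best per-span scores; the actual parser additionally tracks which label wins each span and handles unary chains. The recursion simply says that the best tree over a span adds that span's score to the best split into two adjacent subtrees.

```python
import numpy as np

def cky_decode(span_scores: np.ndarray) -> dict:
    """Minimal CKY-style decoding over per-span scores.

    span_scores[i, j] is assumed to hold the best labeled score for the
    span covering words i..j-1, using fence-post indexing with n words
    and n + 1 posts (so span_scores has shape (n + 1, n + 1)).
    """
    n = span_scores.shape[0] - 1
    best = {}      # (i, j) -> best total score of a tree over span (i, j)
    backptr = {}   # (i, j) -> split point k achieving that score

    for length in range(1, n + 1):
        for i in range(0, n - length + 1):
            j = i + length
            if length == 1:
                best[(i, j)] = span_scores[i, j]
                continue
            # Choose the split maximizing the sum of the two subtrees.
            k, split_score = max(
                ((k, best[(i, k)] + best[(k, j)]) for k in range(i + 1, j)),
                key=lambda t: t[1],
            )
            best[(i, j)] = span_scores[i, j] + split_score
            backptr[(i, j)] = k

    return {"score": best[(0, n)], "splits": backptr}
```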

Numerical Results

The experimental results are strong. For English, the parser built on BERT-large achieved an F1 score of 95.70, outperforming both the ELMo and fastText variants, and an ensemble of BERT-based parsers reached 95.87 F1, underscoring the benefits of pre-training with large model sizes.

In the multilingual setting, joint pre-training and fine-tuning allow nearly all parameters to be shared across ten languages, and joint fine-tuning in particular gives low-resource languages access to the larger treebanks of other languages. The shared model incurs only a 3.2% relative error increase in aggregate while being roughly 10x smaller than training a separate model per language.
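
One way to picture this parameter sharing is a single shared encoder and parser stack with small per-language output heads. The sketch below is an assumption about how such sharing could be wired up, not the paper's exact implementation; in particular, which parameters remain language-specific here is illustrative.

```python
from typing import Dict
import torch.nn as nn

class JointMultilingualParser(nn.Module):
    """Sketch of parameter sharing for joint multilingual fine-tuning.

    A single pre-trained multilingual encoder is shared across all
    languages; only small per-treebank output layers differ. Exactly
    which parameters stay language-specific is an assumption here.
    """

    def __init__(self, shared_encoder: nn.Module, d_model: int,
                 labels_per_language: Dict[str, int]):
        super().__init__()
        self.encoder = shared_encoder  # e.g. multilingual BERT, shared by all languages
        self.heads = nn.ModuleDict({
            lang: nn.Linear(d_model, n_labels)  # small per-language classifier
            for lang, n_labels in labels_per_language.items()
        })

    def forward(self, tokens, language: str):
        features = self.encoder(tokens)         # shared parameters
        return self.heads[language](features)   # language-specific parameters
```

Under this kind of layout, the shared encoder dominates the parameter count, which is what makes the 10x reduction in total model size possible when one joint model replaces ten per-language models.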

Implications and Future Directions

The research underscores the value of pre-trained language models for NLP tasks like constituency parsing, with significant gains observed across all languages tested. The ability to train high-capacity multilingual models points toward scalable parsing systems that cover diverse languages without the computational cost of a separate large model per language.

Future work might refine joint training methodologies to strengthen cross-lingual generalization, particularly when labeled data is scarce. Exploring more sophisticated tokenization schemes and self-attentive architectures could further improve parsing accuracy.

Conclusion

This paper's findings affirm the significance of high-capacity pre-trained models like BERT for syntactic parsing tasks. Through comprehensive analysis across multiple languages, the study provides a solid foundation for future research aimed at optimizing multilingual NLP systems. Continued exploration into pre-training and fine-tuning paradigms holds potential for substantial advancements in the field.
