Adaptation of Deep Bidirectional Multilingual Transformers for Russian Language
The paper "Adaptation of Deep Bidirectional Multilingual Transformers for Russian Language" investigates the adaptation techniques for multilingual masked LLMs to focus on a specific language, employing the Russian language as the case paper. The research builds upon the robust frameworks provided by foundational models such as BERT, a bidirectional transformer pre-trained using vast amounts of textual data.
Overview of Methodology
The authors use transfer learning to initialize a monolingual model from a pre-existing multilingual one. This approach improves performance across several Russian NLP tasks, including reading comprehension, paraphrase identification, and sentiment analysis. Moreover, initializing the monolingual model from the multilingual model significantly accelerates training. Notably, the paper describes the construction of a new subword vocabulary tailored to Russian, built from Russian Wikipedia and news data. This step is crucial because it mitigates the inefficiency of multilingual subword segmentation, which otherwise produces longer token sequences and higher computational overhead.
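The initialization step can be made concrete with a short sketch. Below, embeddings for a new Russian-specific vocabulary are seeded by averaging the multilingual BERT embeddings of the subtokens each new token maps to, while the remaining Transformer weights are copied unchanged. The Hugging Face model name, the toy vocabulary, and the averaging rule are illustrative assumptions, not the paper's exact procedure.

```python
import torch
from transformers import BertModel, BertTokenizer

# Source model: multilingual BERT (the starting point for adaptation).
mbert_tok = BertTokenizer.from_pretrained("bert-base-multilingual-cased")
mbert = BertModel.from_pretrained("bert-base-multilingual-cased")
mbert_emb = mbert.get_input_embeddings().weight  # (vocab_size, hidden_size)

# Illustrative Russian-specific subword vocabulary; in practice it would be
# trained on Russian Wikipedia and news text.
new_vocab = ["привет", "москва", "##ование", "язык"]

hidden_size = mbert_emb.size(1)
new_emb = torch.zeros(len(new_vocab), hidden_size)

with torch.no_grad():
    for i, token in enumerate(new_vocab):
        # Segment the new token with the *multilingual* tokenizer and seed the
        # new embedding row with the mean of the resulting subtoken embeddings.
        pieces = mbert_tok.tokenize(token.lstrip("#")) or [mbert_tok.unk_token]
        ids = mbert_tok.convert_tokens_to_ids(pieces)
        new_emb[i] = mbert_emb[ids].mean(dim=0)

# All other layers are copied from multilingual BERT unchanged; pre-training
# then continues on Russian text with the masked language modeling objective.
```

Seeding the new embedding rows this way keeps them in the same representation space as the copied Transformer layers, which is what allows continued pre-training to converge much faster than training from scratch.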
Key Experimental Findings
The experimental results highlight substantial improvements facilitated by the proposed methodology:
- Paraphrase Identification: On the ParaPhraser dataset, RuBERT (the Russian-specific BERT) achieved an F1 score of 87.73 and an accuracy of 84.99, outperforming multilingual BERT by a notable margin.
- Sentiment Analysis: On the RuSentiment dataset, RuBERT reached an F1 score of 72.63, compared to 70.82 for the multilingual model.
- Question Answering: On the SDSJ Task B dataset, RuBERT surpassed multilingual BERT with F1 and Exact Match (EM) scores of 84.60 and 66.30, respectively.
These results show that monolingual adaptation of BERT from a multilingual initialization yields improved performance, particularly on tasks whose training data lies closer to the domain of the adapted model.
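For reference, the F1 and exact-match numbers reported for question answering are span-level metrics in the style of SQuAD evaluation. The minimal sketch below shows how such scores are typically computed for a single prediction-reference pair; it illustrates the metrics only and is not the evaluation script used in the paper.

```python
import re
from collections import Counter

def normalize(text: str) -> list[str]:
    """Lowercase, strip punctuation, and split into tokens."""
    return re.sub(r"[^\w\s]", " ", text.lower()).split()

def exact_match(prediction: str, reference: str) -> float:
    """1.0 if the normalized prediction equals the normalized reference."""
    return float(normalize(prediction) == normalize(reference))

def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between predicted and reference answer spans."""
    pred, ref = normalize(prediction), normalize(reference)
    common = Counter(pred) & Counter(ref)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

# A partially overlapping answer earns F1 credit but no exact-match credit.
print(exact_match("в 1961 году", "1961"))  # 0.0
print(token_f1("в 1961 году", "1961"))     # 0.5
```

Corpus-level scores are obtained by averaging these per-example values, which is why F1 is always at least as high as EM.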
Implications and Future Directions
The findings of this paper have both theoretical and practical implications. Theoretically, the work confirms that transfer learning can reduce the training time of language-specific models while matching, and even exceeding, the performance of multilingual models. Practically, it opens avenues for developing more efficient language-specific models adapted from existing multilingual ones, without the prohibitive cost and computational resources typically associated with training from scratch.
For future exploration, the field could benefit from a systematic examination of how well this approach scales to other languages. It would also be worthwhile to investigate whether similar transfer learning strategies benefit other pre-trained language models, such as GPT, RoBERTa, or their successors. Exploring how the size and quality of the initial multilingual corpus affect the performance of adapted models also holds promise for refining these methods.
In conclusion, this paper provides an insightful contribution to the domain of NLP by demonstrating an effective methodology for adapting multilingual transformer models to serve monolingual needs, paving the way for more accessible and high-performance language-specific NLP solutions.