Cross-Lingual Ability of Multilingual BERT: An Empirical Study (1912.07840v2)

Published 17 Dec 2019 in cs.CL, cs.AI, and cs.LG

Abstract: Recent work has exhibited the surprising cross-lingual abilities of multilingual BERT (M-BERT) -- surprising since it is trained without any cross-lingual objective and with no aligned data. In this work, we provide a comprehensive study of the contribution of different components in M-BERT to its cross-lingual ability. We study the impact of linguistic properties of the languages, the architecture of the model, and the learning objectives. The experimental study is done in the context of three typologically different languages -- Spanish, Hindi, and Russian -- and using two conceptually different NLP tasks, textual entailment and named entity recognition. Among our key conclusions is the fact that the lexical overlap between languages plays a negligible role in the cross-lingual success, while the depth of the network is an integral part of it. All our models and implementations can be found on our project page: http://cogcomp.org/page/publication_view/900 .

Evaluation of Cross-Lingual Capabilities in Multilingual BERT

The paper "Cross-Lingual Ability of Multilingual BERT: An Empirical Study" offers an in-depth analysis of the factors underpinning the cross-lingual capabilities of the Multilingual Bidirectional Encoder Representations from Transformers (BERT). Despite multilingual BERT being trained on raw Wikipedia text in 104 languages without cross-lingual supervision or explicitly aligned data, it displays an impressive capacity for cross-lingual tasks. This paper explores the components contributing to this capability through extensive experiments encompassing various linguistic properties, model architectures, and learning objectives.

Linguistic Properties

A primary area of investigation concerns the influence of linguistic similarities between source and target languages. Earlier hypotheses by Pires et al. (2019) and Wu et al. (2019) suggested that the cross-lingual strength of BERT arises from word-pieces shared across languages. In contrast, the paper's experiments with a constructed Fake-English language, which shares no word-pieces with any other language, show that word-piece overlap plays only a minor role in cross-lingual success. The findings instead highlight the importance of structural similarities, such as word ordering and potentially higher-order co-occurrence statistics, in facilitating cross-lingual transfer, far more than simple lexical overlap or word-frequency similarity.
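
To make the word-piece ablation concrete, the sketch below shows one way such a Fake-English corpus could be generated: every character is shifted into an otherwise unused Unicode range, removing all lexical overlap with English while preserving word order and structure. The offset and helper name are illustrative assumptions, not the authors' released code.

```python
# Illustrative sketch (not the authors' released code): building a "Fake-English"
# corpus by shifting every non-whitespace character into an unused Unicode range,
# so the text shares no word-pieces with English while keeping its structure.
SHIFT = 0x10000  # hypothetical offset; any large constant into unused code points works

def to_fake_english(text: str) -> str:
    """Shift each non-whitespace character's code point by SHIFT."""
    return "".join(ch if ch.isspace() else chr(ord(ch) + SHIFT) for ch in text)

fake = to_fake_english("multilingual BERT transfers across languages")
assert not set(fake) & set("abcdefghijklmnopqrstuvwxyz")  # no character overlap
```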

Model Architecture

The paper examines BERT's architecture to determine which elements are vital for cross-lingual performance (a configuration sketch follows the list):

  1. Depth: Network depth significantly enhances cross-lingual ability, likely because deeper networks extract the semantic and structural features critical for successful transfer.
  2. Number of Parameters: The total parameter count is not as pivotal as depth, but there appears to be a requisite minimum below which cross-lingual performance drops.
  3. Multi-head Attention: Surprisingly, the number of attention heads is largely inconsequential for cross-lingual efficacy. A single attention head suffices, aligning with recent analyses claiming most attention heads in BERT might not be crucial.
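
The ablations above can be approximated with off-the-shelf tooling. The sketch below uses the Hugging Face `transformers` `BertConfig` to instantiate models of varying depth and head count; the library choice and the specific layer and head counts are assumptions for exposition, not the paper's exact configurations.

```python
# Illustrative sketch of the architectural ablations using Hugging Face
# `transformers`; layer and head counts below are example values, not the
# paper's exact configurations.
from transformers import BertConfig, BertForMaskedLM

def build_bert(num_layers: int, num_heads: int, hidden_size: int = 768):
    config = BertConfig(
        num_hidden_layers=num_layers,    # depth: the factor found most important
        num_attention_heads=num_heads,   # heads: even a single head can suffice
        hidden_size=hidden_size,         # overall size: only a minimum is required
        intermediate_size=4 * hidden_size,
    )
    return BertForMaskedLM(config)

deep_single_head = build_bert(num_layers=12, num_heads=1)    # deep, one head
shallow_multi_head = build_bert(num_layers=2, num_heads=12)  # shallow, many heads
```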

Input Representation and Learning Objective

Regarding learning objectives, the paper finds that the Next Sentence Prediction (NSP) objective does not benefit cross-lingual performance and that removing it can even help. Incorporating language-identity markers into the input likewise has a negligible effect, suggesting that BERT distinguishes languages without explicit markers. As for input representation, word-level and word-piece tokenization prove more effective than character-level tokenization, likely because these units carry richer contextual information.
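
As a rough illustration of an NSP-free setup, the sketch below trains with a masked-LM-only data collator, so no sentence-pair labels are ever constructed. It relies on the Hugging Face `transformers` library rather than the authors' code, and the checkpoint name and masking probability are standard illustrative choices.

```python
# Minimal sketch of a masked-LM-only setup with no Next Sentence Prediction,
# in the spirit of the finding that dropping NSP does not hurt cross-lingual
# transfer. Checkpoint and masking probability are illustrative defaults.
from transformers import (
    BertForMaskedLM,
    BertTokenizerFast,
    DataCollatorForLanguageModeling,
)

tokenizer = BertTokenizerFast.from_pretrained("bert-base-multilingual-cased")
model = BertForMaskedLM.from_pretrained("bert-base-multilingual-cased")

# Masked-LM masking only; no sentence-pair (NSP) labels are constructed.
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

batch = collator([tokenizer("An example sentence for masking.")])
loss = model(**batch).loss  # masked-LM loss alone, no NSP term
```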

Implications and Future Directions

The findings propose several implications for the design of cross-lingual NLP models:

  • Deprioritizing Word-piece Overlap: Researchers should prioritize capturing deeper structural similarities rather than lexical overlap when developing cross-lingual models.
  • Focusing on Structural Features: Future studies might refine the understanding of 'structural similarity' to identify components beneficial for cross-lingual transfer.
  • Optimizing Architectural Trade-offs: Given the negligible impact of multi-head attention, future models could be streamlined to far fewer attention heads while maintaining or increasing network depth.
  • Reassessing Learning Objectives: The role of NSP-like objectives should be reconsidered, given evidence of their negative impact on performance.

Through rigorous empirical analysis, this investigation challenges preconceived notions about cross-lingual learning and opens the way to more refined and efficient multilingual models. Further research should aim to dissect the structural similarities that truly drive cross-lingual generalization and to assess the impact of related languages in multilingual BERT models. These insights can substantially contribute to the design of more advanced cross-lingual transfer models in NLP.

Authors (4)
  1. Karthikeyan K (9 papers)
  2. Zihan Wang (181 papers)
  3. Stephen Mayhew (12 papers)
  4. Dan Roth (222 papers)
Citations (322)