Mitigating the Linguistic Gap with Phonemic Representations for Robust Cross-lingual Transfer (arXiv:2402.14279v3)
Abstract: Approaches to improving multilingual language understanding often struggle with significant performance gaps between high-resource and low-resource languages. While efforts have been made to align languages in a single latent space to mitigate such gaps, how different input-level representations influence these gaps, particularly phonemic inputs, has not been investigated. We hypothesize that the performance gaps stem from representation discrepancies between these languages, and revisit phonemic representations as a means of mitigating these discrepancies. To demonstrate their effectiveness, we present experiments on three representative cross-lingual tasks covering 12 languages in total. The results show that phonemic representations exhibit higher cross-language similarity than orthographic representations, and that phoneme-based models consistently outperform a grapheme-based baseline on relatively low-resourced languages. This quantitative evidence is further supported by a theoretical analysis of the cross-lingual performance gap.
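To make the core idea concrete, here is a minimal sketch (not the authors' released code) of the input-level transformation the abstract describes: mapping orthographic text into phonemic (IPA) strings with a grapheme-to-phoneme tool, so that different languages share one symbol inventory. It assumes the Epitran G2P library; the Spanish/Turkish cognate pair and the character-level similarity measure are illustrative choices, not the paper's evaluation protocol.

```python
# A minimal sketch of phonemic input conversion, assuming the Epitran
# G2P library (pip install epitran). The word pair and the string-level
# similarity measure are hypothetical examples for illustration only.
import difflib
import epitran

# Grapheme-to-phoneme converters for two languages with different orthographies.
g2p_spa = epitran.Epitran("spa-Latn")  # Spanish
g2p_tur = epitran.Epitran("tur-Latn")  # Turkish

def similarity(a: str, b: str) -> float:
    """Character-level similarity ratio in [0, 1]."""
    return difflib.SequenceMatcher(None, a, b).ratio()

ortho_spa, ortho_tur = "teléfono", "telefon"  # assumed cognate pair
ipa_spa = g2p_spa.transliterate(ortho_spa)    # orthography -> IPA string
ipa_tur = g2p_tur.transliterate(ortho_tur)

# Phonemic inputs live in a shared IPA symbol space, so surface similarity
# across languages tends to rise relative to raw orthography.
print(f"orthographic similarity: {similarity(ortho_spa, ortho_tur):.2f}")
print(f"phonemic similarity:     {similarity(ipa_spa, ipa_tur):.2f}")
```

The paper's actual comparison is made at the level of learned model representations and downstream task scores across 12 languages; the snippet above only illustrates the input-level intuition that a shared phonemic alphabet narrows surface discrepancies between languages.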
Authors: Haeji Jung, Changdae Oh, Jooeon Kang, Jimin Sohn, Kyungwoo Song, Jinkyu Kim, David R. Mortensen