
CORI: CJKV Benchmark with Romanization Integration -- A step towards Cross-lingual Transfer Beyond Textual Scripts (2404.12618v1)

Published 19 Apr 2024 in cs.CL, cs.AI, and cs.LG

Abstract: Naively assuming English as a source language may hinder cross-lingual transfer for many languages by failing to consider the importance of language contact. Some languages are better connected than others, and target languages can benefit from transferring from closely related languages; for many languages, the set of closely related languages does not include English. In this work, we study the impact of source language for cross-lingual transfer, demonstrating the importance of selecting source languages that have high contact with the target language. We also construct a novel benchmark dataset for the close-contact Chinese-Japanese-Korean-Vietnamese (CJKV) languages to further encourage in-depth studies of language contact. To comprehensively capture contact between these languages, we propose to integrate Romanized transcription beyond textual scripts via Contrastive Learning objectives, leading to enhanced cross-lingual representations and effective zero-shot cross-lingual transfer.

Authors (6)
  1. Hoang H. Nguyen (9 papers)
  2. Chenwei Zhang (60 papers)
  3. Ye Liu (153 papers)
  4. Natalie Parde (11 papers)
  5. Eugene Rohrbaugh (2 papers)
  6. Philip S. Yu (592 papers)
Citations (1)
