KpopMT: Translation Dataset with Terminology for Kpop Fandom (2407.07413v1)

Published 10 Jul 2024 in cs.CL

Abstract: While machines learn from existing corpora, humans have the unique capability to establish and accept new language systems. This leads humans to form unique language systems within social groups. Aligning with this, we focus on a gap remaining in addressing translation challenges within social groups, where in-group members utilize unique terminologies. We propose the KpopMT dataset, which aims to fill this gap by enabling precise terminology translation, choosing Kpop fandom as an initiative for social groups given its global popularity. Expert translators provide 1k English translations for Korean posts and comments, each annotated with specific terminology within social groups' language systems. We evaluate existing translation systems including GPT models on KpopMT to identify their failure cases. Results show overall low scores, underscoring the challenges of reflecting group-specific terminologies and styles in translation. We make KpopMT publicly available.

Summary

  • The paper presents a novel dataset addressing translation challenges for Kpop fandom-specific terminology using 1,000 annotated sentence pairs.
  • It evaluates multiple MT systems, showing that even advanced models like GPT-4 struggle with specialized Group-Lexicon terms.
  • The findings highlight the need for targeted MT approaches that integrate socio-linguistic data to preserve the cultural nuances of translations.

The paper "KpopMT: Translation Dataset with Terminology for Kpop Fandom" presents a novel dataset that addresses the challenge of translating terminology specific to social groups, using the Kpop fandom as a case study. The research underscores the unique linguistic structures formed within social communities and the challenges these pose to current Machine Translation (MT) systems.

Introduction

Translation tasks often struggle with the specialized lexicons and jargon unique to social groups, which standard MT systems typically overlook. The creative language used in these communities, often described as social dialects, necessitates a more nuanced approach to translation. The KpopMT dataset comprises 1,000 expertly translated Korean-English sentence pairs, specifically annotated with terminologies used within the Kpop fandom. By highlighting the limitations of state-of-the-art translation systems in dealing with such specialized terminology, this paper aims to advance MT research in areas that involve social group-specific lexicons.

Dataset Construction

The KpopMT dataset was developed in two phases: sentence collection and terminology annotation. Sentences rich in fandom-specific terminology were identified from social media and fan-related websites, then translated into English by expert translators familiar with Kpop jargon. Following this, the terminology in each sentence pair was annotated, creating a detailed termbase that categorizes terms into Group-Lexicon (fandom-specific lexicon), Group-NE (named entities within the fandom), and Slang (internet slang).

Table 1 in the paper gives examples of specialized terminology in both Korean and English, illustrating the complexity involved. Because of this granularity, translation models trained or evaluated on the dataset have to engage with the socio-linguistic nuances of the source material.
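To make the annotation scheme concrete, here is a minimal sketch of how an annotated sentence pair and the resulting termbase could be represented. The three category names come from the description above; the class names, field names, and the `build_termbase` helper are hypothetical and are not taken from the released dataset, whose actual file format may differ. The example sentence is an invented illustration, not a row from KpopMT.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, Iterable, List, Set, Tuple


class TermCategory(Enum):
    GROUP_LEXICON = "Group-Lexicon"  # fandom-specific lexicon
    GROUP_NE = "Group-NE"            # named entities within the fandom
    SLANG = "Slang"                  # internet slang


@dataclass(frozen=True)
class TermAnnotation:
    # Hypothetical field names; the released data may use a different schema.
    source_term: str   # term as it appears in the Korean sentence
    target_term: str   # term as it appears in the English translation
    category: TermCategory


@dataclass
class AnnotatedPair:
    source: str        # Korean sentence
    target: str        # expert English translation
    terms: List[TermAnnotation] = field(default_factory=list)


def build_termbase(
    pairs: Iterable[AnnotatedPair],
) -> Dict[TermCategory, Set[Tuple[str, str]]]:
    """Collect the distinct (source, target) term pairs seen in each category."""
    termbase: Dict[TermCategory, Set[Tuple[str, str]]] = {
        category: set() for category in TermCategory
    }
    for pair in pairs:
        for term in pair.terms:
            termbase[term.category].add((term.source_term, term.target_term))
    return termbase


# Invented illustration: "최애" is commonly rendered as "bias" in Kpop fandom.
example = AnnotatedPair(
    source="내 최애 컴백",
    target="my bias comeback",
    terms=[TermAnnotation("최애", "bias", TermCategory.GROUP_LEXICON)],
)
termbase = build_termbase([example])
```

Keeping the category on each annotation, rather than only in the termbase, lets later analysis break scores down per category.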

Evaluation and Results

The authors evaluated several existing MT systems on the KpopMT dataset, including open-source models such as M2M and mBART and proprietary systems such as Google Translate and OpenAI's GPT variants. Performance was measured with standard translation quality metrics (BLEU, COMET, and chrF++) and with terminology accuracy metrics (Exact-Match Term Accuracy (EMA) and 1-TERm).

  • GPT Models: The GPT-4 model achieved the highest EMA score (26.4%) among all tested systems. However, it faced challenges in accurately generating less common Group-Lexicon terms.
  • mBART and Standard Language MT: Systems trained exclusively on general language data showed moderate success in translation quality but fell short in terminological accuracy, emphasizing the need for domain-specific data.
  • Data Adaptation Models: Domain adaptation techniques that leverage fandom-specific monolingual data through back-translation showed mixed results. Noise introduced by back-translation may have hindered effective model training.
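As a rough illustration of what an exact-match terminology metric measures, the sketch below counts how many annotated target terms appear verbatim in the system output. This follows one common definition of exact-match term accuracy; the paper's exact formulation (tokenization, casing, handling of inflected forms) is not given in this summary, so the function name, the case-insensitive substring matching, and the example sentences are all assumptions.

```python
from typing import List, Sequence


def exact_match_term_accuracy(
    hypotheses: Sequence[str],
    reference_terms: Sequence[List[str]],
) -> float:
    """Fraction of annotated target terms that appear verbatim in the outputs.

    hypotheses:      one system translation per sentence.
    reference_terms: for each sentence, the target-side terms the
                     annotators marked in the reference translation.
    """
    matched = 0
    total = 0
    for hypothesis, terms in zip(hypotheses, reference_terms):
        hypothesis_lower = hypothesis.lower()
        for term in terms:
            total += 1
            # Assumption: case-insensitive substring match. This is lenient
            # ("bias" also matches "biased"); a stricter variant would match
            # on token boundaries.
            if term.lower() in hypothesis_lower:
                matched += 1
    return matched / total if total else 0.0


# Invented examples, not drawn from KpopMT.
hypotheses = [
    "My bias is finally having a comeback",
    "I love my favorite member",  # fandom term "bias" was paraphrased away
]
reference_terms = [["bias", "comeback"], ["bias"]]
score = exact_match_term_accuracy(hypotheses, reference_terms)
```

The second hypothesis is a fluent translation that a general quality metric might score well, yet it drops the fandom term. That gap between general quality and terminology accuracy is what the bullets above describe.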

Implications and Future Research

This research highlights significant gaps in current MT systems, particularly in handling specialized terminologies of social groups. Given that terminological accuracy is crucial for maintaining the cultural and social integrity of translations within these communities, the KpopMT dataset lays the groundwork for more robust and culturally aware MT methodologies.

From a theoretical standpoint, this work suggests that future translation models must better integrate socio-linguistic data and may benefit from more sophisticated data filtering and noise reduction techniques in back-translation processes.
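One simple form such filtering could take is a heuristic pass over the synthetic pairs before training. The sketch below is not the authors' method: the function name, the length-ratio heuristic, and the threshold values are illustrative assumptions, and a real pipeline would likely add stronger signals such as model-based quality scores.

```python
from typing import Iterable, List, Tuple


def filter_back_translated(
    pairs: Iterable[Tuple[str, str]],
    min_ratio: float = 0.25,  # placeholder threshold, not tuned
    max_ratio: float = 4.0,   # placeholder threshold, not tuned
) -> List[Tuple[str, str]]:
    """Drop obviously noisy (synthetic, original) back-translation pairs.

    A pair is kept only if both sides are non-empty, the synthetic side is
    not a verbatim copy of the original, and the character-length ratio
    synthetic/original lies within [min_ratio, max_ratio].
    """
    kept = []
    for synthetic, original in pairs:
        synthetic, original = synthetic.strip(), original.strip()
        if not synthetic or not original:
            continue  # one side is empty
        if synthetic == original:
            continue  # the model copied its input instead of translating
        ratio = len(synthetic) / len(original)
        if min_ratio <= ratio <= max_ratio:
            kept.append((synthetic, original))
    return kept


# Invented examples for illustration only.
candidates = [
    ("짧은 문장입니다", "This is a short sentence"),  # plausible pair
    ("", "empty synthetic side"),                     # dropped: empty
    ("same", "same"),                                 # dropped: copied
    ("가", "a very long sentence that does not match at all"),  # dropped: ratio
]
clean = filter_back_translated(candidates)
```

Length-ratio checks are cheap but coarse; they will not catch fluent output that mistranslates a fandom term, which is the harder case this work points to.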

Conclusion

The KpopMT dataset offers a pivotal resource in the pursuit of more accurate and socially aware MT. The low performance of existing systems on KpopMT underscores the importance of developing targeted MT solutions that can understand and accurately translate group-specific lexicons. Future research could expand this dataset to other social groups, fostering further advancements in the domain-specific translation paradigm.

In summary, the paper contributes both a dataset and evaluation insights for improving MT systems that serve dynamically evolving social communities, a step toward more context-aware translation.