Evaluating ChatGPT's Machine Translation Capabilities
This paper presents a systematic evaluation of ChatGPT's performance on machine translation tasks, including results obtained with the GPT-4 engine. It addresses several critical aspects of ChatGPT's translation capabilities: prompt design, multilingual handling, and robustness across varied domains. By benchmarking against prevalent commercial systems such as Google Translate, the authors illustrate both the strengths and the limitations of ChatGPT's current translation performance.
Core Findings
The research systematically explores ChatGPT's translation quality across high-resource European languages and lower-resource or linguistically distant languages. Testing on various benchmark datasets yields the following findings:
- Translation Quality: ChatGPT achieves competitive results on high-resource European languages, but its performance declines significantly for low-resource or linguistically distant languages, highlighting an area for improvement. This discrepancy is consistent with the challenges typical of models trained on unevenly distributed language resources (a minimal metric-scoring sketch appears after this list).
- Prompt Engineering: The choice of prompt measurably affects ChatGPT's translation output. The paper discusses how prompt phrasing influences translation accuracy, underlining the complex interaction between prompt design and LLM outputs (see the prompt-comparison sketch after this list).
- Robustness Analysis: On domain-specific or noisy data, such as biomedical abstracts or Reddit comments, ChatGPT falls short of specialized commercial systems, though it performs relatively well on spoken-language datasets. This suggests promise in conversational contexts, but a lack of the robustness required for technical or informal text domains.
- Advancements with GPT-4: The introduction of GPT-4 marks a notable enhancement in translation quality, narrowing the gap with specialized systems even for challenging language pairs. This improvement is credited in part to a reduction in hallucination and mis-translation errors previously observed with GPT-3.5.
- Pivot Prompting: The paper explores a pivot prompting strategy in which the source text is first translated into a high-resource language (e.g., English) and then from that pivot into the target language. This approach substantially improves translation quality for distant language pairs by leveraging the model's stronger capabilities in high-resource languages (see the final sketch after this list).
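
To make the quality comparisons concrete, the sketch below scores a set of system outputs against references with corpus-level BLEU using the sacrebleu package. The sentences are toy placeholders, not data from the paper's benchmarks, and BLEU is shown as one common automatic metric rather than the paper's full evaluation protocol.

```python
# Sketch: corpus-level BLEU scoring with sacrebleu.
# The hypothesis/reference pairs below are illustrative placeholders.
import sacrebleu

hypotheses = [
    "The weather is nice today.",
    "I would like a cup of coffee.",
]
references = [
    "The weather is beautiful today.",
    "I would like a cup of coffee.",
]

# sacrebleu expects a list of hypothesis strings and a list of reference streams.
bleu = sacrebleu.corpus_bleu(hypotheses, [references])
print(f"BLEU = {bleu.score:.1f}")
```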
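The following sketch illustrates how different prompt templates can be compared for the same source sentence. It is a minimal example using the OpenAI Python SDK; the template wordings, the model name, and the sample sentence are assumptions for illustration, not the exact prompts or settings evaluated in the paper.

```python
# Sketch: comparing candidate translation prompts via the OpenAI Python SDK.
# Template wordings and the model name are illustrative assumptions.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

PROMPT_TEMPLATES = [
    "Translate these sentences from {src} to {tgt}:\n{text}",
    "What do these sentences mean in {tgt}?\n{text}",
    "Please provide the {tgt} translation for these sentences:\n{text}",
]

def translate(text: str, src: str, tgt: str, template: str, model: str = "gpt-4") -> str:
    """Request a translation using one prompt template."""
    prompt = template.format(src=src, tgt=tgt, text=text)
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # reduce sampling variance for a fairer comparison
    )
    return response.choices[0].message.content.strip()

# Compare how each template renders the same source sentence.
for template in PROMPT_TEMPLATES:
    print(translate("Das Wetter ist heute schön.", "German", "English", template))
```

Holding the source text fixed while varying only the template isolates the effect of prompt phrasing on the output.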
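Finally, here is one possible realization of pivot prompting, issuing two separate requests (source to pivot, then pivot to target); the paper may fold both steps into a single chained prompt. The model name, prompt wording, and example language pair are assumptions for illustration.

```python
# Sketch: pivot prompting, translating source -> pivot -> target in two calls.
# Model name and prompt wording are illustrative assumptions.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def ask(prompt: str, model: str = "gpt-4") -> str:
    """Single chat-completion call returning the model's text response."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content.strip()

def pivot_translate(text: str, src: str, tgt: str, pivot: str = "English") -> str:
    """Translate src -> pivot, then pivot -> tgt, leveraging the model's
    stronger performance in the high-resource pivot language."""
    intermediate = ask(f"Translate these sentences from {src} to {pivot}:\n{text}")
    return ask(f"Translate these sentences from {pivot} to {tgt}:\n{intermediate}")

# Example: a distant pair (German -> Chinese) routed through a high-resource pivot.
print(pivot_translate("Das Wetter ist heute schön.", "German", "Chinese"))
```

Note that this strategy doubles the number of inference calls per sentence, which is the computational overhead discussed below.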
Implications and Future Directions
The authors' use of pivot prompting offers an insightful approach to overcoming the limitations of low-resource and distant-language translation, though challenges remain in inference speed and computational overhead, since each sentence requires two translation passes. The transition to GPT-4 shows substantial promise, suggesting that continued model improvements will further narrow the gap with specialized translation systems. The paper also identifies avenues for further exploration, such as extending the evaluation to other translation abilities, including document-level and context-constrained translation.
While ChatGPT achieves commendable performance on some language pairs, the paper makes clear that further enhancements are needed for broader applicability. The research lays foundational insights for the continued development of LLMs in translation, pointing to the nuanced challenge of achieving uniform performance across diverse linguistic, cultural, and domain-specific contexts. Future work can build on these findings by investigating optimized prompt strategies and model architectures that improve performance on low-resource languages, potentially incorporating more advanced multilingual pre-training techniques.