
Machine Translation Hallucination Detection for Low and High Resource Languages using Large Language Models (2407.16470v3)

Published 23 Jul 2024 in cs.CL and cs.AI

Abstract: Recent advancements in massively multilingual machine translation systems have significantly enhanced translation accuracy; however, even the best performing systems still generate hallucinations, severely impacting user trust. Detecting hallucinations in Machine Translation (MT) remains a critical challenge, particularly since existing methods excel with High-Resource Languages (HRLs) but exhibit substantial limitations when applied to Low-Resource Languages (LRLs). This paper evaluates sentence-level hallucination detection approaches using LLMs and semantic similarity within massively multilingual embeddings. Our study spans 16 language directions, covering HRLs and LRLs with diverse scripts. We find that the choice of model is essential for performance. On average, for HRLs, Llama3-70B outperforms the previous state of the art by as much as 0.16 MCC (Matthews Correlation Coefficient). However, for LRLs we observe that Claude Sonnet outperforms other LLMs on average by 0.03 MCC. The key takeaway from our study is that LLMs can achieve performance comparable to, or even better than, previously proposed models, despite not being explicitly trained for any machine translation task. However, their advantage is less significant for LRLs.

Machine Translation Hallucination Detection for Low and High Resource Languages using LLMs

This paper investigates the challenge of detecting hallucinations in machine translation (MT) systems, with an emphasis on both high-resource languages (HRLs) and low-resource languages (LRLs). LLMs are evaluated for their efficacy in identifying hallucinations across these languages. The paper spans 16 language directions in a massively multilingual setting, examining how various LLMs and embedding-based methods compare.

Background and Problem Statement

Recent advancements in multilingual MT systems have enhanced translation accuracy significantly. Despite these improvements, hallucinations (instances where the model generates information not present in the source text) remain a critical issue, markedly impairing user trust. The detection of hallucinations has predominantly been successful in HRLs, leaving a substantial performance gap when applied to LRLs. The paper assesses a range of LLMs and embedding spaces for hallucination detection, utilizing the HalOmi benchmark dataset, which encompasses both HRLs and LRLs to provide a comprehensive evaluation scope.

Methodology

The paper utilizes the HalOmi benchmark dataset, conducting a large-scale assessment involving:

  1. LLMs: Eight models with different prompt variations were tested: GPT-4 Turbo, GPT-4o, Command R, Command R+, Mixtral 8x22B, Claude Sonnet, Claude Opus, and Llama3-70B (a minimal prompting sketch follows this list).
  2. Embedding Spaces: Four spaces were analyzed: OpenAI's text-embedding-3-large, Cohere's Embed v3, Mistral's mistral-embed, and SONAR (the basis of the previous state of the art, BLASER-QE).
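
As a concrete illustration of the prompt-based setup, the sketch below shows how a binary hallucination-detection query might be issued through a chat-completion API. This is a minimal sketch, not the paper's configuration: the prompt wording, the model name, and the YES/NO label parsing are all illustrative assumptions.

```python
# Minimal sketch of prompt-based binary hallucination detection.
# The prompt text and model name are illustrative placeholders; the paper
# selected the best-performing prompt per LLM on an EN<->DE validation split.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

PROMPT = (
    "You are a translation quality judge. Given a source sentence and its "
    "machine translation, answer YES if the translation contains a "
    "hallucination (content not supported by the source), otherwise NO.\n"
    "Source ({src_lang}): {src}\n"
    "Translation ({tgt_lang}): {tgt}\n"
    "Answer:"
)

def detect_hallucination(src: str, tgt: str, src_lang: str, tgt_lang: str) -> bool:
    """Return True if the model labels the translation as a hallucination."""
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder; the paper compares eight different LLMs
        messages=[{"role": "user", "content": PROMPT.format(
            src_lang=src_lang, src=src, tgt_lang=tgt_lang, tgt=tgt)}],
        temperature=0,
    )
    return response.choices[0].message.content.strip().upper().startswith("YES")
```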

The evaluation framework considers binary hallucination detection and severity ranking. In the binary detection setting, performance is measured by Matthews Correlation Coefficient (MCC). The optimal prompt for each LLM was selected based on validation results on the EN↔DE directions. For embedding spaces, cosine similarity between source and translated texts was used, with detection thresholds optimized on the validation set.
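
To make the embedding-based side concrete, below is a minimal sketch of cosine-similarity scoring with a threshold tuned to maximize MCC, where MCC = (TP·TN - FP·FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)). It assumes that pairs whose source-translation similarity falls below the tuned threshold are flagged as hallucinations; `embed()` is a hypothetical stand-in for any of the four embedding spaces studied.

```python
# Sketch of embedding-based detection: low source-translation cosine
# similarity is treated as a hallucination signal, with the decision
# threshold tuned to maximize MCC on a labelled validation set.
import numpy as np
from sklearn.metrics import matthews_corrcoef

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def tune_threshold(val_sims: np.ndarray, val_labels: np.ndarray) -> float:
    """Pick the similarity cut-off that maximizes MCC on validation data.

    val_labels: 1 = hallucination, 0 = faithful translation.
    A pair is flagged when its similarity falls below the threshold.
    """
    best_t, best_mcc = 0.0, -1.0
    for t in np.unique(val_sims):
        preds = (val_sims < t).astype(int)
        mcc = matthews_corrcoef(val_labels, preds)
        if mcc > best_mcc:
            best_t, best_mcc = t, mcc
    return best_t

# Usage (embed() is a hypothetical embedding call, e.g. a SONAR or
# text-embedding-3-large client wrapper):
# threshold = tune_threshold(val_sims, val_labels)
# flagged = cosine_similarity(embed(src), embed(tgt)) < threshold
```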

Key Findings

  1. Performance of LLMs: The paper shows that LLMs achieve the strongest hallucination detection performance across both HRLs and LRLs, though their advantage narrows for LRLs.
    • For HRLs, Llama3-70B outperforms BLASER-QE by as much as 0.16 MCC.
    • For LRLs, Claude Sonnet surpasses other methods by an average of 0.03 MCC, although the overall improvement over existing models is smaller.
  2. Embedding-based Methods:
    • Embedding methods remain competitive in high-resource settings, particularly excelling for translation directions involving non-Latin scripts such as AR, RU, and ZH, suggesting high cross-script transfer learning capabilities.
    • SONAR embeddings perform comparably to or better than BLASER-QE in most HRL directions, suggesting that much of BLASER-QE's performance stems from its underlying SONAR representations.
  3. LRL Performance Discrepancies: No single LLM uniformly excels across all LRL directions.
    • Llama3-70B performs best overall, but other models outperform it in specific LRL contexts.
    • For non-English-centric directions, such as ES↔YO, Claude Opus leads, indicating that LLMs can remain effective even when relevant training data is scarce.

Implications

The findings underscore the importance of selecting models appropriate to the specific context, especially the resource level and translation direction involved. The performance gains achieved by LLMs, despite their lack of explicit training for MT tasks, point to a broader applicability of these models in diverse linguistic contexts. Moreover, the competitive performance of embedding-based methods, particularly in HRLs, suggests their continued relevance in MT quality assessment frameworks.

Future Directions

The paper highlights several avenues for future research:

  • Improved LRL Performance: There remains a need for models that offer robust performance across LRLs, suggesting potential in specialized training or fine-tuning for these languages.
  • Cross-script and Non-English-centric Translation Evaluation: Developing methods that can handle the nuances of non-Latin scripts and non-English-centric translations effectively.
  • Dataset Expansion: Expanding the HalOmi dataset to include more diverse and balanced language pairs, addressing the class imbalances observed in the paper.

Conclusion

This work demonstrates the effectiveness of LLMs and embedding-based semantic similarity for hallucination detection, establishing new state-of-the-art results for most evaluated language directions. The research advances the understanding of MT hallucination robustness across a wide spectrum of languages and scripts, and argues for future work that prioritizes LRLs and more complex multilingual translation scenarios, paving the way for more reliable and trustworthy translation systems.

References (25)
  1. Hallucinations in neural machine translation.
  2. SeamlessM4T: Massively multilingual & multimodal machine translation. Preprint, arXiv:2308.11596.
  3. Detecting and mitigating hallucinations in machine translation: Model internal workings alone do well, sentence similarity even better. Proceedings of the Annual Meeting of the Association for Computational Linguistics, 1:36–50.
  4. HalOmi: A manually annotated benchmark for multilingual hallucination and omission detection in machine translation. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing.
  5. SONAR: Sentence-level multimodal and language-agnostic representations.
  6. Beyond English-centric multilingual machine translation. Preprint, arXiv:2010.11125.
  7. Language-agnostic BERT sentence embedding. Proceedings of the Annual Meeting of the Association for Computational Linguistics, 1:878–891.
  8. The devil is in the errors: Leveraging large language models for fine-grained machine translation evaluation. In Proceedings of the Eighth Conference on Machine Translation, pages 1066–1083, Singapore. Association for Computational Linguistics.
  9. Analysing the classification of imbalanced data-sets with multiple classes: Binarization techniques and ad-hoc approaches. Knowledge-Based Systems, 42:97–110.
  10. xCOMET: Transparent machine translation evaluation through fine-grained error detection. Preprint, arXiv:2310.10482.
  11. Looking for a needle in a haystack: A comprehensive study of hallucinations in neural machine translation. In Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, pages 1059–1075, Dubrovnik, Croatia. Association for Computational Linguistics.
  12. Are large language model-based evaluators the solution to scaling up multilingual evaluation? EACL 2024 - 18th Conference of the European Chapter of the Association for Computational Linguistics, Findings of EACL 2024, pages 1051–1070.
  13. Bitext mining using distilled sentence representations for low-resource languages. Findings of the Association for Computational Linguistics: EMNLP 2022, pages 2101–2112.
  14. Tom Kocmi and Christian Federmann. 2023. Large language models are state-of-the-art evaluators of translation quality. Proceedings of the 24th Annual Conference of the European Association for Machine Translation, EAMT 2023, pages 193–203.
  15. Better zero-shot reasoning with role-play prompting. Preprint, arXiv:2308.07702.
  16. MADLAD-400: A multilingual and document-level large audited dataset. Preprint, arXiv:2309.04662.
  17. Lost in the middle: How language models use long contexts. Transactions of the Association for Computational Linguistics, 12:157–173.
  18. The effect of dataset size on training tweet sentiment classifiers. Proceedings - 2015 IEEE 14th International Conference on Machine Learning and Applications, ICMLA 2015, pages 96–102.
  19. SALTED: A framework for salient long-tail translation error detection. Findings of the Association for Computational Linguistics: EMNLP 2022, pages 5163–5179.
  20. Margarita Sordo and Qing Zeng. 2005. On sample size and classification accuracy: A performance comparison. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 3745 LNBI:193–201.
  21. No language left behind: Scaling human-centered machine translation. Preprint, arXiv:2207.04672.
  22. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35.
  23. Understanding and Detecting Hallucinations in Neural Machine Translation via Model Introspection. Transactions of the Association for Computational Linguistics, 11:546–564.
  24. INSTRUCTSCORE: Towards explainable text generation evaluation with automatic feedback. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 5967–5994, Singapore. Association for Computational Linguistics.
  25. Multilingual machine translation with large language models: Empirical results and analysis. Preprint, arXiv:2304.04675.
Authors (8)
  1. Kenza Benkirane
  2. Laura Gongas
  3. Shahar Pelles
  4. Naomi Fuchs
  5. Joshua Darmon
  6. Pontus Stenetorp
  7. David Ifeoluwa Adelani
  8. Eduardo Sánchez