Machine Translation Hallucination Detection for Low and High Resource Languages using LLMs
This paper investigates the challenge of detecting hallucinations in machine translation (MT) systems, with an emphasis on both high-resource languages (HRLs) and low-resource languages (LRLs). Large language models (LLMs) are evaluated for their efficacy in identifying hallucinations across these languages. The study spans 16 language pairs in a massively multilingual setting, comparing the performance of several LLMs and embedding-based methods.
Background and Problem Statement
Recent advances in multilingual MT systems have significantly improved translation accuracy. Despite these improvements, hallucinations, instances where the model generates information not present in the source text, remain a critical issue that markedly impairs user trust. Hallucination detection has so far been successful mainly in HRLs, leaving a substantial performance gap for LRLs. The paper assesses a range of LLMs and embedding spaces for hallucination detection on the \halomi benchmark dataset, which covers both HRLs and LRLs and thus provides a comprehensive evaluation scope.
Methodology
The paper utilizes the \halomi benchmark dataset, conducting a large-scale assessment involving:
- LLMs: Eight models, each with several prompt variations, were tested: GPT-4 Turbo, GPT-4o, Command R, \crplus, Mixtral 8x22B, Claude Sonnet, Claude Opus, and \llama (a minimal prompting sketch follows this list).
- Embedding Spaces: Four spaces were analyzed: OpenAI's text-embedding-3-large, Cohere's Embed v3, Mistral's mistral-embed, and SONAR (the basis of the current state of the art, BLASER-QE).
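As a concrete illustration of the prompt-based setup, the following is a minimal sketch of binary hallucination detection with one of the evaluated chat models, here GPT-4o via OpenAI's chat completions API. The prompt wording, the one-word answer format, and the parsing logic are illustrative assumptions, not the paper's actual prompts.

```python
# Minimal sketch of prompt-based binary hallucination detection.
# The prompt wording and output parsing are illustrative assumptions,
# not the exact prompts evaluated in the paper.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def detect_hallucination(source: str, translation: str,
                         src_lang: str, tgt_lang: str) -> bool:
    """Return True if the model judges the translation to be hallucinated."""
    prompt = (
        f"You are evaluating a machine translation from {src_lang} to {tgt_lang}.\n"
        f"Source: {source}\n"
        f"Translation: {translation}\n"
        "Does the translation contain hallucinated content, i.e. information "
        "not supported by the source text? Answer with exactly one word: yes or no."
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # deterministic judgments for evaluation
    )
    return response.choices[0].message.content.strip().lower().startswith("yes")
```

In practice, each of the eight models would be run with several such prompt variants, with the best variant per model selected on validation data, as described next.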
The evaluation framework covers binary hallucination detection and hallucination severity ranking. In the binary detection setting, performance is measured with the Matthews Correlation Coefficient (MCC). The optimal prompt for each LLM was selected based on validation results for the English↔German (EN↔DE) directions. For embedding spaces, the cosine similarity between the source and translated texts was used as the detection signal, with decision thresholds optimized on the validation set, as sketched below.
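For reference, MCC is computed from the binary confusion matrix as

$$\mathrm{MCC} = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP + FP)\,(TP + FN)\,(TN + FP)\,(TN + FN)}},$$

and ranges from -1 to 1, with 0 corresponding to chance-level detection, which makes it robust to the class imbalance typical of hallucination data. The following is a minimal sketch of the embedding-based detector under stated assumptions: source and translation sentence embeddings are already computed (e.g., with SONAR or a commercial embedding API), and the threshold is chosen by maximizing MCC on the validation set. Function names and the threshold search are illustrative, not taken from the paper.

```python
# Minimal sketch: embedding-based hallucination detection via cosine similarity.
# Assumes (n, d) arrays of precomputed sentence embeddings; all names here
# are illustrative, not the paper's implementation.
import numpy as np
from sklearn.metrics import matthews_corrcoef


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Row-wise cosine similarity between two (n, d) embedding matrices."""
    a_norm = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_norm = b / np.linalg.norm(b, axis=1, keepdims=True)
    return np.sum(a_norm * b_norm, axis=1)


def fit_threshold(sims_val: np.ndarray, labels_val: np.ndarray) -> float:
    """Pick the similarity threshold that maximizes MCC on validation data.

    A translation is flagged as hallucinated (label 1) when its
    source-translation similarity falls below the threshold.
    """
    candidates = np.unique(sims_val)
    scores = [matthews_corrcoef(labels_val, (sims_val < t).astype(int))
              for t in candidates]
    return float(candidates[int(np.argmax(scores))])


# Usage: fit the threshold on validation data, then apply it to test data.
# threshold = fit_threshold(cosine_similarity(src_val, mt_val), y_val)
# preds_test = (cosine_similarity(src_test, mt_test) < threshold).astype(int)
```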
Key Findings
- Performance of LLMs: The paper demonstrates that LLMs outperform previously proposed detection methods across both HRLs and LRLs.
- For HRLs, \llama significantly outperforms BLASER-QE with an MCC improvement of 16 points.
- For LRLs, Claude Sonnet surpasses the other methods by an average of 0.03 MCC, although the improvement over existing models is smaller than in the HRL setting.
- Embedding-based Methods:
- Embedding methods remain competitive in high-resource settings, excelling in particular for translation directions involving non-Latin scripts such as Arabic (AR), Russian (RU), and Chinese (ZH), which suggests strong cross-script transfer capabilities.
- SONAR embeddings perform comparably to or better than BLASER-QE in most HRL directions, indicating that detector performance can depend heavily on the quality of the underlying training data.
- LRL Performance Discrepancies: No single LLM uniformly excels across all LRL directions.
- \llama performs best overall, but other models outperform it in specific LRL contexts.
- For non-English-centric directions such as Spanish→Yoruba (ES→YO), Claude Opus leads, suggesting that LLMs retain strong analytical capabilities even when relevant training data is limited.
Implications
The findings underscore the importance of selecting models according to the requirements of the specific context, especially resource levels and translation directions. The significant performance gains delivered by LLMs, despite their lack of explicit training for MT tasks, point to the broader applicability of these models in diverse linguistic contexts. Moreover, the competitive performance of embedding-based methods, particularly in HRLs, suggests their continued relevance in MT quality assessment frameworks.
Future Directions
The paper highlights several avenues for future research:
- Improved LRL Performance: There remains a need for models that offer robust performance across LRLs, suggesting potential in specialized training or fine-tuning for these languages.
- Cross-script and Non-English-centric Translation Evaluation: Developing methods that can handle the nuances of non-Latin scripts and non-English-centric translations effectively.
- Dataset Expansion: Expanding the \halomi dataset to include more diverse and balanced language pairs, addressing the class imbalances observed in the paper.
Conclusion
This work demonstrates the effectiveness of LLMs and embedding-based semantic similarity for hallucination detection, establishing new state-of-the-art results for most evaluated language pairs. The research advances the understanding of MT hallucination detection across a wide spectrum of languages and scripts, and advocates for future work that prioritizes LRLs and more complex multilingual translation scenarios. It is a useful reference for the MT research community as it navigates the intricate dynamics of hallucination detection, paving the way for more reliable and trustworthy translation systems.