A Preference-driven Paradigm for Enhanced Translation with Large Language Models (2404.11288v2)

Published 17 Apr 2024 in cs.CL

Abstract: Recent research has shown that LLMs can achieve remarkable translation performance through supervised fine-tuning (SFT) using only a small amount of parallel data. However, SFT simply instructs the model to imitate the reference translations at the token level, making it vulnerable to the noise present in the references. Hence, the assistance from SFT often reaches a plateau once the LLMs have achieved a certain level of translation capability, and further increasing the size of parallel data does not provide additional benefits. To overcome this plateau associated with imitation-based SFT, we propose a preference-based approach built upon the Plackett-Luce model. The objective is to steer LLMs towards a more nuanced understanding of translation preferences from a holistic view, while also being more resilient in the absence of gold translations. We further build a dataset named MAPLE to verify the effectiveness of our approach, which includes multiple translations of varying quality for each source sentence. Extensive experiments demonstrate the superiority of our approach in "breaking the plateau" across diverse LLMs and test settings. Our in-depth analysis underscores the pivotal role of diverse translations and accurate preference scores in the success of our approach.
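The preference-based objective described above builds on the Plackett-Luce model, which assigns a probability to a full ranking of candidates from their scores. The following is a minimal sketch of the corresponding negative log-likelihood loss, not the paper's implementation; the function name and plain-Python scoring interface are illustrative assumptions.

```python
import math

def plackett_luce_nll(scores):
    """Negative log-likelihood of a ranking under the Plackett-Luce model.

    `scores` holds model scores for candidate translations ordered from
    most to least preferred. The PL likelihood of this ordering is the
    product over positions i of exp(s_i) / sum_{j >= i} exp(s_j), i.e.
    each candidate is "chosen" from the pool of remaining candidates.
    """
    nll = 0.0
    for i in range(len(scores) - 1):
        # log-sum-exp over the not-yet-chosen candidates, stabilised
        # by subtracting the running maximum before exponentiating
        rest = scores[i:]
        m = max(rest)
        lse = m + math.log(sum(math.exp(s - m) for s in rest))
        nll += lse - scores[i]
    return nll
```

Minimising this loss pushes the model to score preferred translations above dispreferred ones over the whole ranked list, rather than imitating a single reference token by token; with only two candidates it reduces to the familiar Bradley-Terry pairwise objective.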
