Steering Large Language Models for Machine Translation with Finetuning and In-Context Learning (2310.13448v1)

Published 20 Oct 2023 in cs.CL

Abstract: LLMs are a promising avenue for machine translation (MT). However, current LLM-based MT systems are brittle: their effectiveness highly depends on the choice of few-shot examples and they often require extra post-processing due to overgeneration. Alternatives such as finetuning on translation instructions are computationally expensive and may weaken in-context learning capabilities, due to overspecialization. In this paper, we provide a closer look at this problem. We start by showing that adapter-based finetuning with LoRA matches the performance of traditional finetuning while reducing the number of training parameters by a factor of 50. This method also outperforms few-shot prompting and eliminates the need for post-processing or in-context examples. However, we show that finetuning generally degrades few-shot performance, hindering adaptation capabilities. Finally, to obtain the best of both worlds, we propose a simple approach that incorporates few-shot examples during finetuning. Experiments on 10 language pairs show that our proposed approach recovers the original few-shot capabilities while keeping the added benefits of finetuning.

This paper investigates methods for adapting LLMs, specifically LLaMA 7B and 13B, for Machine Translation (MT), comparing few-shot prompting, traditional full finetuning, and parameter-efficient finetuning using Low-Rank Adaptation (LoRA). It aims to find a balance between translation quality, computational efficiency, and the ability to adapt using in-context examples (Alves et al., 2023).

Key Findings and Implementation Details:

  1. Efficient Finetuning with LoRA:
    • LoRA proves highly effective for specializing LLMs for MT: it matches the translation quality (measured by COMET, BLEU, and other metrics) of traditional full finetuning while training roughly 50x fewer parameters (about 134M for LLaMA 7B versus the full 6.7B).
    • Practical Implication: This drastically reduces the computational resources (GPU memory, time) needed for finetuning, making it more accessible. You can achieve strong MT performance without needing to update the entire LLM.
    • LoRA-finetuned models outperform few-shot prompting of the pretrained LLM, even when finetuned on relatively small amounts of parallel data (gains appear with as few as 2,000 examples).
    • Finetuning (both full and LoRA) inherently resolves the "overgeneration" issue common with prompted pretrained LLMs, where the model keeps generating text beyond the translation. Finetuned models learn to stop appropriately by emitting the EOS token, eliminating the need for manual post-processing such as cutting off text after the first newline character (See Figure 4, Appendix F).
  2. Degradation of Few-Shot Capability after Standard Finetuning:
    • A significant drawback is that standard finetuning (even parameter-efficient LoRA finetuning) degrades the LLM's ability to use few-shot examples provided at inference time.
    • Observation: When a model finetuned without few-shot examples in its training data is given few-shot examples during inference, its performance often drops below its zero-shot performance.
    • Impact: This hinders the model's ability to adapt on-the-fly to specific domains, styles, or terminology using provided examples, which is a key advantage of LLMs. The degradation was observed across general (Flores, WMT) and specialized domains (Medical, Law, Tico, Chat) (See Figure 3).
  3. Proposed Solution: Finetuning with Few-Shot Examples:
    • To address the degradation, the paper proposes a simple yet effective method: include few-shot examples during the finetuning process.
    • Implementation: Modify the training data format. For each training instance, randomly sample between 0 and 5 relevant translation examples (from a held-out pool) and format them into the instruction prompt along with the source sentence to be translated. Finetuning then proceeds with LoRA on this mixed data (a blend of zero-shot and few-shot instructions); see the data-construction sketch after this list.
    • Prompt Format: Appendix A details prompt templates. A successful format separates the examples section from the final translation task instruction (See Table 3, Format 2).
      Consider the following N translations from X to Y.
      Example 1
      Source: ...
      Target: ...
      ...
      Example N
      Source: ...
      Target: ...
      
      Translate the source text from X to Y.
      Source: ...
      Target: [Model generates this]
    • Result: This "finetuning with few-shot examples" approach successfully recovers the model's in-context learning ability. When evaluated, providing few-shot examples at inference time improves performance over zero-shot, even on specialized domains (See Figure 3).
    • Benefit: This method achieves the "best of both worlds": the high zero-shot performance and improved output formatting from finetuning, combined with the adaptive capabilities of in-context learning.
  4. Analysis of In-Context Example Influence:
    • While the proposed finetuning method restores the average benefit of few-shot examples, the effect per instance varies.
    • Pros: Few-shot examples can correct significant errors, such as translating into the wrong language (See Table 9, Appendix E).
    • Cons: Sometimes, providing few-shot examples (even to the model finetuned with examples) can introduce hallucinations or degrade an otherwise correct zero-shot translation (See Table 10, Appendix E).
    • Mitigation: Finetuning with few-shot examples significantly reduces the rate of these few-shot-induced hallucinations compared to models finetuned without examples, but does not eliminate them completely (See Table 1). This suggests the training helps the model become more robust but not perfectly immune to misleading examples.
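
The implementation bullet in item 3 is straightforward to realize in code. Below is a minimal sketch of that training-data construction, assuming a simple pool of (source, target) example pairs; the function and field names (build_instance, prompt, completion) are illustrative choices rather than the authors' released code, and the template mirrors the Format-2-style prompt reproduced above.

```python
import random

def build_instance(src, tgt, example_pool, src_lang="English", tgt_lang="German",
                   max_shots=5):
    """Format one training instance with 0-5 randomly sampled in-context examples.

    example_pool is a list of (source, target) pairs held out from the training set.
    """
    k = random.randint(0, min(max_shots, len(example_pool)))  # mix zero- and few-shot
    parts = []
    if k > 0:
        shots = random.sample(example_pool, k)
        parts.append(f"Consider the following {k} translations from {src_lang} to {tgt_lang}.")
        for i, (ex_src, ex_tgt) in enumerate(shots, start=1):
            parts.append(f"Example {i}")
            parts.append(f"Source: {ex_src}")
            parts.append(f"Target: {ex_tgt}")
        parts.append("")  # blank line separating the examples from the task instruction
    parts.append(f"Translate the source text from {src_lang} to {tgt_lang}.")
    parts.append(f"Source: {src}")
    parts.append("Target:")
    # The completion (target translation) is what the model is trained to generate,
    # followed by the EOS token so that the finetuned model learns to stop.
    return {"prompt": "\n".join(parts), "completion": f" {tgt}"}
```

Sampling the number of examples per instance (rather than always using five) is what exposes the model to both zero-shot and few-shot instruction formats during finetuning.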

Practical Summary for Implementation:

  • Use LoRA for efficiently finetuning LLMs like LLaMA for MT tasks. It offers performance comparable to full finetuning at a fraction of the computational cost (see the setup sketch after this list).
  • Standard LoRA finetuning improves zero-shot MT quality and fixes overgeneration but harms the ability to adapt using few-shot examples at inference.
  • To retain adaptability, mix few-shot examples into the LoRA finetuning data. Randomly include 0-5 examples in the instruction prompt for each training instance.
  • This combined approach yields a model that performs well zero-shot, doesn't overgenerate, and can effectively leverage few-shot examples provided at inference time for domain/style adaptation.
  • Be aware that even with this improved finetuning, providing few-shot examples at inference can occasionally introduce errors or hallucinations, although the rate is reduced.
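
As a concrete starting point for the first bullet, here is a minimal sketch of an adapter-based (LoRA) setup using the Hugging Face PEFT library. The checkpoint name, rank, alpha, and target modules below are assumptions for illustration; the paper's actual hyperparameters are listed in Appendix A.2.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base_model_name = "huggyllama/llama-7b"  # assumed checkpoint identifier for illustration

tokenizer = AutoTokenizer.from_pretrained(base_model_name)
model = AutoModelForCausalLM.from_pretrained(base_model_name)

lora_config = LoraConfig(
    r=16,                                 # adapter rank (assumed value)
    lora_alpha=32,                        # scaling factor (assumed value)
    target_modules=["q_proj", "v_proj"],  # attention projections (assumed choice)
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

# Wrap the frozen base model with trainable low-rank adapters.
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # adapter parameters vs. the ~6.7B frozen base
```

The wrapped model can then be trained on the mixed zero-/few-shot instances produced by the data-construction sketch above, using any standard causal-LM training loop or trainer.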

The paper provides detailed hyperparameters (Appendix A.2) and evaluation results across multiple metrics and datasets (Appendix G) for LLaMA 7B and 13B on 10 language pairs (mostly English-centric).
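
To illustrate the inference-time behavior described above (few-shot prompting of the finetuned model, with generation stopping at the EOS token and no newline-based post-processing), here is a hedged sketch that reuses the model and tokenizer from the setup sketch; the language pair and example sentences are invented for illustration.

```python
import torch

# Few-shot prompt following the template from item 3 (two invented examples).
prompt = (
    "Consider the following 2 translations from English to German.\n"
    "Example 1\nSource: Hello.\nTarget: Hallo.\n"
    "Example 2\nSource: Thank you.\nTarget: Danke.\n"
    "\n"
    "Translate the source text from English to German.\n"
    "Source: How are you?\n"
    "Target:"
)

inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    output_ids = model.generate(
        **inputs,
        max_new_tokens=128,
        eos_token_id=tokenizer.eos_token_id,  # the finetuned model emits EOS after the translation
    )

# Decode only the newly generated tokens; no truncation at the first newline is needed.
translation = tokenizer.decode(
    output_ids[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)
print(translation.strip())
```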

Authors (8)
  1. Duarte M. Alves (7 papers)
  2. Nuno M. Guerreiro (27 papers)
  3. João Alves (84 papers)
  4. José Pombal (15 papers)
  5. Ricardo Rei (34 papers)
  6. José G. C. de Souza (12 papers)
  7. Pierre Colombo (48 papers)
  8. André F. T. Martins (113 papers)
Citations (40)