This paper investigates methods for adapting LLMs, specifically LLaMA 7B and 13B, to Machine Translation (MT), comparing few-shot prompting, traditional full finetuning, and parameter-efficient finetuning with Low-Rank Adaptation (LoRA). The goal is to balance translation quality, computational efficiency, and the ability to adapt using in-context examples (Alves et al., 2023).
Key Findings and Implementation Details:
- Efficient Finetuning with LoRA:
- LoRA proves highly effective for specializing LLMs for MT. It matches the translation quality (measured by COMET, BLEU, etc.) of traditional full finetuning while training significantly fewer parameters (reportedly 50x fewer for LLaMA 7B: 134M vs. 6.7B). A setup sketch follows this list.
- Practical Implication: This drastically reduces the computational resources (GPU memory, time) needed for finetuning, making it more accessible. You can achieve strong MT performance without needing to update the entire LLM.
- LoRA-finetuned models outperform few-shot prompting of pretrained LLMs, even with relatively small amounts of parallel data (gains appear with as few as 2,000 training examples).
- Finetuning (both full and LoRA) inherently resolves the "overgeneration" issue common with prompted pretrained LLMs, where the model keeps generating text beyond the translation. Finetuned models learn to stop appropriately by emitting the EOS token, eliminating manual post-processing such as cutting off everything after the first newline character (See Figure 4, Appendix F).
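A minimal sketch of this kind of LoRA finetuning setup, using the Hugging Face transformers and peft libraries. The checkpoint name and adapter hyperparameters (rank, alpha, target modules) here are illustrative assumptions, not the paper's exact settings (those are in its Appendix A.2):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

model_name = "huggyllama/llama-7b"  # assumed checkpoint name
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

lora_config = LoraConfig(
    r=16,                                 # low-rank dimension (assumed)
    lora_alpha=32,                        # adapter scaling factor (assumed)
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt (assumed)
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the small adapter weights train
```

The base model's weights stay frozen; only the injected low-rank matrices are updated, which is what produces the large reduction in trainable parameters the paper reports.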
- Degradation of Few-Shot Capability after Standard Finetuning:
- A significant drawback: standard finetuning (even parameter-efficient LoRA finetuning) degrades the LLM's ability to use few-shot examples provided at inference time.
- Observation: When a model finetuned without few-shot examples in its training data is given few-shot examples during inference, its performance often drops below its zero-shot performance.
- Impact: This hinders the model's ability to adapt on the fly to specific domains, styles, or terminology using provided examples, which is a key advantage of LLMs. The degradation appears across both general (Flores, WMT) and specialized domains (Medical, Law, Tico, Chat) (See Figure 3).
- Proposed Solution: Finetuning with Few-Shot Examples:
- To address the degradation, the paper proposes a simple yet effective method: include few-shot examples during the finetuning process.
- Implementation: Modify the training data format. For each training instance, randomly sample between 0 and 5 relevant translation examples from a held-out pool and format them into the instruction prompt along with the source sentence to be translated. Finetuning then proceeds with LoRA on this mixed data (zero-shot and few-shot instructions); see the data-construction sketch after this list.
- Prompt Format: Appendix A details the prompt templates. A successful format separates the examples section from the final translation instruction (See Table 3, Format 2):

```
Consider the following N translations from X to Y.
Example 1
Source: ...
Target: ...
...
Example N
Source: ...
Target: ...
Translate the source text from X to Y.
Source: ...
Target: [Model generates this]
```
- Result: This "finetuning with few-shot examples" approach successfully recovers the model's in-context learning ability. When evaluated, providing few-shot examples at inference time improves performance over zero-shot, even on specialized domains (See Figure 3).
- Benefit: This method achieves the "best of both worlds": the high zero-shot performance and improved output formatting from finetuning, combined with the adaptive capabilities of in-context learning.
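As a concrete illustration of the data construction above, here is a minimal Python sketch. The function name `build_prompt` and the language-pair defaults are hypothetical, and the template follows the Format 2 shape reproduced earlier (exact wording may differ from the paper's):

```python
import random

# For each training pair, sample k in {0..5} demonstrations from a held-out
# pool and prepend them to the instruction, so the finetuning data mixes
# zero-shot (k=0) and few-shot instances.
def build_prompt(src, tgt, pool, src_lang="English", tgt_lang="German"):
    k = random.randint(0, 5)
    parts = []
    if k > 0:
        parts.append(f"Consider the following {k} translations "
                     f"from {src_lang} to {tgt_lang}.")
        for i, (ex_src, ex_tgt) in enumerate(random.sample(pool, k), 1):
            parts.append(f"Example {i}\nSource: {ex_src}\nTarget: {ex_tgt}")
    parts.append(f"Translate the source text from {src_lang} to {tgt_lang}.")
    parts.append(f"Source: {src}\nTarget: {tgt}")  # target supervises training
    return "\n".join(parts)
```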
- Analysis of In-Context Example Influence:
- While the proposed finetuning method restores the average benefit of few-shot examples, the effect per instance varies.
- Pros: Few-shot examples can correct significant errors, such as translating into the wrong language (See Table 9, Appendix E).
- Cons: Sometimes, providing few-shot examples (even to the model finetuned with examples) can introduce hallucinations or degrade an otherwise correct zero-shot translation (See Table 10, Appendix E).
- Mitigation: Finetuning with few-shot examples significantly reduces the rate of these few-shot-induced hallucinations compared to models finetuned without examples, but does not eliminate them completely (See Table 1). This suggests the training helps the model become more robust but not perfectly immune to misleading examples.
Practical Summary for Implementation:
- Use LoRA for efficiently finetuning LLMs like LLaMA for MT tasks. It offers performance comparable to full finetuning at a fraction of the computational cost.
- Standard LoRA finetuning improves zero-shot MT quality and fixes overgeneration but harms the ability to adapt using few-shot examples at inference.
- To retain adaptability, mix few-shot examples into the LoRA finetuning data. Randomly include 0-5 examples in the instruction prompt for each training instance.
- This combined approach yields a model that performs well zero-shot, does not overgenerate, and can effectively leverage few-shot examples provided at inference time for domain/style adaptation (an inference sketch follows this list).
- Be aware that even with this improved finetuning, providing few-shot examples at inference can occasionally introduce errors or hallucinations, although the rate is reduced.
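For completeness, an inference-time sketch under the same assumptions as the training sketch above (the helper name `translate` is hypothetical). Because the finetuned model emits EOS on its own, the decoded continuation is the translation with no newline-based truncation:

```python
import torch

# Reuse the training template with the target left blank; everything the
# model generates after "Target:" up to EOS is taken as the translation.
def translate(model, tokenizer, src, examples,
              src_lang="English", tgt_lang="German"):
    parts = []
    if examples:  # zero or more (src, tgt) demonstration pairs
        parts.append(f"Consider the following {len(examples)} translations "
                     f"from {src_lang} to {tgt_lang}.")
        for i, (ex_src, ex_tgt) in enumerate(examples, 1):
            parts.append(f"Example {i}\nSource: {ex_src}\nTarget: {ex_tgt}")
    parts.append(f"Translate the source text from {src_lang} to {tgt_lang}.")
    parts.append(f"Source: {src}\nTarget:")
    inputs = tokenizer("\n".join(parts), return_tensors="pt").to(model.device)
    with torch.no_grad():
        out = model.generate(**inputs, max_new_tokens=256)
    new_tokens = out[0][inputs["input_ids"].shape[1]:]  # strip the prompt
    return tokenizer.decode(new_tokens, skip_special_tokens=True).strip()
```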
The paper provides detailed hyperparameters (Appendix A.2) and evaluation results across multiple metrics and datasets (Appendix G) for LLaMA 7B and 13B on 10 language pairs (mostly English-centric).