Where does In-context Translation Happen in Large Language Models (2403.04510v1)

Published 7 Mar 2024 in cs.CL and cs.AI

Abstract: Self-supervised LLMs have demonstrated the ability to perform Machine Translation (MT) via in-context learning, but little is known about where in the model the task is performed with respect to the prompt instructions and demonstration examples. In this work, we attempt to characterize the region where LLMs transition from in-context learners to translation models. Through a series of layer-wise context-masking experiments on GPTNeo2.7B, Bloom3B, Llama7b, and Llama7b-chat, we demonstrate evidence of a "task recognition" point where the translation task is encoded into the input representations and attention to the context is no longer necessary. We further observe a correspondence between the task recognition layers and the layers where masking out entire layers causes low performance. Taking advantage of this redundancy yields 45% computational savings when prompting with 5 examples, with task recognition achieved at layer 14 of 32. Our layer-wise fine-tuning experiments indicate that the most effective layers for MT fine-tuning are the layers critical to task recognition.
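
The core technique described in the abstract, layer-wise context masking, can be illustrated with a short sketch. The code below is not the authors' implementation: it uses a toy decoder stack rather than GPTNeo2.7B, Bloom3B, or Llama7b, and every module, function, and parameter name here (MaskedSelfAttnBlock, build_masks, forward_with_context_masking, ctx_len, mask_from_layer) is an illustrative assumption. It only shows the idea that, from a chosen layer onward, tokens after the prompt (instructions plus demonstrations) are blocked from attending back to that context, which is how one can probe for a "task recognition" layer.

```python
# Minimal sketch of layer-wise context masking (assumed toy setup, not the paper's code).
# Layers below `mask_from_layer` attend normally; from that layer onward, positions after
# the in-context prompt can no longer attend to the prompt positions.
import torch
import torch.nn as nn

class MaskedSelfAttnBlock(nn.Module):
    """One pre-norm decoder block: self-attention followed by an MLP."""
    def __init__(self, d_model=64, n_heads=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                 nn.Linear(4 * d_model, d_model))

    def forward(self, x, attn_mask):
        h = self.norm1(x)
        a, _ = self.attn(h, h, h, attn_mask=attn_mask, need_weights=False)
        x = x + a
        return x + self.mlp(self.norm2(x))

def build_masks(seq_len, ctx_len, device=None):
    """Return (causal mask, causal mask that also blocks attention to the context).
    True entries are positions a query may NOT attend to."""
    causal = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool, device=device),
                        diagonal=1)
    ctx_masked = causal.clone()
    ctx_masked[ctx_len:, :ctx_len] = True  # tokens after the context cannot see the context
    return causal, ctx_masked

def forward_with_context_masking(blocks, x, ctx_len, mask_from_layer):
    """Run the stack; layers >= mask_from_layer receive the context-blocked mask."""
    causal, ctx_masked = build_masks(x.size(1), ctx_len, device=x.device)
    for i, block in enumerate(blocks):
        mask = ctx_masked if i >= mask_from_layer else causal
        x = block(x, attn_mask=mask)
    return x

# Toy usage: 8 layers, a 12-token prompt inside a 20-token input, masking from layer 4 onward.
blocks = nn.ModuleList([MaskedSelfAttnBlock() for _ in range(8)])
x = torch.randn(1, 20, 64)  # (batch, sequence length, d_model)
out = forward_with_context_masking(blocks, x, ctx_len=12, mask_from_layer=4)
print(out.shape)  # torch.Size([1, 20, 64])
```

In the paper's setting the analogous masks would be applied inside a pretrained model's attention layers; here the schedule is simply a per-layer choice between a plain causal mask and a context-blocked causal mask, swept over mask_from_layer to locate where masking stops hurting translation quality.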

