Examining Forgetting in Continual Pre-training of Aligned Large Language Models (2401.03129v1)

Published 6 Jan 2024 in cs.CL

Abstract: Recent advances in LLMs have exhibited remarkable proficiency across various tasks. Given the potent applications of LLMs in numerous fields, there has been a surge in LLM development. In developing LLMs, a common practice involves continual pre-training on previously fine-tuned models. However, this can lead to catastrophic forgetting. In our work, we investigate the phenomenon of forgetting that occurs during continual pre-training on an existing fine-tuned LLM. We evaluate the impact of continuous pre-training on the fine-tuned LLM across various dimensions, including output format, knowledge, and reliability. Experiment results highlight the non-trivial challenge of addressing catastrophic forgetting during continual pre-training, especially the repetition issue.

Examining Forgetting in Continual Pre-training of Aligned LLMs

This paper investigates the phenomenon of catastrophic forgetting during the continual pre-training of fine-tuned LLMs. The research highlights both practical and theoretical implications for LLM development, with a specific focus on the impact of continual pre-training using a Traditional Chinese corpus.

Context and Motivation

As the capabilities of LLMs advance, there has been a significant increase in the release of pre-trained and fine-tuned variants. These models often undergo further pre-training, frequently followed by alignment operations such as Supervised Fine-Tuning (SFT) and Reinforcement Learning from Human Feedback (RLHF). Despite its potential benefits, continually pre-training an already aligned model can lead to catastrophic forgetting, where the model loses previously acquired capabilities.

Methodology

The paper centers on Llama-2-7b-chat, a model fine-tuned with various alignment techniques, and evaluates the influence of additional pre-training on this model. By examining dimensions like output format, knowledge, and reliability, the authors explore different techniques to mitigate forgetting (a minimal sketch of how these settings could be configured follows the list):

  1. Freeze Layers: Selectively freezing the first or last ten layers of the model.
  2. Freeze Modules: Experimenting with freezing specific modules, such as self-attention or feed-forward modules, to preserve previously acquired knowledge.
  3. Adapters: Incorporating additional trainable elements like LoRA and (IA)³ to facilitate parameter-efficient continual pre-training.
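
The summary does not include the authors' training code; the following is a minimal sketch of how the three settings above could be configured with Hugging Face Transformers and PEFT. The checkpoint name, LoRA rank, and target modules are illustrative assumptions rather than the paper's exact configuration, and each numbered block is an alternative setting, not a step to combine in a single run.

```python
# Minimal sketch (not the authors' code) of the three mitigation settings,
# assuming Hugging Face Transformers + PEFT and the public
# meta-llama/Llama-2-7b-chat-hf checkpoint. Apply one setting per run.
import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-chat-hf", torch_dtype=torch.bfloat16
)

# 1) Freeze Layers: freeze the first ten decoder blocks
#    (use model.model.layers[-10:] to freeze the last ten instead).
for block in model.model.layers[:10]:
    for param in block.parameters():
        param.requires_grad = False

# 2) Freeze Modules: freeze only the self-attention modules
#    (swap block.self_attn for block.mlp to freeze the feed-forward modules).
for block in model.model.layers:
    for param in block.self_attn.parameters():
        param.requires_grad = False

# 3) Adapters: train only LoRA parameters on top of the frozen base model;
#    (IA)^3 would be configured analogously via peft.IA3Config.
#    Rank and target modules here are illustrative, not the paper's values.
lora_config = LoraConfig(
    r=8, lora_alpha=16, target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM"
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```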

Experimental Setup

The investigation uses a dataset of one billion Traditional Chinese tokens for further pre-training. For evaluation, the model's performance is analyzed through various tasks (a rough sketch of the output-format checks follows the list):

  • Language identification and repetition analysis to assess the output format.
  • Benchmarks such as ARC, HellaSwag, MMLU, and C-eval-tw for knowledge assessment.
  • Truthfulness, toxicity, and bias metrics to evaluate reliability.
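
As a rough illustration of the output-format checks (not the paper's exact protocol), language identification can be done with an off-the-shelf fastText language-ID model and repetition can be scored with a simple duplicate n-gram ratio; the model file, n-gram size, and character-level tokenization below are assumptions.

```python
# Rough sketch of output-format checks: fastText language ID plus a simple
# duplicate n-gram ratio. Not the paper's exact metrics.
import fasttext

# Public fastText language-identification model (lid.176.bin), downloaded separately.
lid_model = fasttext.load_model("lid.176.bin")

def detect_language(text: str) -> str:
    """Return the predicted language code (e.g. 'en', 'zh') for one generation."""
    labels, _ = lid_model.predict(text.replace("\n", " "))  # predict() rejects newlines
    return labels[0].replace("__label__", "")

def repetition_rate(text: str, n: int = 8) -> float:
    """Fraction of character n-grams that are duplicates; higher means more repetitive.
    Character n-grams avoid the need for word segmentation in Chinese."""
    ngrams = [text[i:i + n] for i in range(len(text) - n + 1)]
    if not ngrams:
        return 0.0
    return 1.0 - len(set(ngrams)) / len(ngrams)

generation = "模型不斷重複同樣的句子。模型不斷重複同樣的句子。"
print(detect_language(generation), round(repetition_rate(generation), 3))
```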

Results

Output Format

The analysis reveals that models pre-trained with the Traditional Chinese corpus often exhibit increased repetition, particularly when generating Chinese outputs. Furthermore, the different adaptation techniques shift the proportion of languages in the generated output, with the effect depending on the language of the prompt.

Knowledge

Most continually pre-trained models show improved or maintained performance relative to their non-pre-trained counterparts on ARC and HellaSwag, while subtler differences arise on other benchmarks such as C-eval-tw.
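
Benchmark numbers of this kind are typically produced with an evaluation harness; the sketch below assumes the EleutherAI lm-evaluation-harness (v0.4+ Python API) and an illustrative checkpoint name, not the paper's exact tooling or settings (C-eval-tw, for instance, is not a built-in task and would require a custom task definition).

```python
# Minimal sketch (assumed setup, not the paper's exact harness or settings):
# zero-shot evaluation of a continually pre-trained checkpoint on ARC,
# HellaSwag, and MMLU with the EleutherAI lm-evaluation-harness.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=meta-llama/Llama-2-7b-chat-hf,dtype=bfloat16",
    tasks=["arc_challenge", "hellaswag", "mmlu"],
    num_fewshot=0,
    batch_size=8,
)
# Per-task scores live under results["results"], e.g. results["results"]["hellaswag"].
print(results["results"])
```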

Reliability

Models undergoing continual pre-training exhibit a decline in reliability metrics compared to the original Llama-2-7b-chat, particularly in truthfulness and toxicity. This raises concerns about preserving safety alignment when further training aligned LLMs.

Implications

The findings indicate that while continual pre-training can enhance or preserve certain aspects of the model's knowledge, it poses challenges for output quality and reliability. The repetition problem is most pronounced when generating output in the language targeted by continual pre-training, underscoring this trade-off.

Future Directions

The paper suggests several avenues for future work, such as exploring pre-training with multilingual datasets and developing methodologies to ensure safety alignment. As the deployment of LLMs becomes more pervasive, these considerations will be crucial for balancing performance with the necessity for reliable and safe outputs.

This paper makes a meaningful contribution to understanding the nuances of continual pre-training in LLMs, offering insights into mitigating the risks of catastrophic forgetting and highlighting areas needing further research and technological advancement.

Authors (2)
  1. Chen-An Li
  2. Hung-yi Lee