Towards Anytime Fine-tuning: Continually Pre-trained Language Models with Hypernetwork Prompt (2310.13024v1)
Abstract: Continual pre-training has become urgent for adapting a pre-trained model to a multitude of domains and tasks in a fast-evolving world. In practice, a continually pre-trained model is expected not only to show greater capacity when fine-tuned on domains it has been pre-trained on, but also to maintain non-decreasing performance on unseen domains. In this work, we first investigate the anytime fine-tuning effectiveness of existing continual pre-training approaches and find that all of them degrade performance on unseen domains. To this end, we propose a prompt-guided continual pre-training method, in which a hypernetwork is trained to generate domain-specific prompts with both an agreement loss and a disagreement loss. The agreement loss maximally preserves the pre-trained model's generalization to new domains, while the disagreement loss keeps the hidden states generated for each domain exclusive. Notably, the prompts produced by the hypernetwork reduce reliance on domain identity during fine-tuning and promote knowledge transfer across domains. Our method achieves improvements of 3.57% and 3.4% on two real-world datasets (covering domain shift and temporal shift, respectively), demonstrating its efficacy.
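Below is a minimal PyTorch sketch of the idea sketched in the abstract: a hypernetwork maps a domain embedding to a soft prompt, and training combines an agreement term (pulling prompted hidden states toward the frozen pre-trained model's states) with a disagreement term (pushing hidden states of different domains apart). The module names, dimensions, and the cosine-based loss forms are illustrative assumptions, not the paper's actual implementation.

```python
# Illustrative sketch only; dimensions and loss forms are assumed, not taken from the paper.
import torch
import torch.nn as nn
import torch.nn.functional as F


class PromptHypernetwork(nn.Module):
    """Maps a domain id to a soft prompt of shape (prompt_len, hidden_dim)."""

    def __init__(self, num_domains: int, hidden_dim: int = 768, prompt_len: int = 10):
        super().__init__()
        self.domain_emb = nn.Embedding(num_domains, hidden_dim)
        self.generator = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim),
            nn.Tanh(),
            nn.Linear(hidden_dim, prompt_len * hidden_dim),
        )
        self.prompt_len = prompt_len
        self.hidden_dim = hidden_dim

    def forward(self, domain_id: torch.Tensor) -> torch.Tensor:
        z = self.domain_emb(domain_id)                 # (B, H) domain embedding
        prompts = self.generator(z)                    # (B, L * H) flattened prompt
        return prompts.view(-1, self.prompt_len, self.hidden_dim)


def agreement_loss(h_prompted: torch.Tensor, h_frozen: torch.Tensor) -> torch.Tensor:
    """Pull prompted hidden states toward the frozen pre-trained model's states,
    preserving generalization (assumed cosine form)."""
    return 1.0 - F.cosine_similarity(h_prompted, h_frozen, dim=-1).mean()


def disagreement_loss(h_domain_a: torch.Tensor, h_domain_b: torch.Tensor) -> torch.Tensor:
    """Push hidden states produced under different domains' prompts apart,
    keeping each domain's representation exclusive (assumed cosine form)."""
    return F.cosine_similarity(h_domain_a, h_domain_b, dim=-1).mean()


if __name__ == "__main__":
    hyper = PromptHypernetwork(num_domains=4)
    prompts = hyper(torch.tensor([0, 1]))              # prompts for two domains
    # Stand-ins for sentence-level hidden states; in practice these come from
    # the prompted language model and a frozen copy of the pre-trained model.
    h_frozen = torch.randn(2, 768)
    h_dom0 = torch.randn(2, 768)
    h_dom1 = torch.randn(2, 768)
    loss = agreement_loss(h_dom0, h_frozen) + disagreement_loss(h_dom0, h_dom1)
    print(prompts.shape, loss.item())
```

At fine-tuning time, the generated prompt would be prepended to the input so that no explicit domain label is needed; the exact mechanism in the paper may differ from this sketch.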