PromptIntern: Saving Inference Costs by Internalizing Recurrent Prompt during Large Language Model Fine-tuning (2407.02211v2)
Abstract: Recent advances in fine-tuning LLMs have greatly enhanced their performance on domain-specific tasks. Despite this success, fine-tuning continues to rely on repeated, lengthy prompts, which increase computational expense, require more resources, and slow inference. In this paper, we present a novel approach, PromptIntern, which internalizes prompt knowledge during model fine-tuning to achieve efficient inference and save costs. Instead of compressing prompts for a vanilla model, PromptIntern embeds the recurrent prompt directly into the model parameters. We design a fine-tuning pipeline that includes instruction template compression, few-shot example absorption, and a progressive internalization strategy, effectively diminishing the need for intricate prompts during inference. Comprehensive experiments on challenging NL2Code tasks demonstrate that our method reduces input tokens by more than 90%, accelerates inference by 4.2x, and cuts monetary inference costs by 88.3%.
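The abstract describes a progressive internalization pipeline in which the instruction template and few-shot examples are gradually removed from the training prompts so that, by the end of fine-tuning, the model answers bare queries. The sketch below illustrates one plausible form of such a schedule; it is not the paper's implementation, and the helper names (`build_schedule`, `make_prompt`, `InternalizationStage`) and the linear decay are assumptions for illustration only.

```python
# Minimal sketch of a progressive internalization schedule, assuming a generic
# supervised fine-tuning loop. Helper names and the linear decay policy are
# hypothetical, not the authors' actual API.

from dataclasses import dataclass


@dataclass
class InternalizationStage:
    template_ratio: float  # fraction of the instruction template kept in the prompt
    num_shots: int         # number of few-shot examples kept in the prompt


def build_schedule(total_epochs: int, max_shots: int) -> list[InternalizationStage]:
    """Linearly decay both the template retention ratio and the few-shot count
    so that the final epoch trains on the bare query alone."""
    stages = []
    for epoch in range(total_epochs):
        progress = epoch / max(total_epochs - 1, 1)
        stages.append(
            InternalizationStage(
                template_ratio=round(1.0 - progress, 2),
                num_shots=round(max_shots * (1.0 - progress)),
            )
        )
    return stages


def make_prompt(query: str, template: str, examples: list[str],
                stage: InternalizationStage) -> str:
    """Assemble a training prompt for one stage: a (possibly truncated)
    template, a shrinking set of few-shot examples, then the query."""
    kept_template = template[: int(len(template) * stage.template_ratio)]
    kept_examples = examples[: stage.num_shots]
    return "\n".join(filter(None, [kept_template, *kept_examples, query]))


if __name__ == "__main__":
    template = "You are an expert NL2Code assistant. Follow the spec strictly."
    examples = [
        "Q: add two ints\nA: def add(a, b): return a + b",
        "Q: reverse a string\nA: def rev(s): return s[::-1]",
    ]
    for stage in build_schedule(total_epochs=4, max_shots=len(examples)):
        prompt = make_prompt("Q: check if n is prime", template, examples, stage)
        # In a real pipeline, each epoch's fine-tuning batch would be built from
        # prompts like this; here we only show how the prompt shrinks per stage.
        print(f"ratio={stage.template_ratio:.2f} shots={stage.num_shots} "
              f"prompt_len={len(prompt)}")
```

At inference time the model fine-tuned under such a schedule would receive only the query, which is what yields the token and latency savings reported in the abstract.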
Authors: Jiaru Zou, Mengyu Zhou, Tao Li, Shi Han, Dongmei Zhang