PromptIntern: Saving Inference Costs by Internalizing Recurrent Prompt during Large Language Model Fine-tuning (2407.02211v2)

Published 2 Jul 2024 in cs.CL, cs.AI, and cs.LG

Abstract: Recent advances in fine-tuning LLMs have greatly enhanced their usage in domain-specific tasks. Despite the success, fine-tuning continues to rely on repeated and lengthy prompts, which escalate computational expenses, require more resources, and lead to slower inference. In this paper, we present a novel approach, PromptIntern, which internalizes prompt knowledge during model fine-tuning to achieve efficient inference and save costs. Instead of compressing the prompts for a vanilla model, PromptIntern aims to embed the recurrent prompt directly into the model parameters. We design a fine-tuning pipeline that includes instruction template compression, few-shot example absorption, and a progressive internalization strategy, effectively diminishing the need for intricate prompts during inference. Comprehensive experiments on challenging NL2Code tasks demonstrate that our method reduces input tokens by more than 90%, accelerates inference by 4.2 times, and reduces monetary inference costs by 88.3%.
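
To make the progressive internalization idea concrete, below is a minimal Python sketch of the kind of schedule the abstract describes: during fine-tuning, the recurrent prompt (compressed instruction template plus few-shot examples) is gradually removed from each training input, so the model eventually learns to answer from the query alone. All names here (PromptParts, build_training_input, the linear decay schedule) are illustrative assumptions, not the authors' actual implementation.

```python
# Illustrative sketch of progressive prompt internalization (assumed design,
# not the paper's code): the recurrent prompt is decayed out of the training
# input over the course of fine-tuning.

from dataclasses import dataclass


@dataclass
class PromptParts:
    template: str          # compressed instruction template
    examples: list[str]    # few-shot examples to be absorbed
    query: str             # task-specific user query


def internalization_ratio(step: int, total_steps: int) -> float:
    """Fraction of the recurrent prompt kept at a given training step.

    Linearly decays from 1.0 (full prompt) to 0.0 (query only).
    """
    return max(0.0, 1.0 - step / total_steps)


def build_training_input(parts: PromptParts, step: int, total_steps: int) -> str:
    """Assemble the fine-tuning input for one step under the decay schedule."""
    ratio = internalization_ratio(step, total_steps)

    # Keep a shrinking prefix of the (already compressed) instruction template.
    kept_template = parts.template[: int(len(parts.template) * ratio)]

    # Absorb few-shot examples by dropping them one by one as training proceeds.
    kept_examples = parts.examples[: int(len(parts.examples) * ratio)]

    segments = [kept_template, *kept_examples, parts.query]
    return "\n".join(s for s in segments if s)


if __name__ == "__main__":
    parts = PromptParts(
        template="Translate the natural-language request into a bash command.",
        examples=["Request: list files -> ls", "Request: show disk usage -> df -h"],
        query="Request: count lines in log.txt ->",
    )
    total = 1000
    for step in (0, 500, 1000):  # start, midpoint, end of fine-tuning
        print(f"--- step {step} ---")
        print(build_training_input(parts, step, total))
```

At step 0 the model sees the full template and examples; by the final step it sees only the query, which is what enables the reported reduction in inference-time input tokens.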

Authors (5)
  1. Jiaru Zou
  2. Mengyu Zhou
  3. Tao Li
  4. Shi Han
  5. Dongmei Zhang
Citations (6)