PromptIntern: Saving Inference Costs by Internalizing Recurrent Prompt during Large Language Model Fine-tuning (2407.02211v2)

Published 2 Jul 2024 in cs.CL, cs.AI, and cs.LG

Abstract: Recent advances in fine-tuning LLMs have greatly enhanced their usage in domain-specific tasks. Despite the success, fine-tuning continues to rely on repeated and lengthy prompts, which escalate computational expenses, require more resources, and lead to slower inference. In this paper, we present a novel approach, PromptIntern, which internalizes prompt knowledge during model fine-tuning to achieve efficient inference and save costs. Instead of compressing the prompts for a vanilla model, PromptIntern aims to embed the recurrent prompt directly into the model parameters. We design a fine-tuning pipeline that includes instruction template compression, few-shot example absorption, and a progressive internalization strategy, effectively diminishing the need for intricate prompts during inference. Comprehensive experiments on challenging NL2Code tasks demonstrate that our method reduces input tokens by more than 90%, accelerates inference by 4.2 times, and reduces monetary inference costs by 88.3%.
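
To make the progressive internalization idea concrete, below is a minimal Python sketch of the kind of schedule the abstract describes: during fine-tuning, the recurrent prompt (compressed instruction template plus few-shot examples) is gradually removed from each training input, so the model eventually learns to answer from the query alone. All names here (PromptParts, build_training_input, the linear decay schedule) are illustrative assumptions, not the authors' actual implementation.

```python
# Illustrative sketch of progressive prompt internalization (assumed design,
# not the paper's code): the recurrent prompt is decayed out of the training
# input over the course of fine-tuning.

from dataclasses import dataclass


@dataclass
class PromptParts:
    template: str          # compressed instruction template
    examples: list[str]    # few-shot examples to be absorbed
    query: str             # task-specific user query


def internalization_ratio(step: int, total_steps: int) -> float:
    """Fraction of the recurrent prompt kept at a given training step.

    Linearly decays from 1.0 (full prompt) to 0.0 (query only).
    """
    return max(0.0, 1.0 - step / total_steps)


def build_training_input(parts: PromptParts, step: int, total_steps: int) -> str:
    """Assemble the fine-tuning input for one step under the decay schedule."""
    ratio = internalization_ratio(step, total_steps)

    # Keep a shrinking prefix of the (already compressed) instruction template.
    kept_template = parts.template[: int(len(parts.template) * ratio)]

    # Absorb few-shot examples by dropping them one by one as training proceeds.
    kept_examples = parts.examples[: int(len(parts.examples) * ratio)]

    segments = [kept_template, *kept_examples, parts.query]
    return "\n".join(s for s in segments if s)


if __name__ == "__main__":
    parts = PromptParts(
        template="Translate the natural-language request into a bash command.",
        examples=["Request: list files -> ls", "Request: show disk usage -> df -h"],
        query="Request: count lines in log.txt ->",
    )
    total = 1000
    for step in (0, 500, 1000):  # start, midpoint, end of fine-tuning
        print(f"--- step {step} ---")
        print(build_training_input(parts, step, total))
```

At step 0 the model sees the full template and examples; by the final step it sees only the query, which is what enables the reported reduction in inference-time input tokens.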

Authors (5)
  1. Jiaru Zou
  2. Mengyu Zhou
  3. Tao Li
  4. Shi Han
  5. Dongmei Zhang
Citations (6)