
SelfCP: Compressing Over-Limit Prompt via the Frozen Large Language Model Itself (2405.17052v2)

Published 27 May 2024 in cs.CL

Abstract: Long prompts lead to huge hardware costs when using transformer-based LLMs. Unfortunately, many tasks, such as summarization, inevitably involve long documents, and the wide application of in-context learning easily makes the prompt length explode. This paper proposes a Self-Compressor (SelfCP), which employs the target LLM itself to compress over-limit prompts into dense vectors while keeping the allowed prompts unmodified. The dense vectors are then projected into dense tokens via a learnable connector so that the same LLM can understand them without extra burden. The connector is supervised-tuned under the language modeling objective of the LLM on relatively long texts selected from publicly accessible datasets, including an instruction dataset so that SelfCP responds to various prompts, while the target LLM remains frozen during training. We build the lightweight SelfCP upon 2 different backbones with merely 17M learnable parameters originating from the connector and a learnable embedding. Evaluations on both English and Chinese benchmarks demonstrate that SelfCP effectively substitutes 12$\times$ over-limit prompts with dense tokens, reducing memory costs and boosting inference throughput while improving response quality. This strong performance offers an efficient solution for LLMs to tackle long prompts without training LLMs from scratch.
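To make the compress-then-reuse idea concrete, below is a minimal sketch of the pipeline the abstract describes: the frozen LLM encodes the over-limit portion of the prompt, a small learnable connector projects the resulting hidden states into "dense tokens" in the LLM's input-embedding space, and the same frozen LLM then reads those dense tokens together with the unmodified allowed prompt. This is not the authors' implementation; the backbone name, the `Connector` class, the `COMPRESSION_RATIO` constant, and the subsampling of hidden states are all illustrative assumptions layered on a standard Hugging Face causal LM API.

```python
# Minimal sketch of the SelfCP idea (assumptions noted above), not the authors' code.
import torch
import torch.nn as nn
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Llama-2-7b-hf"  # any causal LM backbone (assumption)
COMPRESSION_RATIO = 12                   # 12x compression, matching the abstract

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
llm = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
llm.requires_grad_(False)  # target LLM stays frozen; only the connector is trained

hidden_size = llm.config.hidden_size


class Connector(nn.Module):
    """Learnable projection from the frozen LLM's hidden states to dense tokens
    that live in the same space as the LLM's input embeddings."""

    def __init__(self, hidden_size: int):
        super().__init__()
        self.proj = nn.Linear(hidden_size, hidden_size)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        return self.proj(hidden_states)


connector = Connector(hidden_size)  # the small set of trainable parameters


def compress(over_limit_ids: torch.Tensor) -> torch.Tensor:
    """Encode the over-limit prompt with the frozen LLM and keep a reduced set of
    hidden states (one per COMPRESSION_RATIO-sized chunk, an illustrative choice)."""
    with torch.no_grad():
        out = llm(over_limit_ids, output_hidden_states=True)
    last_hidden = out.hidden_states[-1]                              # (1, L, H)
    memory = last_hidden[:, COMPRESSION_RATIO - 1 :: COMPRESSION_RATIO, :]
    return connector(memory)                                         # (1, L // r, H)


def generate(over_limit_text: str, allowed_text: str, max_new_tokens: int = 64) -> str:
    over_ids = tokenizer(over_limit_text, return_tensors="pt").input_ids
    allowed_ids = tokenizer(allowed_text, return_tensors="pt").input_ids

    dense_tokens = compress(over_ids)                    # compressed over-limit prefix
    allowed_embeds = llm.get_input_embeddings()(allowed_ids)

    # The same frozen LLM reads [dense tokens ; unmodified allowed prompt].
    inputs_embeds = torch.cat([dense_tokens, allowed_embeds], dim=1)
    out_ids = llm.generate(inputs_embeds=inputs_embeds, max_new_tokens=max_new_tokens)
    return tokenizer.decode(out_ids[0], skip_special_tokens=True)
```

In this sketch only the connector (and, per the paper, a learnable embedding) would receive gradients during supervised tuning under the language modeling objective, which is what keeps the trainable footprint around 17M parameters while the backbone stays untouched.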

Authors (3)
  1. Jun Gao (267 papers)
  2. Ziqiang Cao (34 papers)
  3. Wenjie Li (183 papers)
Citations (3)