
LLoCO: Learning Long Contexts Offline (2404.07979v2)

Published 11 Apr 2024 in cs.CL, cs.AI, and cs.LG

Abstract: Processing long contexts remains a challenge for LLMs due to the quadratic computational and memory overhead of the self-attention mechanism and the substantial KV cache sizes during generation. We propose LLoCO, a novel approach to address this problem by learning contexts offline through context compression and in-domain parameter-efficient finetuning with LoRA. Our method enables an LLM to create a concise representation of the original context and efficiently retrieve relevant information to answer questions accurately. Our approach extends the effective context window of a 4k token LLaMA2-7B model to handle up to 128k tokens. We evaluate our approach on several long-context question-answering datasets, demonstrating that LLoCO significantly outperforms in-context learning while using $30\times$ fewer tokens during inference. LLoCO achieves up to $7.62\times$ speed-up during inference and $11.52\times$ higher throughput during finetuning, substantially reducing the cost of long document question answering. This makes it a promising solution for efficient long context processing. Our code is publicly available at https://github.com/jeffreysijuntan/lloco.

Extending LLMs' Capacity for Long-Context Tasks via LLoCO

Introduction to LLoCO

The rapid progress of LLMs has brought significant advances in understanding and generating human-like text, and these models hold particular promise for tasks that require comprehension of extensive documents, such as long-document question answering (QA). However, standard LLMs struggle to process texts beyond a few thousand tokens, because the compute and memory costs of self-attention grow quadratically with sequence length and the KV cache grows with the context during generation. Addressing this, the paper introduces LLoCO (Learning Long Contexts Offline), a pipeline designed to significantly extend the effective context window of LLMs, demonstrated on a LLaMA2-7B model.

LLoCO's Approach to Long-Context Processing

LLoCO's methodology is underpinned by three core strategies: context compression, retrieval, and parameter-efficient finetuning. Here is how each component contributes to the pipeline (a rough sketch of how the pieces fit together follows the list):

  1. Context Compression: The approach begins by encoding extensive texts into denser, more manageable representations. This compression is achieved through a context encoder, which processes the original context and produces a set of summary embeddings that encapsulate the key information in a much-reduced form.
  2. Retrieval Mechanism: Useful for long-context QA, this facet involves retrieving compressed document representations pertinent to the user's query. It highlights LLoCO's ability to efficiently navigate and leverage concise context representations during the inference phase.
  3. Parameter-Efficient Finetuning: Post-compression, LLoCO employs Low-Rank Adaptation (LoRA) to finetune the model in a manner that's both effective and frugal in parameter adjustments. This step is crucial for refining the model's ability to accurately interpret and utilize the compressed contexts.
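To make the division of labor concrete, below is a minimal, self-contained sketch of an LLoCO-style pipeline. It is illustrative only: the `embed` and `compress` functions, the adapter naming, and the in-memory index are hypothetical placeholders standing in for the paper's learned context encoder, LoRA adapters, and retriever, and the toy dimensions bear no relation to the real system.

```python
# Minimal sketch of an LLoCO-style pipeline (hypothetical names, not the paper's API).
# Offline: each long document is "compressed" into a few summary vectors and paired
# with an (imagined) LoRA adapter id. Online: the query is embedded, the closest
# compressed document is retrieved, and its summary vectors plus adapter would be
# handed to the decoder instead of the raw long context.
import zlib
import numpy as np

DIM = 64        # embedding width (toy value)
N_SUMMARY = 4   # summary vectors kept per document (stand-in for heavy compression)

def embed(text: str) -> np.ndarray:
    """Placeholder embedding: a deterministic pseudo-random vector keyed on the text."""
    seed = zlib.crc32(text.encode())
    return np.random.default_rng(seed).normal(size=DIM)

def compress(document: str) -> np.ndarray:
    """Placeholder context encoder: chunk the document and map each chunk to one
    summary vector (the paper instead uses a learned encoder producing summary embeddings)."""
    chunks = [document[i:i + 200] for i in range(0, len(document), 200)][:N_SUMMARY]
    return np.stack([embed(c) for c in chunks])

# Offline phase: compress and index each document, tagging the LoRA adapter
# finetuned on its domain.
corpus = {
    "doc_finance": "a very long financial report ..." * 50,
    "doc_novel":   "a very long novel used for NarrativeQA-style QA ..." * 50,
}
index = {
    name: {"summary": compress(text), "adapter": f"lora-{name}"}
    for name, text in corpus.items()
}

# Online phase: retrieve the best-matching compressed context for a query.
def retrieve(query: str):
    q = embed(query)
    def best_sim(entry):
        s = entry["summary"]
        sims = (s @ q) / (np.linalg.norm(s, axis=1) * np.linalg.norm(q) + 1e-8)
        return sims.max()
    doc_id = max(index, key=lambda name: best_sim(index[name]))
    return doc_id, index[doc_id]

doc_id, hit = retrieve("What happens to the protagonist at the end?")
print(doc_id, hit["summary"].shape, hit["adapter"])
# The decoder would then consume hit["summary"] (a handful of vectors rather than
# tens of thousands of tokens) with hit["adapter"] applied.
```

The point the sketch tries to capture is the offline/online split: the expensive work of reading the full document happens once, ahead of time, while the inference-time path only touches a few compressed vectors and a small adapter.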

The combination of these strategies enables LLaMA2-7B to handle up to 128k tokens effectively, a considerable leap from its original 4k-token window. Notably, LLoCO achieves this extension while substantially outperforming in-context learning and using roughly 30× fewer tokens during inference.
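For intuition on why the LoRA component (step 3 above) is parameter-frugal, the short sketch below compares a full update of a single d×d projection matrix with a rank-r update of the same matrix. The hidden size, rank, and scaling factor here are illustrative placeholders, not the configuration reported in the paper.

```python
# Back-of-the-envelope look at why a LoRA-style update is cheap: compare the
# parameters touched by full finetuning of one d x d projection matrix with a
# rank-r update W + (alpha / r) * B @ A. Values are illustrative placeholders.
import numpy as np

d, r, alpha = 1024, 16, 32            # toy hidden size, LoRA rank, LoRA scaling

rng = np.random.default_rng(0)
W = rng.normal(size=(d, d))           # frozen pretrained weight
A = rng.normal(size=(r, d)) * 0.01    # trainable low-rank factor
B = np.zeros((d, r))                  # B starts at zero, so the initial update is zero

delta_W = (alpha / r) * (B @ A)       # the low-rank update that training would learn
W_adapted = W + delta_W               # effective weight used at inference

full_params = W.size                  # parameters a full finetune would touch
lora_params = A.size + B.size         # parameters LoRA actually trains
print(f"full: {full_params:,}  lora: {lora_params:,}  "
      f"ratio: {full_params / lora_params:.0f}x")   # ~32x fewer here; far more at real sizes
```

Per projection matrix, the trainable parameter count drops from d² to 2·d·r, which is why keeping a separate adapter per document group remains cheap to train and store.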

Empirical Results

The paper presents a compelling empirical evaluation across several long-context QA datasets. Applied to LLaMA2-7B, LLoCO consistently outperformed baselines that received no context as well as those using traditional in-context learning or retrieval-based methods. On NarrativeQA, for example, LLoCO handled contexts averaging 84,770 tokens and achieved high F1 scores by compressing them into roughly 2,600 tokens, a reduction of about 84,770 / 2,600 ≈ 33×, in line with the roughly 30× inference-time token savings reported overall.

Theoretical and Practical Implications

LLoCO's approach opens new avenues for enhancing LLMs' performance on long-context tasks. Theoretically, it offers a framework that decouples how much context a model can draw on from how many tokens it must attend to at inference time, paving the way for research into more efficient context-processing methods. Practically, the demonstrated ability to speed up inference while reducing computational cost has direct implications for deploying LLMs in real-world applications where long-context processing is essential.

Future Directions

While LLoCO marks a significant step forward, the paper also acknowledges room for further improvement. Future research might optimize the context compression stage to improve the quality and efficiency of the compressed representations, and advances in parameter-efficient finetuning could further sharpen the model's ability to extract and use knowledge from those representations. Finally, integrating LLoCO with newer LLM architectures could further amplify its long-context processing capabilities.

Conclusion

In summary, LLoCO presents a robust and efficient solution to the persistent challenge of long-context processing in LLMs. By marrying context compression with intelligent retrieval and finetuning strategies, it not only extends the effective context window of existing models but also sets a benchmark for future innovations in the field of generative AI and LLMs. The open-source availability of LLoCO's codebase invites the wider research community to build upon, refine, and extend its capabilities, promising exciting developments ahead in the domain of long-context comprehension.

Authors (8)
  1. Sijun Tan
  2. Xiuyu Li
  3. Shishir Patil
  4. Ziyang Wu
  5. Tianjun Zhang
  6. Kurt Keutzer
  7. Joseph E. Gonzalez
  8. Raluca Ada Popa