
FocusLLM: Scaling LLM's Context by Parallel Decoding (2408.11745v1)

Published 21 Aug 2024 in cs.CL and cs.AI

Abstract: Empowering LLMs with the ability to utilize useful information from a long context is crucial for many downstream applications. However, achieving long context lengths with the conventional transformer architecture requires substantial training and inference resources. In this paper, we present FocusLLM, a framework designed to extend the context length of any decoder-only LLM, enabling the model to focus on relevant information from very long sequences. FocusLLM processes long text inputs by dividing them into chunks based on the model's original context length to alleviate the issue of attention distraction. Then, it appends the local context to each chunk as a prompt to extract essential information from each chunk based on a novel parallel decoding mechanism, and ultimately integrates the extracted information into the local context. FocusLLM stands out for great training efficiency and versatility: trained with an 8K input length with much less training cost than previous methods, FocusLLM exhibits superior performance across downstream long-context tasks and maintains strong language modeling ability when handling extensive long texts, even up to 400K tokens. Our code is available at https://github.com/leezythu/FocusLLM.

FocusLLM: Scaling LLM's Context by Parallel Decoding - An Evaluation

The paper "FocusLLM: Scaling LLM's Context by Parallel Decoding" by Zhenyu Li et al. addresses an imperative challenge in the domain of LLMs: extending the context length. The paper proposes a novel framework, FocusLLM, which enhances LLMs' ability to manage and utilize information spanning extensive context lengths efficiently.

The ability to handle long context sequences is crucial for numerous applications, including document summarization, question answering, and generating coherent long-form content. Traditional transformer architectures struggle here because of their quadratic computational complexity in sequence length and their poor extrapolation to sequences longer than those seen during training. The scarcity of high-quality long-text training data further exacerbates the difficulty.

Methodology

FocusLLM extends the context length of decoder-only LLMs by segmenting long inputs into chunks no longer than the base model's original context window, which alleviates attention distraction. The local context is then appended to each chunk as a prompt, a parallel decoding mechanism extracts essential information from every chunk simultaneously, and the extracted information is finally integrated back into the local context.
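
The minimal Python sketch below illustrates this pipeline under simplifying assumptions: the chunk size, the toy_decoder placeholder, and the concatenation-based integration step are illustrative stand-ins, not the paper's actual implementation.

```python
# Illustrative sketch of FocusLLM-style chunked, parallel context processing.
# toy_decoder is a hypothetical placeholder for a frozen decoder-only LLM.

from concurrent.futures import ThreadPoolExecutor
from typing import List

CHUNK_LEN = 8   # stands in for the base model's original context window
LOCAL_LEN = 4   # length of the trailing "local context" appended to every chunk

def toy_decoder(tokens: List[str]) -> str:
    """Placeholder for the frozen decoder: returns one candidate token per chunk.
    It simply echoes the last memory token so the sketch stays runnable."""
    return tokens[-LOCAL_LEN - 1]

def split_into_chunks(memory_tokens: List[str], chunk_len: int) -> List[List[str]]:
    """Divide the long preceding text into chunks no longer than the context window."""
    return [memory_tokens[i:i + chunk_len] for i in range(0, len(memory_tokens), chunk_len)]

def parallel_decode(memory_tokens: List[str], local_context: List[str]) -> List[str]:
    """Append the local context to every chunk, decode one candidate token per chunk
    in parallel, and integrate the candidates back into the local context."""
    chunks = split_into_chunks(memory_tokens, CHUNK_LEN)
    with ThreadPoolExecutor() as pool:
        candidates = list(pool.map(lambda chunk: toy_decoder(chunk + local_context), chunks))
    return local_context + candidates  # integration step, simplified to concatenation

if __name__ == "__main__":
    memory = [f"tok{i}" for i in range(32)]          # the long preceding text
    local = ["the", "local", "context", "tokens"]    # the model's normal input window
    print(parallel_decode(memory, local))
```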

The architecture of FocusLLM is designed to maximize training efficiency and versatility. Key features include:

  1. Length Scaling: By breaking positional limitations, FocusLLM allows models to handle texts far longer than their original context window.
  2. Training Efficiency: The original model parameters are frozen to retain generalization capabilities, and only a small number of additional trainable parameters are introduced; training requires significantly less compute than previous methods (see the sketch after this list).
  3. Versatility: FocusLLM excels in diverse downstream tasks, including question answering and language modeling over long documents.
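
As a rough illustration of the training-efficiency point, the PyTorch sketch below freezes a generic Hugging Face causal LM and trains only a single added projection layer. The FocusStyleWrapper class and the candidate_proj layer are hypothetical stand-ins for FocusLLM's small set of new parameters, not the paper's actual modules.

```python
# Minimal sketch: freeze the pretrained decoder, train only a small added module.
import torch
import torch.nn as nn
from transformers import AutoModelForCausalLM

class FocusStyleWrapper(nn.Module):
    def __init__(self, model_name: str = "gpt2"):
        super().__init__()
        self.base = AutoModelForCausalLM.from_pretrained(model_name)
        for p in self.base.parameters():           # keep the pretrained weights frozen
            p.requires_grad = False
        hidden = self.base.config.hidden_size      # small trainable projection on top
        self.candidate_proj = nn.Linear(hidden, hidden)

    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        hidden_states = self.base(input_ids, output_hidden_states=True).hidden_states[-1]
        return self.candidate_proj(hidden_states)  # only this layer receives gradients

model = FocusStyleWrapper()
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable parameter fraction: {trainable / total:.4%}")
```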

Experimental Evaluation

Language Modeling

The evaluation spans multiple datasets, including PG19, Proof-Pile, and CodeParrot, with text lengths ranging from 4K to 128K tokens. FocusLLM achieves and maintains low perplexity across significantly extended sequences, up to 400K tokens. This performance is compared against strong baselines such as Positional Interpolation (PI), NTK-aware Scaled RoPE, StreamingLLM, AutoCompressor-6K, YaRN-128K, LongChat-32K, LongAlpaca-16K, LongLlama, and Activation Beacon. The results indicate that FocusLLM not only matches but often surpasses these models, particularly in extremely long contexts, while maintaining a much lower computational and memory footprint.
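
For reference, the perplexity in these comparisons is understood here as the standard exponentiated average negative log-likelihood over the N evaluated tokens (the paper's exact windowing protocol may differ):

$$
\mathrm{PPL} = \exp\!\left(-\frac{1}{N}\sum_{i=1}^{N} \log p_\theta\left(x_i \mid x_{<i}\right)\right)
$$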

Downstream Tasks

The practical efficacy of FocusLLM was further validated on two comprehensive benchmarks: LongBench and ∞-Bench. These benchmarks cover tasks such as question answering, summarization, few-shot learning, and code completion, with average sequence lengths reaching up to 214K tokens in ∞-Bench.

FocusLLM consistently outperformed previous context-scaling methods across diverse metrics, demonstrating superior performance in both language modeling and task-specific evaluations. This highlights its capability for precise understanding and reasoning over extensive text contexts.

Efficiency Considerations

A crucial aspect of FocusLLM's design is its efficiency in terms of memory usage and inference time:

  • Memory Footprint: FocusLLM's memory usage grows only linearly with sequence length, even when chunks are processed sequentially rather than in parallel, significantly reducing the overhead compared to traditional methods.
  • Inference Time: The parallel processing method in FocusLLM, although marginally slower than standard models, offers substantial improvements in inference time over other long-context methods.

Extensions and Future Work

FocusLLM sets a noteworthy benchmark for future research in long-context LLMs. Several avenues for further exploration include:

  1. Dynamic Chunk Sizing: Investigating adaptive chunk sizes based on the structure of the input text.
  2. Synthetic Data: Leveraging synthetic datasets to enhance the model's training efficiency and performance on specific tasks.
  3. Extended Context Applications: Applying FocusLLM to domains requiring even longer context handling, such as multi-document summarization and complex narrative generation.

Conclusion

FocusLLM presents a substantial advancement in extending the context length of LLMs with minimal computational cost. By introducing a novel parallel decoding mechanism, it adeptly addresses the limitations of traditional transformer architectures. Its ability to maintain performance across extensive text lengths, coupled with efficient training and inference processes, makes FocusLLM a significant contribution to the study and application of long-context LLMs.

The insights and methodologies introduced here will undoubtedly stimulate further innovations in the design and utilization of LLMs capable of managing extensive and complex textual data, fostering new developments in natural language processing and beyond.

Authors (9)
  1. Zhenyu Li
  2. Yike Zhang
  3. Tengyu Pan
  4. Yutao Sun
  5. Zhichao Duan
  6. Junjie Fang
  7. Rong Han
  8. Zixuan Wang
  9. Jianyong Wang