
InfiniteHiP: Extending Language Model Context Up to 3 Million Tokens on a Single GPU (2502.08910v1)

Published 13 Feb 2025 in cs.CL and cs.LG

Abstract: In modern LLMs, handling very long context lengths presents significant challenges as it causes slower inference speeds and increased memory costs. Additionally, most existing pre-trained LLMs fail to generalize beyond their original training sequence lengths. To enable efficient and practical long-context utilization, we introduce InfiniteHiP, a novel, and practical LLM inference framework that accelerates processing by dynamically eliminating irrelevant context tokens through a modular hierarchical token pruning algorithm. Our method also allows generalization to longer sequences by selectively applying various RoPE adjustment methods according to the internal attention patterns within LLMs. Furthermore, we offload the key-value cache to host memory during inference, significantly reducing GPU memory pressure. As a result, InfiniteHiP enables the processing of up to 3 million tokens on a single L40s 48GB GPU -- 3x larger -- without any permanent loss of context information. Our framework achieves an 18.95x speedup in attention decoding for a 1 million token context without requiring additional training. We implement our method in the SGLang framework and demonstrate its effectiveness and practicality through extensive evaluations.

The paper addresses a major challenge in using LLMs: processing extremely long inputs efficiently. In many applications—from reading long articles to having extended conversations—the model’s ability to consider a larger context is crucial. However, as the number of input tokens grows, two problems arise:

  • The attention mechanism used by these models computes interactions between every pair of tokens, so computation and memory grow quadratically with the length of the input.
  • The model’s key-value (KV) cache, which stores information from previous tokens for quick reference during generation, can become too large to fit in the limited memory of a modern GPU.

To solve these problems, the paper presents a novel framework that combines several smart techniques into one system, which enables LLMs to handle up to 3 million tokens on a single GPU. The work focuses on making the inference process both faster and less memory intensive without changing the original pre-trained model.


Key Ideas in the Approach

  1. Hierarchical Token Pruning
  • The framework first tackles the huge number of tokens by removing those that do not contribute much to the final output.
  • It divides the entire sequence into small, fixed-sized chunks and uses a multi-stage, hierarchical algorithm to decide which groups of tokens are important.
  • In each stage, the algorithm picks a “representative” token from each chunk. This selection is done by comparing the attention scores—the numbers that tell the model how much one token should “pay attention” to another—in a way that is efficient and only requires looking at a few tokens at a time.
  • By performing several rounds of such pruning, the system builds a sparse attention mask. This mask tells the model which tokens to focus on for each part of the input. The result is that only a small subset of tokens is involved in the expensive attention computations.
  2. Dynamic Positional Embedding Adjustments (RoPE Adjustments)
  • LLMs use positional embeddings to know the order of the tokens. However, these embeddings are normally trained on inputs of a fixed length.
  • To allow the model to generalize beyond this length (a property called out-of-length generalization), the system dynamically adjusts these positional embeddings.
  • Two main approaches are used:
    • Chunk-indexed RoPE: In this method, every group of tokens receives one common position identifier, which simplifies the calculation for that group.
    • Relative-style RoPE: This method incorporates relative positions between tokens so that the attention mechanism can better capture how tokens relate to each other beyond their original training range.
  • By selectively applying these strategies in different layers of the model, the approach ensures that the model correctly tracks the order and structure of very long inputs. A brief sketch of both position-assignment schemes appears after this list.
  3. KV Cache Offloading with Optimized Memory Management
  • Instead of keeping the entire key-value cache in GPU memory (which would be impractical for millions of tokens), the proposed method offloads part of this cache to host (CPU) memory.
  • A careful caching policy based on the “Least Recently Used” (LRU) principle is applied. This means that tokens that have not been accessed for a while are moved out of GPU memory, and they are reloaded only when needed again.
  • This offloading greatly reduces GPU memory pressure during inference while keeping the full context available, without permanently discarding any information; a minimal LRU sketch appears after this list.
  4. Efficient Sparse Block Attention
  • Once the framework has determined which tokens are significant through hierarchical pruning, it performs a type of computation called block sparse attention.
  • In this method, only the interactions among the selected blocks of tokens are computed, which further cuts down the computation required compared to evaluating every token pair. A sketch of this step also follows below.
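
To make the two position-assignment schemes from item 2 concrete, here is a minimal NumPy sketch. The function names, the chunk size, and the clamping of relative distances to the trained range are illustrative assumptions rather than the paper's exact formulation; the point is only that chunk-indexed RoPE gives every token in a chunk the same position index, while the relative-style variant works with query-key distances.

```python
import numpy as np

def chunk_indexed_positions(num_tokens: int, chunk_size: int = 64) -> np.ndarray:
    """Chunk-indexed RoPE positions: every token in chunk c gets position c,
    so the largest position the model sees grows with num_tokens / chunk_size
    rather than with num_tokens. chunk_size here is an illustrative choice."""
    return np.arange(num_tokens) // chunk_size

def relative_positions(query_pos: int, key_positions: np.ndarray,
                       max_trained_distance: int = 32768) -> np.ndarray:
    """Relative-style positions: the distance from the query to each key,
    clamped (an assumption in this sketch) so it never exceeds the range
    the model was trained on."""
    distances = query_pos - key_positions
    return np.clip(distances, 0, max_trained_distance - 1)

# Illustrative usage for a 1M-token context.
key_positions = np.arange(1_000_000)
print(chunk_indexed_positions(1_000_000)[[0, 63, 64, -1]])   # [0 0 1 15624]
print(relative_positions(999_999, key_positions)[[0, -1]])   # [32767 0]
```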
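
The LRU offloading policy from item 3 can be pictured with a small sketch. This is not the paper's memory manager; it only shows the idea of keeping a bounded set of KV-cache chunks on the GPU, spilling the least recently used chunk to host memory, and fetching it back on the next access. The class and method names are hypothetical.

```python
from collections import OrderedDict

class KVCacheOffloader:
    """Minimal LRU policy over KV-cache chunks: a bounded set of chunks stays
    on the GPU; the least recently used chunk is spilled to host memory and
    re-fetched on the next access. Names and structure are illustrative."""

    def __init__(self, gpu_capacity_chunks: int):
        self.capacity = gpu_capacity_chunks
        self.on_gpu = OrderedDict()   # chunk_id -> tensor (most recent at end)
        self.on_host = {}             # chunk_id -> tensor (CPU copy)

    def access(self, chunk_id, load_fn=None):
        if chunk_id in self.on_gpu:                 # hit: mark most recently used
            self.on_gpu.move_to_end(chunk_id)
            return self.on_gpu[chunk_id]
        # Miss: bring the chunk back from host memory (or create it).
        chunk = self.on_host.pop(chunk_id, None)
        if chunk is None and load_fn is not None:
            chunk = load_fn(chunk_id)
        if len(self.on_gpu) >= self.capacity:       # evict the LRU chunk to host
            old_id, old_chunk = self.on_gpu.popitem(last=False)
            self.on_host[old_id] = old_chunk
        self.on_gpu[chunk_id] = chunk
        return chunk
```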
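
Item 4's block sparse attention step can be sketched as follows: given the block indices kept by pruning, attention for the current query is computed only over those key/value blocks rather than the full caches. The plain NumPy function below is an illustration under that assumption, not the paper's fused GPU kernel.

```python
import numpy as np

def block_sparse_attention(q, K, V, kept_blocks, block_size=64):
    """Attend only to the key/value blocks selected by pruning.

    q:           (d,)       query vector for the current decoding step
    K, V:        (n, d)     full key / value caches
    kept_blocks: list[int]  block indices that survived pruning
    """
    # Gather only the selected blocks instead of the full n x d caches.
    idx = np.concatenate([np.arange(b * block_size, (b + 1) * block_size)
                          for b in kept_blocks])
    K_sel, V_sel = K[idx], V[idx]
    scores = K_sel @ q / np.sqrt(q.shape[-1])        # (m,) with m << n
    weights = np.exp(scores - scores.max())          # numerically stable softmax
    weights /= weights.sum()
    return weights @ V_sel                           # (d,) attention output

# Tiny usage example with random data.
rng = np.random.default_rng(0)
d, n = 16, 4096
out = block_sparse_attention(rng.normal(size=d),
                             rng.normal(size=(n, d)),
                             rng.normal(size=(n, d)),
                             kept_blocks=[0, 7, 63])
```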

Algorithm Details

  • Multi-Stage Pruning:

The pruning process starts by considering all tokens (except for a few that are always kept for certain purposes, like "sink" tokens and the most recent tokens). In each stage, the algorithm:

  1. Groups tokens into chunks.
  2. Chooses a representative token from each chunk by quickly estimating which token is likely to yield the highest attention score.
  3. Uses these estimates to discard chunks that are unlikely to contribute significant information.
  4. Passes the remaining tokens to the next stage for further refinement.
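
A minimal sketch of this loop is shown below, with simple dot-product scores as the chunk-importance estimate and made-up parameter names; it captures the shape of the procedure (chunk, score, keep the top chunks, repeat) rather than the paper's exact kernels. For clarity, each chunk is scored here by a full max over its tokens; the paper's representative selection avoids scanning the whole chunk, as described next.

```python
import numpy as np

def hierarchical_prune(q, K, chunk_size=64, keep_ratio=0.25, stages=3,
                       num_sink=4, num_recent=64):
    """Return the key indices that survive multi-stage chunk pruning.

    At each stage, candidate keys are grouped into chunks, each chunk is
    scored by the best dot product any of its tokens achieves against the
    query q, and only the top keep_ratio fraction of chunks moves on to the
    next stage. Sink tokens and the most recent tokens are always kept.
    """
    n = K.shape[0]
    always_keep = np.concatenate([np.arange(num_sink),
                                  np.arange(max(num_sink, n - num_recent), n)])
    candidates = np.arange(num_sink, max(num_sink, n - num_recent))

    for _ in range(stages):
        if len(candidates) <= chunk_size:
            break
        num_chunks = len(candidates) // chunk_size
        # Trailing remainder tokens are dropped here for simplicity.
        chunks = candidates[:num_chunks * chunk_size].reshape(num_chunks, chunk_size)
        chunk_scores = (K[chunks] @ q).max(axis=1)   # best token per chunk
        keep = max(1, int(num_chunks * keep_ratio))
        top = np.argsort(chunk_scores)[-keep:]       # keep the highest-scoring chunks
        candidates = chunks[top].ravel()

    return np.concatenate([always_keep, candidates])
```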

  • Representative Token Selection:

Instead of examining every token in a chunk, the method uses a hierarchical selection approach that zeroes in on the token with the highest potential contribution, doing so in time proportional to the logarithm of the chunk size.
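
One way to picture such a logarithmic-time selection is a hierarchical descent that scores only one probe token per half at each level. The probe placement and the plain dot-product score below are simplifying assumptions; the paper's estimator differs in detail.

```python
import numpy as np

def select_representative(q, K_chunk):
    """Hierarchically narrow a chunk down to one representative token.

    At each level the current span is split in half, one probe token per half
    is scored against the query, and the search descends into the half whose
    probe scores higher. Only O(log chunk_size) tokens are touched instead of
    the whole chunk. Probe choice and scoring are illustrative assumptions.
    """
    lo, hi = 0, K_chunk.shape[0]                  # current half-open span
    while hi - lo > 1:
        mid = (lo + hi) // 2
        left_probe = (lo + mid) // 2              # middle of the left half
        right_probe = (mid + hi) // 2             # middle of the right half
        if K_chunk[left_probe] @ q >= K_chunk[right_probe] @ q:
            hi = mid                              # keep the left half
        else:
            lo = mid                              # keep the right half
    return lo                                     # index of the representative
```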

  • Sparse Attention Mask Caching:

To further reduce latency during decoding (when the model generates tokens one at a time), the system periodically updates and caches the sparse attention masks instead of computing them from scratch at every single step. This reuse of computed masks helps keep the inference process fast.
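
A small sketch of that refresh policy, with a hypothetical refresh interval and a stand-in recompute function:

```python
class MaskCache:
    """Reuse the sparse attention mask across decoding steps and rebuild it
    only every `refresh_interval` steps. `recompute_fn` stands in for the
    hierarchical pruning pass that produces the kept-token indices; both
    names are illustrative, not the paper's API."""

    def __init__(self, recompute_fn, refresh_interval: int = 16):
        self.recompute_fn = recompute_fn
        self.refresh_interval = refresh_interval
        self.cached_mask = None
        self.steps_since_refresh = 0

    def get_mask(self, query_state):
        stale = (self.cached_mask is None or
                 self.steps_since_refresh >= self.refresh_interval)
        if stale:
            self.cached_mask = self.recompute_fn(query_state)  # full pruning pass
            self.steps_since_refresh = 0
        self.steps_since_refresh += 1
        return self.cached_mask
```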


Experimental Results and Impact

  • In evaluations on widely used long-context benchmarks, the framework demonstrates significant improvements in decoding speed. For example, on a 1 million token context, attention decoding is reported to be 18.95x faster than dense attention, without any additional training.
  • The combined techniques extend the model's practical context length from its original limit (typically in the tens of thousands of tokens) to as many as 3 million tokens on a single 48GB GPU (an L40S in the paper's setup).
  • The experiments also show that this efficiency is achieved without sacrificing the model’s ability to understand and generate coherent text from long inputs.

Conclusion

The framework introduced in this paper presents a comprehensive, training-free solution for one of the most difficult aspects of deploying LLMs: managing long-context inputs. By intelligently pruning irrelevant tokens, adjusting positional embeddings on the fly, offloading key-value caches to host memory, and computing sparse attention, the method allows a single GPU to handle applications that require processing millions of tokens. This approach paves the way for more practical use of LLMs in scenarios such as long-document understanding, extended dialogues, and other tasks where context length has previously been a major bottleneck.

Authors (4)
  1. Heejun Lee (7 papers)
  2. Geon Park (5 papers)
  3. Jaduk Suh (2 papers)
  4. Sung Ju Hwang (178 papers)