KV Flush in Streaming VideoQA Systems
- KV flush is a mechanism that offloads older key-value attention caches from GPU to host RAM/disk, ensuring efficient memory use in long video streams.
- It decouples video encoding from question answering, minimizing recomputation and optimizing high-throughput StreamingVQA operations.
- Empirical results demonstrate a 34% reduction in computational costs and stable GPU memory usage during hour-long video processing.
A KV flush refers to the process of systematically evicting key-value (KV) attention caches from GPU memory to host RAM or disk in order to manage resource consumption during inference with sequence models. This mechanism is particularly relevant for streaming video question answering (StreamingVQA) systems built on video large language models (Video-LLMs), where the processed video context can quickly outgrow available GPU memory if left uncontrolled. In the ReKV approach, KV flush enables efficient, prompt, and memory-safe handling of arbitrarily long video streams and rapid on-demand query answering, without requiring repeated re-encoding of the entire video context (Di et al., 1 Mar 2025).
1. System Architecture and KV-Cache Lifecycle
The streaming VideoQA system integrating KV flush consists of three major components: a Video Stream Encoder (typically on GPU–A), a Global KV-Cache Store (in host RAM and optionally on disk), and one or more Retriever & QA Worker processes (on separate GPUs). The operational workflow is:
- Incoming video frames are processed by a vision encoder on GPU–A, which outputs tokenized representations.
- A decoder applies sliding-window attention to generate new (key, value) pairs for each input chunk.
- GPU–A maintains an in-device KV cache, retaining only the $w$ most recent frames and evicting older frames by flushing their KV-pairs to RAM (and disk if necessary).
- On receipt of a user question, a Retriever process selects a subset of KV-pairs relevant to the question from RAM/disk and loads them onto a QA GPU, which runs autoregressive decoding with those caches and the question tokens.
This division separates video encoding and question answering across devices, allowing high-throughput continuous ingestion and efficient multi-turn QA without recomputation (Di et al., 1 Mar 2025).
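The encode/flush/QA split above can be sketched as a producer/consumer pair sharing a host-side store. This is a minimal sketch under assumed names (`KVFlushPipeline`, `ingest_frame`, `answer` are illustrative, not from the source):

```python
from collections import deque

class KVFlushPipeline:
    """Minimal sketch of the encode/flush/QA split; names are illustrative."""

    def __init__(self, gpu_limit):
        self.gpu_kv = deque()   # recent caches kept "on GPU-A"
        self.ram_kv = []        # host-RAM store of flushed caches
        self.gpu_limit = gpu_limit

    def ingest_frame(self, frame_idx, kv_pair):
        # Encoder side: append the new (k, v) pair, flush the oldest if over budget.
        self.gpu_kv.append((frame_idx, kv_pair))
        while len(self.gpu_kv) > self.gpu_limit:
            self.ram_kv.append(self.gpu_kv.popleft())

    def answer(self, relevant_idxs):
        # QA side: pull only the question-relevant caches back for decoding.
        store = dict(self.ram_kv) | dict(self.gpu_kv)
        return [store[i] for i in relevant_idxs if i in store]
```

Because the encoder only ever appends and the QA side only ever reads, the two roles can run on separate devices exactly as the workflow above describes.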
2. Sliding-Window Attention and KV-Flush Mechanism
Sliding-window attention restricts computation to a configurable local window of recent frames. At any time $t$, let $C_{<t} = (K_{<t}, V_{<t})$ be the past cache, $X_t$ the current token sequence, and $w$ the local window size (in frames). The chunk's output is defined as:

$$O_t = \mathrm{Attn}\!\left(Q(X_t),\ \left[K_{t-w:t};\, K(X_t)\right],\ \left[V_{t-w:t};\, V(X_t)\right]\right)$$

where $K_{t-w:t}, V_{t-w:t}$ are the cached keys and values retained within the window.
Once the number of past cached frames exceeds $w$, the oldest KV entries are flushed to RAM/disk. This keeps GPU memory utilization bounded while retaining the information content required for attention in the local context.
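The windowed attention above can be illustrated in a few lines. This is a simplified single-head NumPy sketch (shapes and function name are assumptions, not from the source):

```python
import numpy as np

def sliding_window_attention(q, k_cache, v_cache, w):
    """Attend only over the last w cached (key, value) entries.

    q:        (n_query, d) current query vectors
    k_cache:  (n_past, d) cached keys; only the last w rows are used
    v_cache:  (n_past, d) cached values
    """
    k = k_cache[-w:]   # local window of keys
    v = v_cache[-w:]   # matching values
    scores = q @ k.T / np.sqrt(q.shape[-1])
    # Numerically stable softmax over the window dimension.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v
```

Everything outside the last `w` rows of the cache is never touched, which is precisely what makes those older entries safe to flush off the GPU.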
3. KV Flush, Serialization, and Retrieval Workflow
The flushing process operates as follows:
- The GPU maintains a deque holding at most the $M$ most-recent (k, v) pairs.
- Upon exceeding this size, the oldest cache is serialized (frame index, layer, head, FP16 tensors) and appended to RAM_KV, which may spill to DISK_KV if RAM is full.
- Retrieval upon query uses either an external retriever (e.g., CLIP-style encoder on video frames) or an internal similarity (mean-pooled LLM KV vectors).
- The retriever selects the top-$r$ relevant frames using cosine similarity, then the corresponding caches are reloaded to the GPU for question answering.
A schematic pseudocode representation illustrates the main steps:
```python
# Flush path: evict the oldest GPU cache entry once the budget is exceeded,
# spilling from RAM to disk when host memory is also full.
if len(GPU_KV) > M_gpu_limit:
    k_old, v_old = GPU_KV.popleft()
    RAM_KV.append(serialize(k_old, v_old))
    if size_of(RAM_KV) > RAM_capacity:
        DISK_KV.write(RAM_KV.pop(0))
```

```python
# Retrieval path: locate question-relevant caches in RAM/disk and stage
# them on the QA GPU (GPU-B).
def load_relevant_kv(query):
    idxs = retrieve_indices(query, RAM_KV, DISK_KV, top_r)
    kv_list = []
    for i in idxs:
        if i in RAM_index:
            kv_list.append(deserialize(RAM_KV[i]))
        else:
            kv_list.append(DISK_KV.read(i))
    return kv_list  # loaded onto GPU-B as its KV cache
```
Each record is structured as: [ frame_idx (int32) | layer_idx (uint8) | head_idx (uint8) | k FP16[D] | v FP16[D] ]. RAM_KV is a sequential array of these, while DISK_KV uses an append-only file plus an index file.
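The record layout above can be packed and unpacked with standard tools. The sketch below assumes a head dimension of 128 and little-endian byte order; both are illustrative choices, not values specified by the source:

```python
import struct
import numpy as np

D = 128  # head dimension (illustrative assumption)

def serialize_record(frame_idx, layer_idx, head_idx, k, v):
    """Pack one record as [int32 | uint8 | uint8 | k FP16[D] | v FP16[D]]."""
    header = struct.pack("<iBB", frame_idx, layer_idx, head_idx)  # 6 bytes
    return header + k.astype(np.float16).tobytes() + v.astype(np.float16).tobytes()

def deserialize_record(buf):
    """Inverse of serialize_record: recover indices and the two FP16 vectors."""
    frame_idx, layer_idx, head_idx = struct.unpack_from("<iBB", buf, 0)
    body = np.frombuffer(buf, dtype=np.float16, offset=6)
    return frame_idx, layer_idx, head_idx, body[:D], body[D:]
```

Fixed-width records like this make the append-only DISK_KV file trivially seekable: the index file only needs to store a record number, since every record occupies the same 6 + 4·D bytes.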
4. Retrieval Strategies and Scoring
Query-relevant KV-caches are identified via two primary strategies:
- External Retrieval: Utilizes a CLIP-like encoder to embed video frames, $E_v(f_i)$, and the query, $E_t(Q)$. Cosine similarity with learned temperature $\tau$:

$$s_i = \cos\!\big(E_v(f_i),\, E_t(Q)\big) / \tau$$
- Internal Retrieval: Utilizes mean-pooled KV vectors $\bar{k}_i$ and query token representations $\bar{q}$ directly from the LLM, with scaling $1/\sqrt{D}$:

$$s_i = \bar{q}^{\top} \bar{k}_i / \sqrt{D}$$
In both cases, the retriever selects the top-$r$ frames or blocks for reloading and downstream QA.
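Both strategies reduce to scoring stored per-frame vectors against a query vector and keeping the best $r$. A minimal sketch (the function name and the temperature value are illustrative assumptions):

```python
import numpy as np

def top_r_frames(query_vec, frame_vecs, r, tau=0.07):
    """Rank stored frames by temperature-scaled cosine similarity to the query
    and return the indices of the r best matches (tau is illustrative)."""
    q = query_vec / np.linalg.norm(query_vec)
    f = frame_vecs / np.linalg.norm(frame_vecs, axis=1, keepdims=True)
    scores = (f @ q) / tau          # cosine similarity, temperature-scaled
    return np.argsort(scores)[::-1][:r]   # indices of the top-r frames
```

The returned indices are exactly what `retrieve_indices` in the earlier pseudocode would hand to the cache-loading step.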
5. Performance and Efficiency Analysis
The primary impact of the KV flush mechanism is the transformation of resource and computational profiles:
- Memory Cost: For $N$ frames with $T$ tokens each, $L$ layers, $H$ KV heads, and $D$-dimensional head vectors stored in FP16, the total KV size is:

$$\mathrm{Mem}_{\mathrm{KV}} = 2 \cdot N \cdot T \cdot L \cdot H \cdot D \cdot 2\ \text{bytes},$$

where the leading factor of 2 counts keys and values. For LLaVA-OV-7B, the total for one hour of video at 0.5 FPS (1,800 frames) is ~18.8 GB.
- Compute Cost: Without flushing, each QA requires re-encoding all frames seen so far, so per-QA FLOPs grow linearly with stream length. With KV flush and retrieval, video encoding is amortized across the stream (each frame is encoded exactly once), and per-QA cost depends only on retrieval over the cached entries and answer generation, independent of total stream length.
- Empirical Results: At 360 QAs/hour, internal retrieval takes 5.6 TFLOPs/QA, baseline 8.5 TFLOPs/QA (34% reduction).
- GPU Memory: Without flushing, peak memory grows linearly with stream length and exhausts GPU memory beyond roughly 500 frames. With flush and internal retrieval, one-hour streams run at a stable peak of 38 GB.
- Latency: Uniform sampling, external, and internal retrieval yield per-QA response times of 2.9s, 5.8s, and 3.3s, respectively (Di et al., 1 Mar 2025).
- Encoding Throughput: Streaming video encoding throughput is 11 FPS for LLaVA-OV-7B due to decoupling of encode and QA operations.
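The quoted figures can be sanity-checked arithmetically. The per-model constants below are assumptions chosen to be consistent with the ~18.8 GB figure, not values stated in the source:

```python
# Reproduce the memory and compute numbers quoted above. The model constants
# are illustrative assumptions (a plausible 7B GQA configuration), not
# published values.
frames = int(3600 * 0.5)                   # one hour at 0.5 FPS -> 1,800 frames
tokens_per_frame = 196                     # assumed vision tokens per frame
layers, kv_heads, head_dim = 28, 4, 128    # assumed layer/head geometry
bytes_fp16 = 2
kv_factor = 2                              # one key plus one value per token

total_bytes = (kv_factor * frames * tokens_per_frame
               * layers * kv_heads * head_dim * bytes_fp16)
print(f"KV size: {total_bytes / 2**30:.1f} GiB")   # ~18.8 GiB

# Per-QA compute: internal retrieval vs. the re-encoding baseline.
baseline_tflops, rekv_tflops = 8.5, 5.6
reduction = 1 - rekv_tflops / baseline_tflops
print(f"reduction: {reduction:.0%}")               # 34%
```

The 5.6 vs. 8.5 TFLOPs/QA pair reproduces the stated 34% reduction exactly, and the product formula matches the ~18.8 GB per-hour figure under these assumed constants.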
6. Applications and Significance
KV flush enables scaled deployment of streaming VQA systems on limited hardware by bounding GPU memory, decoupling video ingestion from QA, and supporting high query rates on hour-long (or longer) videos. Video encoding can be performed once, with caches reused for multiple user queries, and only minimal, question-relevant caches are cycled through the high-performance GPU pathway. This approach addresses a central inefficiency in traditional VideoQA systems, where recomputation and unbounded attention windows are prohibitive. The described strategy can plausibly inform future memory management techniques for other real-time, sequence-to-sequence transformer models handling long-context modalities.
7. Summary Table: KV Flush Components in StreamingVQA
| Component | Function | Storage Location |
|---|---|---|
| Recent KV Cache | Sliding-window attention context | GPU (≤ $w$ frames) |
| Flushed KV Cache | Offloaded, queryable historic context | RAM (primary), Disk (overflow) |
| Retrieved KV Cache | Query-relevant history for QA | QA GPU |
The lifecycle managed by KV flush underpins the efficiency and scalability demonstrated within the ReKV StreamingVQA system, resulting in substantial improvement in memory overhead, computational cost, and responsiveness compared to non-flushed baselines (Di et al., 1 Mar 2025).