StreamingQA Benchmark Overview

Updated 26 January 2026
  • StreamingQA is a benchmark framework designed to assess models' capacity to maintain up-to-date knowledge over dynamic, temporally ordered data streams.
  • It encompasses both text-based and multimodal datasets, challenging systems with hierarchical reasoning, event detection, and contextual memory tasks.
  • Key evaluation metrics, including EM, F1, and reasoning scores, quantitatively measure model adaptability, memory retention, and real-time performance.

StreamingQA benchmarks evaluate the continuous adaptation of machine learning models to evolving, temporally ordered data streams. Unlike static knowledge evaluation, StreamingQA probes an agent's capacity to maintain up-to-date knowledge, integrate novel facts, and reason contextually over temporally indexed inputs. These benchmarks have diversified from text-based continual learning frameworks to multimodal domains including vision and video, where additional challenges of memory, event detection, and hierarchical reasoning emerge.

1. Dataset Construction and Streaming Protocols

StreamingQA datasets are characterized by their dynamic corpora and temporally aligned queries. The prototypical StreamingQA corpus comprises the WMT English news-crawl spanning 2007–2020, yielding approximately 11 million raw articles and 47 million six-sentence passages (Liška et al., 2022, Li et al., 2024, Kim et al., 2023). Each article $c_j$ is annotated with its publication date $d_{c,j}$, with associated QA pairs whose question date $d_{q,i}$ is either close to (recent subset) or arbitrarily far from (past subset) $d_{c,j}$. Question sources are split between human-written and LLM-generated, yielding train/test splits up to 100,000/35,000 QA pairs, with answer types distributed across named entities (≈47%), phrases (≈41%), and dates (≈12%) (Liška et al., 2022).
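For concreteness, one record in such a corpus might be represented as below; the field names are illustrative and do not reflect the released dataset's schema.

```python
from dataclasses import dataclass
from datetime import date
from typing import List

@dataclass
class StreamingQAExample:
    """Illustrative layout for a temporally annotated QA pair.

    Field names are hypothetical; the released dataset defines its own schema.
    """
    question: str
    question_date: date      # d_{q,i}: when the question is asked
    answers: List[str]       # acceptable gold answers (entity, phrase, or date)
    evidence_doc_id: str     # id of the supporting news article c_j
    evidence_doc_date: date  # d_{c,j}: publication date of the article
    source: str              # "human" or "generated"
```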

In video-centric StreamingQA benchmarks such as RTV-Bench and StreamingBench, annotated QA pairs are tethered to precise video timestamps and, where relevant, accompanying audio clips. RTV-Bench includes 200 real-world videos (150 hours, 4,631 QA pairs), designed to probe multi-timestamp understanding and hierarchical reasoning (Xun et al., 4 May 2025). StreamingBench provides 900 videos covering eight domains, with five multiple-choice questions per video, and further categorizes tasks into Real-Time Visual, Omni-Source, and Contextual Understanding (Lin et al., 2024).

The key protocol is to evaluate models at regular intervals or in an online fashion, as new documents, frames, or audio signals stream in. For textual benchmarks, quarterly updates are performed to simulate the model's lag relative to current knowledge. In video StreamingQA, models are exposed to sequential frames, and queries are timestamped to demand real-time or near-real-time response.
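A minimal sketch of this protocol, assuming the illustrative record type above and hypothetical `model.adapt` / `model.answer` interfaces: documents are folded in quarter by quarter, and questions are answered with only the knowledge streamed up to their quarter.

```python
def run_streaming_evaluation(model, articles, questions, quarters):
    """Sketch of the quarterly streaming protocol.

    `quarters` is a list of (start_date, end_date) pairs; `model.adapt` and
    `model.answer` are assumed interfaces (index refresh, fine-tuning, etc.),
    not a published API. Per-quarter granularity is a coarse approximation of
    per-document date filtering.
    """
    predictions = []
    for start, end in quarters:
        # Fold in the articles published during this quarter.
        new_docs = [a for a in articles if start <= a.evidence_doc_date <= end]
        model.adapt(new_docs)
        # Answer the questions whose question date falls in this quarter.
        for q in questions:
            if start <= q.question_date <= end:
                predictions.append((q, model.answer(q.question, q.question_date)))
    return predictions
```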

2. Task Definitions, Question Types, and Temporal Reasoning

StreamingQA benchmarks feature diverse task definitions rooted in the streaming nature of the data:

  • Textual StreamingQA: The goal is to answer queries using knowledge only up to the question date $d_{q,i}$, given all prior documents $\mathcal{C}_{\le t}$. Formally, the model must compute $\hat{a}_i \approx \arg\max_a p(a \mid q_i, d_{q,i}, \mathcal{C}_{\le t})$ (Liška et al., 2022). This directly measures knowledge currency and adaptation.
  • Video/Multimodal: In RTV-Bench, questions may reference multiple, non-contiguous timestamps $T = \{t_1, t_2, \ldots, t_k\}$ and require temporal aggregation, causal inference, or event detection (“How many times did the car stop before the light turned green?”) (Xun et al., 4 May 2025). Question structure is hierarchical:
    • Level-1: Perceptual ("What object is present at $t$?")
    • Level-2: Relational/Temporal ("How many events between $t_1$ and $t_2$?"; see the counting sketch after this list)
    • Level-3: Causal/Counterfactual ("Why did event X occur after Y?")
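As a toy illustration of the Level-2 temporal aggregation above, the sketch below counts annotated events inside a queried interval; the flat (timestamp, label) event representation is an assumption made for illustration, not the RTV-Bench annotation format.

```python
def count_events_between(events, t1, t2, label=None):
    """Count annotated events whose timestamps fall inside [t1, t2].

    `events` is a list of (timestamp_seconds, label) pairs; this flat format
    is illustrative, not the benchmark's annotation schema.
    """
    return sum(
        1 for ts, lbl in events
        if t1 <= ts <= t2 and (label is None or lbl == label)
    )

# Example: two "car_stop" events fall inside the 10-60 s window.
events = [(12.0, "car_stop"), (45.5, "car_stop"), (80.2, "light_green")]
print(count_events_between(events, 10.0, 60.0, label="car_stop"))  # -> 2
```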

StreamingBench further includes proactive detection (“emit a token when a goal occurs”), contextual memory tasks (resolving anaphora, ignoring outdated frames), and multimodal fusion queries requiring vision-audio alignment (Lin et al., 2024).

3. Evaluation Methodologies and Metrics

StreamingQA benchmarks employ rigorous, multi-axis evaluation protocols:

Text QA Metrics:

  • Exact Match (EM): $\mathrm{EM}(\hat{a}, a^*) = \mathbb{1}(\mathrm{normalize}(\hat{a}) = \mathrm{normalize}(a^*))$
  • F1 Score: For token-level overlap, $\mathrm{F1}(\hat{a}, a^*) = \frac{2 \cdot \mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}$ (both metrics are implemented in the sketch after this list)
  • Retrieval Recall@$k$ and MRR: Standard for semi-parametric (retriever+generator) architectures.
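The EM and F1 definitions above follow the standard SQuAD-style implementation; a self-contained version (the exact normalization steps are an assumption) looks roughly like this:

```python
import re
import string
from collections import Counter

def normalize(text):
    """Lower-case, strip punctuation and articles, collapse whitespace (SQuAD-style)."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(pred, gold):
    """1.0 if the normalized prediction equals the normalized gold answer, else 0.0."""
    return float(normalize(pred) == normalize(gold))

def f1_score(pred, gold):
    """Token-level F1 between prediction and gold answer."""
    pred_toks, gold_toks = normalize(pred).split(), normalize(gold).split()
    overlap = sum((Counter(pred_toks) & Counter(gold_toks)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_toks)
    recall = overlap / len(gold_toks)
    return 2 * precision * recall / (precision + recall)
```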

Video QA Metrics:

  • Perception Accuracy: $accuracy_{vis} = \frac{N_{correct\_frames}}{N_{total\_frames}}$
  • Semantic Understanding F1: $F1_{sem} = \frac{2 \cdot Precision \cdot Recall}{Precision + Recall}$
  • ReasoningScore: $ReasoningScore = \frac{\sum_{i=1}^{N_{steps}} w_i \, \mathbb{1}(\text{step}_i\; \text{correct})}{\sum_{i=1}^{N_{steps}} w_i}$ (a reference sketch follows this list)
  • Aggregated mAP: $mAP = \frac{1}{T} \sum_{t=1}^T AP(t)$ across all relevant timestamps.
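The two aggregate formulas above translate directly into code; the boolean-step and per-timestamp AP representations below are illustrative.

```python
def reasoning_score(step_correct, weights):
    """Weighted fraction of correct reasoning steps (the ReasoningScore formula)."""
    assert len(step_correct) == len(weights)
    total = sum(weights)
    if total == 0:
        return 0.0
    return sum(w for ok, w in zip(step_correct, weights) if ok) / total

def aggregated_map(ap_per_timestamp):
    """Mean average precision across all relevant timestamps."""
    return sum(ap_per_timestamp) / len(ap_per_timestamp)

print(reasoning_score([True, False, True], [2.0, 1.0, 1.0]))  # -> 0.75
```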

StreamingBench utilizes taskwise accuracy, including windowed proactive detection ($|t_{model} - t_{gt}| \leq 2$ s) for real-time triggers and robustness analyses against misleading context and dialogue memory (Lin et al., 2024).
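A minimal sketch of the windowed check, assuming timestamps in seconds:

```python
def proactive_hit(t_model, t_gt, window_s=2.0):
    """A proactively emitted trigger counts as correct when it falls within
    ±window_s seconds of the ground-truth event time (here the ±2 s window)."""
    return abs(t_model - t_gt) <= window_s

print(proactive_hit(13.4, 12.0))  # -> True
```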

4. Model Adaptation Paradigms and Comparative Results

4.1 Textual Models: Parametric vs Semi-parametric Adaptation

  • Closed-book QA (parametric): All knowledge is encoded in the LM weights. Updates occur by continued pre-training or fine-tuning on newly arrived documents.
  • Open-book QA (semi-parametric): Retriever+generator. Adaptation via search index updates (index refresh) or additional generator fine-tuning.

Empirical findings reveal that index-only semi-parametric adaptation (adding new documents) yields the largest, nearly instantaneous improvement on recent queries (FiD+IU: +22 points F1) with zero catastrophic forgetting, especially for low-frequency entities (Liška et al., 2022). Generator fine-tuning closes much of the remaining gap for high-frequency entities, while closed-book iterative fine-tuning is beneficial but cannot fully supplant retraining.
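A minimal sketch of the index-refresh idea under stated assumptions: newly streamed passages are embedded and appended to the search index while the generator's weights stay frozen. The `embed_fn` callable and the in-memory index are generic placeholders, not the FiD retrieval stack used in the paper.

```python
import numpy as np

class StreamingIndex:
    """Toy dense index supporting incremental refresh; the reader/generator
    that consumes retrieved passages is never updated."""

    def __init__(self, embed_fn):
        self.embed_fn = embed_fn      # placeholder: any text -> 1-D numpy vector
        self.vectors = []
        self.passages = []

    def refresh(self, new_passages):
        """Index refresh: fold newly streamed passages into the index."""
        for p in new_passages:
            self.vectors.append(self.embed_fn(p))
            self.passages.append(p)

    def retrieve(self, query, k=5):
        """Return the top-k passages by inner-product similarity."""
        if not self.vectors:
            return []
        scores = np.stack(self.vectors) @ self.embed_fn(query)
        top = np.argsort(-scores)[:k]
        return [self.passages[i] for i in top]
```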

4.2 Generative Retrieval in DynamicIR Settings

Generative Retrieval (GR) architectures, which leverage autoregressive LMs over n-gram–indexed corpora, demonstrate superior adaptability and efficiency relative to dual-encoder (DE) methods. On StreamingQA, GR outperforms DE in Hit@5 by +10.9–18.6 pp across update regimes and achieves up to 10× lower FLOPs per query, 7× faster indexing time, and 4× smaller storage footprint (Kim et al., 2023). GR's reduced sensitivity to timestamp tokens and stronger semantic generalization accelerate the integration of new facts while preserving old ones.

4.3 Multimodal StreamingQA: Video Benchmarks

RTV-Bench comparative results indicate that proprietary vision-LLMs (GPT-4o, Gemini 2.0) outperform open-source real-time pipelines and offline models on reasoning (mAP: GPT-4o 58.3%, Gemini 54.8%). However, perceptual accuracy is less affected by stream length than reasoning depth. Larger model size and higher frame sampling rates do not necessarily improve performance, whereas adaptive keyframe selection can yield up to +12% reasoning score (Xun et al., 4 May 2025).

StreamingBench reports that even top models (Gemini 1.5 Pro at 67.07% overall accuracy) fall short of human performance across 18 tasks (human: 91.66%), with pronounced deficits in Omni-Source and Contextual Understanding (Lin et al., 2024). Proactive detection and sequential memory remain significant bottlenecks.

5. Memory Compression and Continual Knowledge Learning

Compression Memory Training (CMT) introduces a parameter-free online adaptation paradigm augmenting LLMs with a memory bank of learned compressed representations for each incoming document (Li et al., 2024). Each article $d$ is compressed into $M^d \in \mathbb{R}^{k \times d_{hid}}$ (with $k = 24$) via a cross-attentive soft token mechanism. Memory-aware objectives and self-matching losses ensure answer probabilities are boosted by relevant context, while top-$k$ aggregation delivers scalable retrieval across thousands of memories.
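A schematic PyTorch sketch of the cross-attentive compression step, under the assumption that $k$ learned soft tokens attend over a document's hidden states to form its memory $M^d$; module choices, dimensions, and the scoring heuristic below are illustrative, not the released CMT implementation.

```python
import torch
import torch.nn as nn

class DocumentCompressor(nn.Module):
    """Schematic cross-attentive compressor: k learned soft tokens attend over a
    document's token representations and are stored as its memory M^d (k x d_hid).
    Hyper-parameters and layout are illustrative, not the released CMT code."""

    def __init__(self, d_hid=4096, k=24, n_heads=8):
        super().__init__()
        self.soft_tokens = nn.Parameter(torch.randn(k, d_hid) * 0.02)
        self.cross_attn = nn.MultiheadAttention(d_hid, n_heads, batch_first=True)

    def forward(self, doc_hidden):                        # doc_hidden: (1, seq_len, d_hid)
        queries = self.soft_tokens.unsqueeze(0)           # (1, k, d_hid)
        memory, _ = self.cross_attn(queries, doc_hidden, doc_hidden)
        return memory.squeeze(0)                          # (k, d_hid), i.e. M^d

def top_k_memories(query_vec, memory_bank, k=4):
    """Illustrative top-k aggregation: score each stored memory by its mean
    dot-product with the query representation and keep the k best."""
    scores = torch.stack([(m @ query_vec).mean() for m in memory_bank])
    idx = torch.topk(scores, min(k, len(memory_bank))).indices
    return [memory_bank[i] for i in idx]
```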

On StreamingQA, CMT achieves 18.36 EM / 25.98 F1 on Llama-2-7B, outperforming MAC (+4.07 EM, +4.19 F1), and retains up to 80% performance after hundreds of documents streamed. Ablations show memory-aware loss and self-matching are critical for knowledge retention and robustness against irrelevant context.

6. Bottlenecks, Diagnostic Findings, and Directions

StreamingQA benchmarks diagnose crucial limitations:

  • Context window constraints—transformer models lose relevant information over long streams.
  • Naïve sampling and indexing policies degrade reasoning for multi-timestamp and long-context queries.
  • Lack of explicit temporal memory modules impedes event and anomaly detection in video streams.
  • Semi-parametric adaptation circumvents catastrophic forgetting in textual benchmarks but may lag on high-frequency entities without parametric updates (Liška et al., 2022, Kim et al., 2023).
  • MLLMs fail to robustly align vision and audio or maintain dialogue continuity under sequential QA (Lin et al., 2024).

Recommended research directions include hierarchical memory graphs, adaptive event-driven sampling, modular fast-path/slow-path architectures, and temporal convolutional backbones to extend context. StreamingBench and RTV-Bench both highlight the need for native streaming interfaces supporting unbounded video/audio and proactive output generation.

7. Benchmark Toolkits, Reproducibility, and Resource Access

RTV-Bench and StreamingBench provide open code, comprehensive annotation suites, and visualization tools for skill profiling. Example RTV-Bench setup:

  1. Install dependencies: pip install -r requirements.txt
  2. Download the video/annotation packs via the links in the provided README.
  3. Run the benchmark: python eval.py --model MODEL_NAME --fps 1
  4. Plot skill profiles: python analyses/plot_skill_profile.py (Xun et al., 4 May 2025).

StreamingQA text benchmarks likewise distribute large-scale QA pairs, date-aligned corpora, and baseline model implementations. All major methods report reproducible EM/F1 or Hit@N metrics under controlled evaluation regimes.


In sum, StreamingQA benchmarks anchor the empirical study of continual, adaptive, and context-sensitive question answering across text and multimodal domains. Through fine-grained temporal assignment, hierarchical reasoning probes, and multidimensional scoring, they drive next-generation systems toward human-like comprehension under perpetual data evolution.
