
Inf-Streams-Eval Benchmark for Streaming VLMs

Updated 23 December 2025
  • Inf-Streams-Eval is a benchmark for real-time evaluation of vision-language models using precise per-second text alignment on multi-hour video streams.
  • It leverages full-length sports games and rigorous transcript cleaning via ASR and GPT-based editing to ensure detailed temporal correspondence.
  • Its dual protocol and multimodal metrics rigorously assess latency, memory use, and long-term context management, pushing VLMs towards real-world deployment.

Inf-Streams-Eval is a benchmark specifically designed to evaluate vision-language models (VLMs) on real-time, second-level understanding across multi-hour, near-infinite video streams. Developed for applications such as live sports commentary and autonomous agent systems, it addresses a fundamental limitation of previous benchmarks by requiring precise, per-second frame-to-text alignment at significant temporal scale, while maintaining stringent requirements for inference latency and memory utilization (Xu et al., 10 Oct 2025).

1. Motivation and Benchmark Design Principles

Traditional video-LLM benchmarks largely focus on short-form video inputs (seconds to minutes) or operate at coarse-grained retrieval, lacking evaluation of models' ability to maintain accurate, temporally granular representations over extended, continuous visual streams. In real-world deployments—such as robotic monitoring, live broadcast captioning, or surveillance—VLMs must align text to each incoming frame in real time, without letting memory or computational cost diverge as stream duration grows (Xu et al., 10 Oct 2025).

Inf-Streams-Eval is constructed to address these gaps through three core design principles:

  • Per-second alignment: Models must produce a text segment $s_t$ for each second $t$, precisely describing the corresponding video content.
  • Multi-hour video scale: Benchmarks are drawn from actual full-length sports games, each exceeding two hours, compelling models to demonstrate both immediate temporal coherence and long-term context management.
  • Two-mode protocol: Both "chunked" inference for non-streaming-capable baselines and genuine infinite-stream evaluation for streaming architectures are supported.
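
A minimal sketch of the two modes is given below, assuming a hypothetical streaming interface with `model.reset()` and `model.step(frame)`; the actual benchmark harness and model APIs may differ.

```python
# Sketch of Inf-Streams-Eval's two evaluation modes. The model interface
# (reset/step) is an assumption for illustration, not the benchmark's API.

def eval_chunked(model, frames, chunk_len=100):
    """Chunked mode: non-streaming baselines see one 100 s window at a time."""
    predictions = []
    for start in range(0, len(frames), chunk_len):
        model.reset()  # no state is carried across chunks
        for frame in frames[start:start + chunk_len]:
            predictions.append(model.step(frame))  # one caption per second
    return predictions

def eval_infinite(model, frames):
    """Infinite-stream mode: a single pass with state kept for the whole game."""
    model.reset()
    return [model.step(frame) for frame in frames]
```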

2. Dataset Construction and Curation

The dataset consists of 20 full-game videos sourced from five major sports domains: basketball, soccer, ice hockey, baseball, and American football. Videos were initially pooled from more than 6,000 hours of 360p–720p footage at 24 FPS, then subselected so that each sport contributes one held-out game for evaluation; games average 2.12 hours in length (Xu et al., 10 Oct 2025).

Key steps in data curation include:

  • ASR Extraction: WhisperX segments each game’s audio into time-stamped transcripts.
  • Cleaning Pass: GPT-5 (via RPC) processes transcripts in 120-second blocks, marking each sentence as "keep," "edit," or "delete." Edits are temporally normalized, and non-retained text or segments are excluded, ensuring robust correspondence between visuals and commentary intervals. This yields 46.32% of sentences kept, 37.89% edited, and 15.79% deleted.
  • Segmentation: Each game is partitioned into contiguous 100-second windows, filtered to retain only those with ≥200 ground-truth words of continuous ASR commentary, producing approximately 400 segments (Xu et al., 10 Oct 2025).
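
As a concrete illustration of the segmentation step, the sketch below partitions a game into 100-second windows and applies the ≥200-word filter; the `(start, end, text)` transcript tuple format is an assumption for illustration, not the paper's exact data schema.

```python
# Sketch of the 100 s windowing and word-count filter used to select
# evaluation segments. Transcript entries are assumed to be
# (start_sec, end_sec, text) tuples.

def segment_game(transcript, game_seconds, window=100, min_words=200):
    segments = []
    for w_start in range(0, game_seconds, window):
        w_end = w_start + window
        words = sum(
            len(text.split())
            for start, end, text in transcript
            if start < w_end and end > w_start  # sentence overlaps the window
        )
        if words >= min_words:
            segments.append((w_start, w_end))
    return segments
```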

3. Annotation, Task Definition, and Alignment

The annotation protocol establishes dense, per-second targets for model prediction. For each game, video is sampled at 1 Hz to produce a sequence of frames $F = \{f_t\}_{t=1}^{T}$. Text commentary is aligned in parallel as $S = \{s_t\}_{t=1}^{T}$, where $s_t$ denotes the commentary spoken during second $t$ (Xu et al., 10 Oct 2025).
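
A minimal sketch of how per-second targets $s_t$ can be assembled from time-stamped sentences follows, again assuming `(start, end, text)` tuples; the paper's exact alignment rules (e.g., how sentences spanning multiple seconds are split) are not reproduced here.

```python
# Sketch of building per-second targets s_t at 1 Hz from time-stamped
# commentary. A sentence is attached to every second it overlaps, which is
# a simplification of whatever splitting rule the benchmark actually uses.

def per_second_targets(transcript, total_seconds):
    targets = ["" for _ in range(total_seconds)]
    for start, end, text in transcript:
        for t in range(int(start), min(int(end) + 1, total_seconds)):
            targets[t] = (targets[t] + " " + text).strip()
    return targets
```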

The core task is formalized as producing, for each frame $f_t$ at time $t$, a predicted text segment $\hat{s}_t$. The alignment function $A: F \rightarrow S$ seeks $A(f_t) = s_t$. The benchmark compares $\hat{s}_t$ to $s_t$ via automated LLM-based evaluation, using both exact-match and meaning-equivalent scoring (with a GPT-5 judge), thereby accommodating variability in natural language output while maintaining consistency in alignment evaluation.

Quality control is performed through manual inspection of a random 10% of windows, verifying that cleaned transcript timing diverges from the ground truth by no more than ±1 second.

4. Evaluation Protocol and Metrics

Inf-Streams-Eval employs a suite of metrics to rigorously assess both fidelity and efficiency:

  • Per-Second Accuracy:

$$\mathrm{Acc} = \frac{1}{T} \sum_{t=1}^{T} \mathbb{1}[\hat{s}_t = s_t]$$

where the indicator is evaluated by an LLM "meaning-equality" judgment rather than a strict token match (see the sketch after this list).

  • Alignment F1 (Optional): Precision and recall at the token level within each second, aggregated across the dataset,

$$\mathrm{F1} = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$

  • Latency: Mean wall-clock inference time per frame (or per output token), with the real-time requirement defined as $L_{\mathrm{ms/frame}} \le 100$ ms. For example, StreamingVLM registers approximately 0.08 s per token on an NVIDIA H100.
  • Throughput and Memory: Reported in frames per second (FPS) and peak GPU memory, respectively. StreamingVLM achieves 8 FPS at sub-40 GB steady-state GPU memory (Xu et al., 10 Oct 2025).
  • Pairwise Win Rate: For every video segment, outputs from two models and the ground truth $s_t$ are submitted to GPT-5, which returns a preference. The win rate $W(A, B)$ is the fraction of segments where A is selected over B, empirically validating model alignment against baselines such as GPT-4o mini and LiveCC (Xu et al., 10 Oct 2025). A sketch of this protocol follows the summary table below.
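
A minimal sketch of the per-second accuracy and latency computations, assuming a hypothetical `meaning_equal(pred, ref)` judge callable and the same `model.step` interface as above:

```python
import time

# Sketch of per-second accuracy (LLM meaning-equality judge) and mean
# per-frame latency. `meaning_equal` stands in for the GPT-5 judge and is
# an assumption for illustration.

def per_second_accuracy(preds, targets, meaning_equal):
    hits = sum(bool(meaning_equal(p, s)) for p, s in zip(preds, targets))
    return hits / len(targets)

def mean_latency_ms(model, frames):
    total = 0.0
    for frame in frames:
        t0 = time.perf_counter()
        model.step(frame)
        total += (time.perf_counter() - t0) * 1000.0
    return total / len(frames)  # real-time target: <= 100 ms per frame
```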

A summary of win rates is given below:

Model                    vs GPT-4o   vs LiveCC†   vs LiveCC∞
Qwen-2.5†                0.01        20.44        95.97
LiveCC-7B†               15.73       –            –
LiveCC-7B∞               1.82        –            –
StreamingVLM∞ (ours)     66.18       87.81        99.12

†: chunked (100 s) mode; ∞: infinite streaming. Win rates in %.
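
The pairwise protocol itself can be sketched as below; `ask_judge` is a placeholder for the GPT-5 preference call, and the judging prompt is not reproduced here.

```python
# Sketch of the pairwise win-rate protocol: for each segment the judge sees
# both model outputs plus the ground-truth commentary and picks "A" or "B".
# `ask_judge` is a hypothetical wrapper around the GPT-5 judging call.

def win_rate(outputs_a, outputs_b, references, ask_judge):
    wins = sum(
        ask_judge(a, b, ref) == "A"
        for a, b, ref in zip(outputs_a, outputs_b, references)
    )
    return wins / len(references)
```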

5. Baselines, Results, and Performance Analysis

Inf-Streams-Eval has been used to evaluate a range of models:

  • GPT-4o mini (chunked only)
  • LiveCC-7B-Instruct (chunked and infinite modes)
  • Qwen-2.5-VL-7B-Instruct (no streaming fine-tuning)
  • ReKV (applied to Qwen or StreamingVLM)
  • StreamingVLM (streaming-capable architecture)

Key empirical findings include:

  • StreamingVLM achieves a 66.18% win rate vs. GPT-4o mini and 87.81–99.12% vs. LiveCC on infinite streams.
  • Efficiency benchmarks indicate StreamingVLM sustains 8 FPS for over 3 hours, at <40 GB GPU memory, with latency well within real-time requirements (Xu et al., 10 Oct 2025).
  • Ablation reveals that contiguous RoPE positional encoding is essential; omitting it drops the infinite-stream win rate from 66.18% to 25.09%. A compact memory of recent tokens (an attention sink of 512 tokens plus a text window of 512 tokens) provides the best trade-off between long-term alignment and resource use (see the sketch below).
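
A minimal sketch of the sink-plus-recent-window KV policy and contiguous position re-indexing implied by the ablation is shown below; the 512/512 sizes come from the text, while the implementation details are assumptions.

```python
# Sketch of the KV-cache policy suggested by the ablation: retain a fixed
# "sink" of the earliest tokens plus a recent window, and assign contiguous
# positional indices to the retained tokens so RoPE positions stay within
# the range seen during training. Sizes (512 sink, 512 recent text tokens)
# follow the text; everything else is illustrative.

def evict_kv(cache, sink=512, recent=512):
    """cache: list of cached (key, value) entries in arrival order."""
    if len(cache) <= sink + recent:
        return cache
    return cache[:sink] + cache[-recent:]

def contiguous_positions(num_retained):
    """Re-indexed positions 0..num_retained-1, independent of how many
    tokens were evicted from the middle of the stream."""
    return list(range(num_retained))
```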

6. Methodological Recommendations and Extensibility

For future extensions:

  • Training: Overlapped-chunk full attention (e.g., a 24 s window with 12 s overlap) is recommended to align training with streaming inference patterns (see the sketch after this list).
  • Positional Embedding: Contiguous RoPE should be implemented to ensure positional indices remain bounded within the training distribution.
  • Evaluation: Domain coverage can be extended (e.g., to news or surveillance) by adding new multi-hour video sources and corresponding per-second transcripts, adhering to the pairwise LLM-judged protocol to maintain evaluation consistency (Xu et al., 10 Oct 2025).
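
For the overlapped-chunk schedule in the first recommendation, a small sketch of generating 24 s windows with 12 s overlap is shown below (window and overlap values come from the text; how frames are packed into model inputs is otherwise assumed).

```python
# Sketch of the overlapped-chunk schedule recommended for training:
# 24 s windows advanced by 12 s, so each second is seen in two chunks.

def overlapped_chunks(total_seconds, window=24, overlap=12):
    stride = window - overlap
    return [
        (start, min(start + window, total_seconds))
        for start in range(0, max(total_seconds - overlap, 1), stride)
    ]

# e.g. overlapped_chunks(60) -> [(0, 24), (12, 36), (24, 48), (36, 60)]
```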

7. Significance and Position within the Benchmarking Landscape

Inf-Streams-Eval addresses the methodological insufficiency of prior short-form or coarse-alignment video-language benchmarks by establishing stringent demands for streaming, temporally granular, multi-hour video understanding. Its design prioritizes real-time processing, explicitly exposing coherence, latency, and memory trade-offs that remain hidden under finite or chunked evaluation regimes. A plausible implication is that models and architectures optimized for Inf-Streams-Eval are better suited to real-world, continuous, time-critical multimodal applications than those evaluated solely on short, batch-oriented, or retrieval tasks (Xu et al., 10 Oct 2025).
