Proact-VL: Real-Time Video Companion

Updated 4 July 2026

Proact-VL is a proactive streaming framework that leverages multimodal LLMs to deliver real-time gaming commentary and user guidance.
It employs a chunk-wise processing scheme with a dedicated response gate to ensure low-latency, autonomous decisions on when to speak.
The system is validated using the Live Gaming Benchmark, demonstrating strong performance in solo commentary, co-commentary, and real-time user assistance.

Proact-VL is a framework for transforming a multimodal LLM into a proactive, real-time interactive VideoLLM for AI companionship in streaming video, particularly gaming commentary and user guidance. It is designed around three coupled requirements: low-latency inference on continuous video streams, autonomous response triggering without explicit prompts, and control over both the quality and the quantity of generated content under real-time constraints. The system is introduced together with the Live Gaming Benchmark, a large-scale dataset and evaluation suite spanning solo commentary, co-commentary, and real-time user guidance, and is positioned within a broader line of work on proactive agents that emphasizes timely intervention rather than purely reactive response generation (Yan et al., 3 Mar 2026). In that sense, Proact-VL extends the proactive-assistance agenda from general work settings to streaming video interaction, where the central question is not only what to say, but also whether to speak at all in a given second (Tang et al., 4 Feb 2026).

1. Problem setting and research context

Proact-VL targets real-time AI companions for streaming video, especially gaming commentary and user guidance. Its problem formulation is motivated by the claim that human-like companionship requires not only correct visual understanding, but also the ability to decide when to speak, how long to speak, and how to maintain low latency under continuous streaming input (Yan et al., 3 Mar 2026).

The paper identifies three core challenges. First is low-latency inference on continuous video streams. Second is autonomous response triggering, meaning that the model must decide whether a response is warranted without relying solely on explicit user prompts. Third is controlling response quality and length so that the assistant remains useful without over-talking. These constraints make the task materially different from conventional offline video captioning or instruction-following, where the model is typically invoked only after a complete clip or an explicit query is available (Yan et al., 3 Mar 2026).

The study instantiates these requirements through two gaming applications: Commentator, which includes autonomous live commentary and multi-speaker co-commentary, and Guide, which provides real-time player assistance and instructional guidance. The three scenarios used throughout the benchmark are Solo Commentary, Co-commentary, and Real-time User Guidance (Yan et al., 3 Mar 2026). A plausible implication is that gaming was selected not merely for convenience, but because it supplies densely eventful visual streams together with naturally occurring spoken supervision and tractable automatic evaluation signals.

This framing places Proact-VL within a broader proactive-agent research direction. In related work on proactive assistance in human-computer interaction, proactive systems are decomposed into determining when intervention is needed and how to assist once triggered, with explicit attention to interruption costs and missed opportunities (Tang et al., 4 Feb 2026). Proact-VL operationalizes a comparable concern in streaming video, but in a continuous, second-level interaction regime rather than desktop workflow logs.

2. Architecture and streaming formulation

Proact-VL is built as a framework on top of a multimodal LLM, using a chunk-wise input-output schema together with a proactive response gate (Yan et al., 3 Mar 2026). Its core unit of processing is the one-second chunk. At each second, the model consumes a structured input consisting of the current video chunk $V_t$, an optional user query $Q_t$, and background or environment context $B_t$, including prior commentary summaries. It outputs either a short utterance for that second or silence. The streaming process is formalized as

$(U_t, K_t) = f_\theta(V_t, Q_t, B_t; K_{t-1}),$

where $U_t$ is the chunk-level output utterance, $K_t$ is the updated KV cache, and $K_{t-1}$ carries past context forward (Yan et al., 3 Mar 2026).

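The per-chunk input is serialized as a user turn delimited by special tokens, with history $H$, the video chunk, the query $Q$, and a trailing flag token:

$\text{User: } \langle |history\_start| \rangle H \langle |history\_end| \rangle \langle |vision\_bos| \rangle \langle |VIDEO| \rangle \langle |vision\_eos| \rangle \langle |query\_start| \rangle Q \langle |query\_end| \rangle \langle |FLAG| \rangle$

(Yan et al., 3 Mar 2026).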

This serialization enables a decide-then-generate pipeline. The model first executes a priming forward pass, then extracts the hidden state at <|FLAG|>, feeds it into a lightweight response head, and, contingent on the resulting score relative to a threshold $T$, either generates commentary or remains silent (Yan et al., 3 Mar 2026). Architecturally, this separates response timing from text realization while still conditioning both on a shared multimodal context representation.
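The decide-then-generate loop can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: `StubModel`, its `prime` and `generate` methods, and the toy scoring rule are hypothetical stand-ins for the multimodal LLM, the response head, and the KV cache.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class StubModel:
    """Hypothetical stand-in for the multimodal LLM plus response head."""
    threshold: float = 0.5
    cache: List[str] = field(default_factory=list)  # plays the role of the KV cache

    def prime(self, chunk: str, query: Optional[str]) -> float:
        """Priming pass: append the new chunk to the cache, return a speak score."""
        self.cache.append(f"chunk:{chunk}")
        if query is not None:
            self.cache.append(f"query:{query}")
        # Toy rule (assumption): eventful chunks or explicit queries score high.
        return 0.9 if ("event" in chunk or query is not None) else 0.1

    def generate(self, chunk: str) -> str:
        """Generation pass: emit a short utterance and keep it as future context."""
        utterance = f"comment on {chunk}"
        self.cache.append(f"utterance:{utterance}")
        return utterance


def stream(model: StubModel,
           chunks: List[Tuple[str, Optional[str]]]) -> List[Optional[str]]:
    """Process one chunk per step; output an utterance or None (silence)."""
    outputs: List[Optional[str]] = []
    for chunk, query in chunks:
        score = model.prime(chunk, query)        # decide ...
        if score >= model.threshold:
            outputs.append(model.generate(chunk))  # ... then generate
        else:
            outputs.append(None)                 # stay silent for this second
    return outputs


model = StubModel()
outputs = stream(model, [("idle", None), ("event-goal", None), ("idle", "what now?")])
```

In this toy run the first chunk yields silence, while the eventful chunk and the chunk carrying a user query both trigger an utterance; the cache grows in every case, so silent seconds still contribute context.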

A central engineering component is the persistent transformer KV cache. All previous tokens remain available as context, and current generated utterances are appended to history and become future input. This makes the model incremental and efficient in a way that is compatible with streaming inference (Yan et al., 3 Mar 2026). To support arbitrarily long streams under a finite context window, Proact-VL uses a dual-cache sliding window, evicts the oldest 20% of the streaming cache once the context is too long, and applies a reverse-RoPE correction to re-base positions after eviction. The positional correction is derived from RoPE’s rotation property,

$R(p_1)R(p_2)=R(p_1+p_2), \quad R(-a)=R(a)^{-1}$

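and, for a cached key $k_p = R(p)\,k$ stored at position $p$, re-basing by the number of evicted positions $\Delta$ gives

$\tilde{k}_p = R(-\Delta)\,k_p = R(p - \Delta)\,k,$

which is intended to avoid positional discontinuity after cache eviction (Yan et al., 3 Mar 2026).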
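The eviction and re-basing step can be illustrated with a small pure-Python sketch. It is an assumption-laden toy rather than the paper's implementation: each 2-D RoPE pair is stored as one complex number, the frequencies in `THETAS` are arbitrary, only the streaming part of the cache is modeled, and the names `rope` and `evict_and_rebase` are hypothetical.

```python
import cmath
from typing import List, Tuple

THETAS = [1.0, 0.1, 0.01]  # toy per-pair rotation frequencies (assumption)


def rope(vec: List[complex], pos: int) -> List[complex]:
    """Apply R(pos): rotate each 2-D pair (one complex number) by pos * theta."""
    return [z * cmath.exp(1j * pos * th) for z, th in zip(vec, THETAS)]


def evict_and_rebase(cache: List[List[complex]],
                     evict_frac: float = 0.2) -> Tuple[List[List[complex]], int]:
    """Drop the oldest fraction of cached keys and rotate survivors by R(-delta)."""
    delta = int(len(cache) * evict_frac)
    kept = cache[delta:]
    return [rope(key, -delta) for key in kept], delta


# Ten un-rotated keys, cached with their original positions 0..9 baked in.
raw = [[complex(i + 1, 0.5), complex(0.2, i), complex(1, 1)] for i in range(10)]
cache = [rope(key, pos) for pos, key in enumerate(raw)]

rebased, delta = evict_and_rebase(cache)

# After eviction, the key that sat at position p should look like it sits at p - delta.
max_err = 0.0
for new_pos, key in enumerate(rebased):
    expected = rope(raw[new_pos + delta], new_pos)
    max_err = max(max_err, max(abs(a - b) for a, b in zip(key, expected)))
```

Because rotations compose additively, applying $R(-\Delta)$ to an already-rotated key is equivalent to having encoded it at the shifted position, so the surviving keys never need to be recomputed from scratch.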

3. Proactive response gate and training objective

The decision to speak is mediated by the <|FLAG|> token. Let $h_t$ denote the hidden state corresponding to <|FLAG|>. The model computes

$p_t = \sigma\big(g_\phi(h_t)\big)$

and then

$a_t = \mathbb{1}[p_t \ge T],$

where $g_\phi$ is the lightweight response head, $\sigma$ is the sigmoid function, $p_t$ is the speaking probability, $T$ is a fixed threshold, and $a_t = 1$ indicates that the model should speak while $a_t = 0$ indicates silence (Yan et al., 3 Mar 2026).
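A gate of this kind can be sketched in a few lines. The single linear layer followed by a sigmoid is an assumption made for illustration (the source only describes the head as lightweight), and `speak_probability` and `gate` are hypothetical names.

```python
import math
from typing import Sequence


def speak_probability(hidden: Sequence[float],
                      weights: Sequence[float],
                      bias: float) -> float:
    """Map the flag-token hidden state to a speaking probability in (0, 1)."""
    logit = sum(h * w for h, w in zip(hidden, weights)) + bias
    return 1.0 / (1.0 + math.exp(-logit))


def gate(hidden: Sequence[float],
         weights: Sequence[float],
         bias: float,
         threshold: float) -> bool:
    """Return True (speak) when the probability reaches the threshold."""
    return speak_probability(hidden, weights, bias) >= threshold


hidden_state = [1.0, -2.0]   # toy hidden state (assumption)
head_weights = [0.5, 0.25]   # toy head parameters (assumption)
prob = speak_probability(hidden_state, head_weights, bias=0.0)
```

Since the decision is a comparison on a scalar score, the threshold can be tuned after training without touching decoding hyperparameters such as temperature.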

The paper explicitly states that this mechanism is not implemented as prediction of a special <|SILENCE|> token, because token-based silence is described as less stable and more sensitive to decoding hyperparameters. This choice is technically significant because it detaches the speak/silence decision from the autoregressive lexical distribution and therefore provides more direct control over response timing (Yan et al., 3 Mar 2026).

Training combines utterance quality and response behavior in a multi-term objective:

$\mathcal{L} = \mathcal{L}_{\text{LM}} + \lambda\,\mathcal{L}_{\text{resp}},$

where $\mathcal{L}_{\text{LM}}$ is the causal language modeling loss for utterance quality, $\mathcal{L}_{\text{resp}}$ is the response-behavior loss for speaking or silence decisions, and $\lambda$ is a weighting coefficient held fixed in the implementation (Yan et al., 3 Mar 2026).

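The response loss has two components. The first is a weighted binary classification loss over ground-truth labels $y_t \in \{0, 1\}$, indicating whether the human commentary speaks at time $t$. To emphasize state transitions, the method uses transition-aware weights $w_t$ that up-weight time steps at which the label changes ($y_t \neq y_{t-1}$), and then defines the weighted BCE over $N$ chunks as

$\mathcal{L}_{\text{BCE}} = -\frac{1}{N} \sum_{t=1}^{N} w_t \left[ y_t \log p_t + (1 - y_t) \log (1 - p_t) \right],$

where $p_t$ is the predicted speaking probability. The second component is a stability regularizer $\mathcal{L}_{\text{reg}}$, which enforces smooth probabilities during persistent speak or silence regions and aligns the model’s average speaking rate with the human baseline. The full response loss $\mathcal{L}_{\text{resp}}$ combines the weighted BCE term with this regularizer.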

According to the paper, this design is intended to jointly improve trigger accuracy, speaking-rate control, reduced jitter or oscillation, and shorter, more usable outputs (Yan et al., 3 Mar 2026).
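To make the roles of the two response-loss components concrete, the sketch below implements one possible version of each. The exact functional forms are assumptions: the transition `boost` value, the squared-difference smoothness term, and the squared speaking-rate gap are illustrative choices, not the paper's definitions.

```python
import math
from typing import List


def transition_weights(labels: List[int], boost: float = 2.0) -> List[float]:
    """Weight 1.0 by default, `boost` where the label differs from the previous one."""
    weights = [1.0]
    for prev, cur in zip(labels, labels[1:]):
        weights.append(boost if cur != prev else 1.0)
    return weights


def weighted_bce(probs: List[float], labels: List[int],
                 weights: List[float], eps: float = 1e-7) -> float:
    """Mean binary cross-entropy with per-step weights."""
    total = 0.0
    for p, y, w in zip(probs, labels, weights):
        p = min(max(p, eps), 1.0 - eps)  # clamp to keep log() finite
        total -= w * (y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(labels)


def stability_penalty(probs: List[float], labels: List[int]) -> float:
    """Smoothness inside persistent regions plus a speaking-rate gap term."""
    smooth = sum((p1 - p0) ** 2
                 for p0, p1, y0, y1 in zip(probs, probs[1:], labels, labels[1:])
                 if y0 == y1)
    rate_gap = (sum(probs) / len(probs) - sum(labels) / len(labels)) ** 2
    return smooth / max(len(probs) - 1, 1) + rate_gap


labels = [0, 0, 1, 1, 0]
weights = transition_weights(labels)

# Same-sized mistake, once at a transition (index 2) and once in a steady region (index 3).
miss_at_transition = weighted_bce([0.0, 0.0, 0.5, 1.0, 0.0], labels, weights)
miss_in_steady = weighted_bce([0.0, 0.0, 1.0, 0.5, 0.0], labels, weights)
```

Under these assumptions an uncertain prediction costs more at a speak/silence boundary than inside a steady region, and a probability trace that matches the labels exactly incurs no stability penalty.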

In comparison with adjacent proactive-agent formulations, this resembles a specialized instantiation of the general “When + How” decomposition used in proactive assistance benchmarks, though Proact-VL implements the timing module as an embedded gate rather than as a standalone classifier (Tang et al., 4 Feb 2026). This suggests a convergence between workflow-oriented proactive assistance and streaming-video companionship around the same core distinction: response triggering and response generation are related but not identical subproblems.

4. Live Gaming Benchmark and evaluation design

The empirical setting for Proact-VL is the Live Gaming Benchmark and Live Gaming Benchmark-Streaming (Yan et al., 3 Mar 2026). The dataset is built from 561 hours of English gaming commentary across 12 popular game titles spanning multiple genres and sourced from YouTube, with collection criteria prioritizing high popularity, strong engagement, expert broadcasts or influencers, and high narrative density. Videos are archived at 420p for efficiency (Yan et al., 3 Mar 2026).

The training split covers 10 games with a video-wise partition of 80% train / 10% test / 10% reserved. Training clips are 36-second clips with 18-second overlap, and for Minecraft, extra 60-second clips are also used. The total number of training samples is 128,000 (Yan et al., 3 Mar 2026).
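The clip layout follows from simple arithmetic: a 36-second window with 18-second overlap advances by an 18-second stride. The helper below is a hypothetical sketch; dropping a trailing partial window is an assumption, since the source does not say how video tails are handled.

```python
from typing import List, Tuple


def clip_windows(duration_s: int, clip_len: int = 36,
                 overlap: int = 18) -> List[Tuple[int, int]]:
    """Return (start, end) second offsets of overlapping fixed-length clips."""
    stride = clip_len - overlap
    windows = []
    start = 0
    while start + clip_len <= duration_s:  # skip a trailing partial clip (assumption)
        windows.append((start, start + clip_len))
        start += stride
    return windows


windows = clip_windows(90)
```

For a 90-second video this yields four clips, each sharing its second half with the first half of the next one.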

The benchmark scenarios and test suites are as follows:

Component	Scope	Key details
Live Gaming Benchmark	Clip-level evaluation	In-domain subset: 10 games, 2,640 samples
Common-and-general subset	Clip-level generalization	Ego4D Goal-Step: 134 samples; Black Myth: Wukong: 240 samples
Live Gaming Benchmark-Streaming	Long-horizon evaluation	10 videos total; durations from 30 minutes to 2 hours

The three benchmark scenarios are Solo Commentary, Co-commentary, and User Guidance. The long-horizon streaming test set contains one full video per game from Solo and Co-commentary settings, yielding 10 videos total with durations ranging from 30 minutes to 2 hours (Yan et al., 3 Mar 2026).

Sampling is stratified by response rate to ensure diverse response density: 0–30%: 60 clips/game, 30–70%: 120 clips/game, and 70–100%: 60 clips/game (Yan et al., 3 Mar 2026). This is a notable design decision because it prevents evaluation from collapsing into a trivial sparse-trigger regime dominated by silence.
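A stratified draw with these per-game quotas might look like the following sketch. The bucket names, the half-open boundary handling at 30% and 70%, and the fixed random seed are assumptions; the source gives only the ranges and the quotas.

```python
import random
from typing import Dict, List, Tuple

QUOTAS = {"low": 60, "mid": 120, "high": 60}  # clips per game, from the text


def bucket(rate: float) -> str:
    """Assign a response rate in [0, 1] to a stratum (boundaries are an assumption)."""
    if rate < 0.3:
        return "low"
    if rate < 0.7:
        return "mid"
    return "high"


def stratified_sample(clips: List[Tuple[int, float]],
                      quotas: Dict[str, int] = QUOTAS,
                      seed: int = 0) -> Dict[str, List[int]]:
    """Sample clip ids per stratum, capped by the quota and the pool size."""
    rng = random.Random(seed)
    pools: Dict[str, List[int]] = {name: [] for name in quotas}
    for clip_id, rate in clips:
        pools[bucket(rate)].append(clip_id)
    return {name: rng.sample(pools[name], min(quota, len(pools[name])))
            for name, quota in quotas.items()}


# Toy pool for one game: 1,000 clips with response rates 0.00 .. 0.99.
clips = [(i, (i % 100) / 100) for i in range(1000)]
sample = stratified_sample(clips)
```

Fixing the quota per stratum, rather than sampling clips uniformly, is what keeps mostly-silent clips from dominating the test set.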

The evaluation protocol separates text quality from proactivity or timing quality. Text quality is measured with CC (Closed Captions), defined as win-rate versus Gemini 2.5 Pro; LiveU, an LLM-based second-level streaming usability score; and FinalQ, which evaluates the quality of the full concatenated script. Proactivity quality is measured with TimeDiff, the temporal offset between predicted response and the ground-truth response window; PAUC, a proactive interaction metric based on trajectory-like accumulation; and F1, event-level response detection over the full timeline (Yan et al., 3 Mar 2026). The primary judge model is GPT-5.1, with alternate judge setups tested in the appendix. Main results use one fixed response threshold, while other analyses often use a different value (Yan et al., 3 Mar 2026).
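Two of the timing metrics can be approximated with simple stand-ins. The versions below are deliberate simplifications and not the benchmark's definitions: F1 is computed over per-second speak decisions rather than matched events, and the time difference is the distance from each predicted onset to the nearest ground-truth window (zero when inside one). Function names are hypothetical.

```python
from typing import List, Tuple


def f1_score(pred: List[int], truth: List[int]) -> float:
    """F1 over per-second binary speak decisions (simplified)."""
    tp = sum(1 for p, t in zip(pred, truth) if p and t)
    fp = sum(1 for p, t in zip(pred, truth) if p and not t)
    fn = sum(1 for p, t in zip(pred, truth) if not p and t)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def mean_time_diff(pred_times: List[float],
                   windows: List[Tuple[float, float]]) -> float:
    """Mean distance from each predicted onset to its nearest ground-truth window."""
    def distance(t: float, window: Tuple[float, float]) -> float:
        start, end = window
        if start <= t <= end:
            return 0.0
        return min(abs(t - start), abs(t - end))

    return sum(min(distance(t, w) for w in windows) for t in pred_times) / len(pred_times)


f1 = f1_score([0, 1, 1, 0, 1], [0, 1, 0, 0, 1])
offset = mean_time_diff([2.0, 10.0], [(1.0, 3.0), (6.0, 8.0)])
```

In the toy example one spurious trigger lowers precision to 2/3 while recall stays at 1, and one of the two predicted onsets lands two seconds after the nearest window.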

5. Empirical performance and ablations

On the main benchmark, Proact-VL is reported as best overall on Live Gaming Commentary (Yan et al., 3 Mar 2026). For text quality, Table 1 reports the following values:

Scenario	CC	LiveU	FinalQ
Solo Commentary	53.62	6.89	5.48
Co-commentary	51.46	5.15	3.59
Guidance	42.60	7.52	6.02
Overall	49.23	6.52	5.03

These results are reported to outperform GPT-4o, Gemini 2.5 Pro, VideoLLM-online, MMDuet, Livestar, LiveCC-7B-Base/Instruct, and Streaming VLM on the benchmark (Yan et al., 3 Mar 2026).

For proactivity quality, Table 2 reports Overall F1: 64.87, Overall TimeDiff: 1.71, and Overall PAUC: 18.10 (Yan et al., 3 Mar 2026). For Proact-VL, the paper notes particularly strong F1 in co-commentary and guidance and low TimeDiff relative to many baselines. For the baselines, it notes poor timing alignment in prior proactive methods and weaker trigger quality in real-time streaming models that are otherwise stable (Yan et al., 3 Mar 2026).

On the common-and-general commentary setting, Proact-VL retains strong generalization. Table 3 gives Ego4D: CC 63.43, F1 45.82 and Black Myth: Wukong: CC 55.21, F1 60.06 (Yan et al., 3 Mar 2026). The paper interprets this as evidence of transfer beyond the training games.

The long-horizon streaming benchmark shows that Proact-VL is more stable than Streaming VLM: text quality remains consistent over longer inference horizons, and response quality degrades only mildly and then stabilizes. Runtime is reported to be low enough to support roughly 10–15 FPS streams in practice under a 0.3-second commentary budget (Yan et al., 3 Mar 2026). This is notable because low latency is a first-class design target rather than an incidental systems metric.

Ablation studies emphasize the importance of the response-loss design. Table 6 shows that removing either loss term hurts performance, and removing the regularizer is especially damaging: F1 drops sharply, TimeDiff worsens significantly, and CC also falls (Yan et al., 3 Mar 2026). Threshold ablations show a tradeoff between coverage / recall / F1 and conservativeness / CC: higher thresholds yield fewer triggers and lower F1, while CC often improves with more conservative triggering; thresholds in the range 0.3–0.5 are reported as a good practical balance (Yan et al., 3 Mar 2026). Window-size ablation indicates that larger context windows generally improve CC up to a point while F1 stays fairly stable, with the best practical range around 16384–24576 tokens (Yan et al., 3 Mar 2026).
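The mechanical side of the threshold tradeoff is easy to see on toy data: raising the threshold can only reduce the number of triggers. The probabilities below are invented for illustration and `trigger_rate` is a hypothetical helper.

```python
from typing import List


def trigger_rate(probs: List[float], threshold: float) -> float:
    """Fraction of chunks whose speaking probability reaches the threshold."""
    return sum(p >= threshold for p in probs) / len(probs)


# Invented per-chunk speaking probabilities (assumption, not benchmark data).
probs = [0.05, 0.2, 0.35, 0.45, 0.6, 0.8, 0.95, 0.3]
rates = {t: trigger_rate(probs, t) for t in (0.3, 0.5, 0.7)}
```

Fewer triggers reduce recall-driven metrics such as F1, while the utterances that remain tend to come from higher-confidence moments, which is consistent with the reported CC trend.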

Training-data ablation shows that removing any major data source hurts performance: without gaming data, gaming performance drops; without Ego4D, egocentric or general guidance performance drops strongly; without Live-SFT, live-sports or streaming performance drops (Yan et al., 3 Mar 2026). The paper uses this to support the claim that the model benefits from a multi-source mixture.

6. Base models, implementation, and operational characteristics

The framework is tested with multiple backbones: Qwen2-VL, Qwen2.5-VL, Qwen3-VL, and LiveCC-Base (Yan et al., 3 Mar 2026). Although the Qwen-based backbones can achieve different tradeoffs, the paper’s main conclusion is that the Proact-VL framework itself improves proactive behavior consistently across bases (Yan et al., 3 Mar 2026). This is important because it attributes part of the gain to the streaming and gating design rather than solely to the particular pretrained backbone.

The reported implementation initializes from LiveCC-7B-Base and trains with a cosine learning-rate scheduler, batch size 64, 2,000 training steps, and gradient clipping at 1.0 (Yan et al., 3 Mar 2026). Training cost is reported as about 200 H100 GPU-hours. Video decoding is performed at 2 FPS (Yan et al., 3 Mar 2026).

The visual token budget is specified through three pixel limits: MIN_PIXELS, MAX_PIXELS, and MAX_VIDEO_PIXELS (Yan et al., 3 Mar 2026).

Operationally, the system combines 1-second chunking for decision and generation with 2 FPS video decoding for visual input processing (Yan et al., 3 Mar 2026). A plausible implication is that this reflects a compromise between temporal responsiveness and computational tractability. The framework is therefore not simply a captioning model run repeatedly on video frames; it is a streaming control system in which context accumulation, gating, and output-length control are jointly optimized.
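The relationship between decoding rate and chunk size can be made explicit with a short sketch: at 2 FPS, each 1-second chunk carries two frames. The helper name and the assumption that frame timestamps start at zero are illustrative.

```python
from typing import Dict, List


def frames_per_chunk(duration_s: int, fps: int = 2,
                     chunk_s: int = 1) -> List[List[float]]:
    """Group decoded frame timestamps into consecutive fixed-length chunks."""
    chunks: Dict[int, List[float]] = {}
    for i in range(duration_s * fps):
        timestamp = i / fps
        chunks.setdefault(int(timestamp // chunk_s), []).append(timestamp)
    return [chunks[k] for k in sorted(chunks)]


grouped = frames_per_chunk(3)
```

Any event that begins and ends between two sampled timestamps is invisible to the model, which is one way to read the sparse-sampling limitation the paper acknowledges.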

The baseline taxonomy used in the paper groups comparison systems into three categories: commercial offline models such as GPT-4o and Gemini 2.5 Pro; proactive models such as VideoLLM-online, MMDuet, and Livestar; and real-time streaming models such as LiveCC-7B-Base, LiveCC-7B-Instruct, and Streaming VLM (Yan et al., 3 Mar 2026). Proact-VL is reported to outperform these baselines on both response quality and timing or proactivity (Yan et al., 3 Mar 2026).

7. Significance, limitations, and relation to proactive-agent research

The main contributions of Proact-VL are stated as: a new proactive real-time framework for video companions; a large-scale Live Gaming Dataset and Benchmark for commentary and guidance; a chunk-wise streaming plus proactive gate design with low-latency inference; a multi-term training objective that learns both what to say and when to speak; and strong empirical results on timing, quality, and long-stream stability (Yan et al., 3 Mar 2026).

Its broader significance lies in making proactivity an explicit modeling target in VideoLLMs. Rather than assuming that every time step requires textual output, Proact-VL makes silence a learned action mediated by a separate decision head. This aligns with a broader movement in proactive AI research that treats intervention timing as a primary variable rather than a side effect of generation (Tang et al., 4 Feb 2026). In workflow assistance, false positives are interpreted as interruption cost and false negatives as missed opportunities (Tang et al., 4 Feb 2026); in streaming commentary and guidance, the analogous balance appears as the tradeoff between over-talking and missing salient moments.

The paper also notes several limitations. Commentary can still be only weakly grounded and sometimes hallucinated. The system uses sparse 2 FPS sampling, which can miss transient events. It struggles with fine-grained OCR / small HUD text and numerical reasoning. It can fail on cluttered scenes and produce repetitive filler. The authors state that better high-FPS/high-resolution streaming encoders and stronger entity grounding are needed (Yan et al., 3 Mar 2026). These limitations are material because they constrain both the epistemic reliability of the generated commentary and the scope of environments in which the method can operate robustly.

A common misconception would be to treat Proact-VL primarily as a video-captioning system with improved latency. The paper instead presents it as a proactive streaming VideoLLM whose distinctive technical features are chunk-wise inference, a learned speak-or-silence gate, and stability-aware response training (Yan et al., 3 Mar 2026). Another plausible misconception would be that proactive behavior can be reduced to prompt engineering or a special silence token. The model’s design explicitly rejects token-based silence and instead introduces a dedicated gate trained with transition-aware supervision and regularization (Yan et al., 3 Mar 2026).

Within the emerging literature on proactive agents, Proact-VL and ProAgentBench illuminate complementary settings. ProAgentBench studies proactive multimodal assistance in continuous working workflows, using real user interaction logs and a hierarchical decomposition into When to Assist and How to Assist (Tang et al., 4 Feb 2026). Proact-VL studies proactive real-time video companionship, where the same underlying distinction is implemented through a chunk-level gate and short utterance generation (Yan et al., 3 Mar 2026). Taken together, these works suggest that proactive AI is increasingly being formulated as a temporally structured decision problem over continuous multimodal streams, in which response timing, intervention density, and content utility must be optimized jointly rather than sequentially.


References

Proact-VL: A Proactive VideoLLM for Real-Time AI Companions (2026)

Proactive Agents, Long-term User Context, VLM Annotation, Privacy Protection, Human-Computer Interaction (2026)
