
Time-CoT: Streaming Video Reasoning

Updated 9 March 2026
  • Time-CoT is a streaming paradigm that enables real-time Chain-of-Thought reasoning in LVLMs, facilitating low-latency and temporally aligned multi-modal processing.
  • It employs a dual KV-cache mechanism to decouple visual encoding from textual decoding, drastically reducing the time-to-first-token compared to conventional batch approaches.
  • The framework uses streaming attention masks and modality-decoupled positional encodings to enforce temporal causality, enhancing performance in applications like surveillance and robotics.

Time Chain-of-Thought (Time-CoT), and specifically the "Think-as-You-See" (TaYS) framework, is a streaming paradigm for Chain-of-Thought reasoning in Large Vision-Language Models (LVLMs). Traditional batch processing requires the entire video sequence to be available before any inference can begin. Time-CoT instead lets an LVLM interleave multi-modal perception and natural language reasoning in real time, so that reasoning is temporally aligned with the incoming frames and latency stays low. The framework centers on aligned reasoning units, a dual KV-cache structure, streaming attention masks, modality-decoupled positional encodings, and parallelized CoT generation, which together yield improved accuracy and significantly reduced latency on video reasoning tasks (Zhang et al., 3 Mar 2026).

1. Streaming CoT Problem Formulation and Aligned Reasoning Units

Conventional offline video Chain-of-Thought (CoT) reasoning presents the full video $\mathcal V$ to the model encoder before any language tokens are decoded:

$$h_i = \mathrm{Decoder}(y_{<i};\,\mathrm{Enc}(\mathcal V))$$

This batch-style approach is unsuitable for real-world streaming scenarios. The streaming CoT regime instead restricts the encoder's input to the prefix $\mathcal V_{\le t} = \{F_1,\dots,F_t\}$ available at the current time $t$, with reasoning outputs $h_i^t$ interleaved stepwise as frames are received:

$$h_i^t = \mathrm{Decoder}(y_{<i}^t;\, \mathrm{Enc}(\mathcal V_{\le t}), C_{<t})$$

At each step, sampled frames $F_t$ are associated with $(Q_t, R_t, A_t)$ triplets (Question, Reasoning, Answer), minimally segmented by an <EOT> delimiter. This enforces that, at inference, the model emits temporally aligned reasoning units immediately after each frame arrives. The dataset for such streaming trajectory supervision is derived by temporally segmenting keyframe annotations.
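The alignment of reasoning units to frames can be sketched as follows. This is only an illustration under stated assumptions: the `ReasoningUnit` class, the `build_stream_targets` function, the way the triplet is serialized, and the choice to emit a bare delimiter for unannotated frames are all hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass

EOT = "<EOT>"  # delimiter named in the text


@dataclass
class ReasoningUnit:
    frame_index: int  # t, the 1-based index of the frame F_t this unit follows
    question: str     # Q_t
    reasoning: str    # R_t
    answer: str       # A_t


def build_stream_targets(units, num_frames):
    """Return one (t, target_text) pair per frame, in arrival order."""
    by_frame = {u.frame_index: u for u in units}
    targets = []
    for t in range(1, num_frames + 1):
        unit = by_frame.get(t)
        if unit is None:
            # Assumption: a frame with no annotation yields only the delimiter.
            text = EOT
        else:
            # Assumption: the triplet is serialized as Q, R, A and closed by <EOT>.
            text = " ".join([unit.question, unit.reasoning, unit.answer, EOT])
        targets.append((t, text))
    return targets


units = [ReasoningUnit(2, "What changed?", "A pan is now on the stove.", "Cooking has started.")]
for t, text in build_stream_targets(units, num_frames=3):
    print(t, text)
```

Each target depends only on annotations attached to frames up to and including $t$, which mirrors the prefix restriction $\mathcal V_{\le t}$ in the streaming formulation.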

2. Parallel Dual KV-Cache Mechanism

TaYS achieves true concurrency by decoupling visual encoding from textual decoding via a dual KV-cache:

  • $\mathcal C_v$: a read-only video (visual) cache storing the encoded visual tokens of all arrived frames,
  • $\mathcal C_r$: a growing text cache accumulating all previously generated reasoning tokens.

At each new frame $t$:

  • $\mathcal C_v^{(t)} = \mathcal C_v^{(t-1)} \cup \mathrm{Enc}(F_t)$ (frame encoding is non-blocking and independent of reasoning state),
  • Context for decoding is constructed as $\mathrm{Merge}(\mathcal C_v^{(t)}, \mathcal C_r^{(t-1)})$,
  • $R_t \leftarrow \mathrm{Decoder}(\mathrm{Context})$,
  • $\mathcal C_r^{(t)} = \mathcal C_r^{(t-1)} \cup \mathrm{Dec}(R_t)$.

Merge operations are pointer-based and do not require synchronization, allowing reasoning token generation to proceed without waiting for visual cache updates. In contrast, interleaved single-cache paradigms stall token generation until new visual embeddings are available.
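The per-frame update rules above can be sketched with plain Python lists standing in for key/value tensors. This is a minimal sequential illustration, not the TaYS implementation: `DualCache`, `toy_encode`, and `toy_decode` are hypothetical names, and the toy encoder and decoder only produce labelled strings so the cache growth is easy to inspect.

```python
class DualCache:
    """Two separate caches: visual tokens (C_v) and reasoning tokens (C_r)."""

    def __init__(self):
        self.visual = []     # C_v: grows by one frame's tokens per step
        self.reasoning = []  # C_r: grows by one reasoning segment per step

    def add_frame(self, frame_tokens):
        # C_v^(t) = C_v^(t-1) U Enc(F_t)
        self.visual.extend(frame_tokens)

    def context(self):
        # Merge(C_v, C_r): returns references to both caches, no copying.
        return self.visual, self.reasoning

    def add_reasoning(self, tokens):
        # C_r^(t) = C_r^(t-1) U Dec(R_t)
        self.reasoning.extend(tokens)


def toy_encode(frame_id):
    # Hypothetical encoder: one visual token per frame.
    return [f"v{frame_id}"]


def toy_decode(context):
    # Hypothetical decoder: one reasoning token naming how many frames it saw.
    visual, _reasoning = context
    return [f"r{len(visual)}", "<EOT>"]


def run_stream(frame_ids):
    cache = DualCache()
    for frame_id in frame_ids:
        cache.add_frame(toy_encode(frame_id))
        segment = toy_decode(cache.context())
        cache.add_reasoning(segment)
    return cache


cache = run_stream([1, 2, 3])
print(cache.visual)
print(cache.reasoning)
```

Because the two caches are separate objects, an encoder could append to `visual` without touching `reasoning`. The loop here runs the steps one after another purely for readability.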

3. Streaming Attention Masks and Modality-Decoupled Positional Encodings

Temporal causality is enforced through a two-part streaming attention mask $\widetilde M(i,j)$ over the joint sequence:

  • For reasoning token $i > N_v$ attending to visual token $j \le N_v$, access to frames is restricted via $j \le i - N_v$:

$$\widetilde M(i,j) = \begin{cases} -\infty, & i > N_v,\ j \le N_v,\ j > i - N_v \\ M_{\mathrm{causal}}(i,j), & \text{otherwise} \end{cases}$$

where $M_{\mathrm{causal}}$ is the standard autoregressive mask.
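The mask definition can be written out directly. The sketch below builds $\widetilde M$ as a nested list using the same 1-based indices as the formula; the function names are illustrative, and the layout assumed here (all $N_v$ visual tokens first, then the reasoning tokens) is the one implied by the conditions $i > N_v$ and $j \le N_v$.

```python
import math

NEG_INF = -math.inf


def causal_mask_entry(i, j):
    """Standard autoregressive mask: position i may attend to positions j <= i."""
    return 0.0 if j <= i else NEG_INF


def streaming_mask(n_visual, n_reasoning):
    """Build the streaming mask over n_visual visual then n_reasoning reasoning tokens."""
    n_total = n_visual + n_reasoning
    mask = []
    for i in range(1, n_total + 1):        # query index, 1-based as in the formula
        row = []
        for j in range(1, n_total + 1):    # key index, 1-based
            blocked = i > n_visual and j <= n_visual and j > i - n_visual
            row.append(NEG_INF if blocked else causal_mask_entry(i, j))
        mask.append(row)
    return mask


for row in streaming_mask(n_visual=3, n_reasoning=3):
    print(["x" if v == NEG_INF else "." for v in row])
```

With one visual token per frame and one reasoning token per step, the k-th reasoning token can attend to the first k visual tokens and to all earlier reasoning tokens, while visual tokens keep the ordinary causal pattern.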

Typical rotary position encoding (RoPE) indexes all tokens along a single axis, causing cross-modal relative offsets to drift as frames accumulate. TaYS instead assigns reasoning and visual tokens to independent positional axes:

  • $\mathrm{pos}(v_s) = s$
  • $\mathrm{pos}(r_t) = t$

The effect is a stable cross-modal RoPE: the dot product depends only on the relative time offset $t-s$ between reasoning unit $r_t$ and visual token $v_s$, regardless of prior sequence length.
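The offset-only dependence follows from how rotary encodings work: rotating a query by an angle proportional to $t$ and a key by an angle proportional to $s$ leaves a dot product that depends on $t-s$. The sketch below checks this numerically with a small pure-Python RoPE; the `rope` and `score` helpers and the example vectors are illustrative assumptions, not code from the paper.

```python
import math


def rope(vec, pos, base=10000.0):
    """Rotate consecutive pairs of vec by pos * base**(-k/d), the usual RoPE frequencies."""
    d = len(vec)
    out = []
    for k in range(0, d, 2):
        angle = pos * base ** (-k / d)
        c, s = math.cos(angle), math.sin(angle)
        x, y = vec[k], vec[k + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out


def score(query, key, query_pos, key_pos):
    """Attention logit between a rotated query and a rotated key."""
    return sum(a * b for a, b in zip(rope(query, query_pos), rope(key, key_pos)))


q = [1.0, 0.0, 1.0, 0.0]  # stands in for a reasoning-token query
k = [1.0, 0.0, 1.0, 0.0]  # stands in for a visual-token key

# Decoupled axes: reasoning unit t and visual token s keep their own indices,
# so the same offset t - s = 2 gives the same score early or late in the stream.
early = score(q, k, query_pos=5, key_pos=3)
late = score(q, k, query_pos=105, key_pos=103)
print(abs(early - late) < 1e-9)

# Single shared axis (for contrast): the reasoning token's index is shifted by the
# number of visual tokens placed before it, so the score changes as frames accumulate.
shared_short = score(q, k, query_pos=10 + 5, key_pos=3)
shared_long = score(q, k, query_pos=100 + 5, key_pos=3)
print(abs(shared_short - shared_long) > 1e-3)
```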

4. Parallelized CoT Generation: Training and Stream-Parallel Inference

Training uses stream-constrained trajectories: at each frame $t$, the model is provided with all visual tokens up to $F_t$ and the reasoning tokens generated so far, masked by streaming attention. The stream-constrained objective requires the model to emit $R_t$ (the current reasoning segment) followed by <EOT>, ensuring no access to future frames during generation.

At inference, the dual KV-cache design enables near-zero decoder-level time-to-first-token (TTFT). Empirical TTFTs are:

  • Batch TTFT $\approx 10.6$ s (must encode all frames before decoding)
  • Interleaved TTFT $\approx 0.03$ s (decodes only after frame encoding)
  • TaYS TTFT $\approx 1 \times 10^{-6}$ s (encoding and decoding proceed fully in parallel)

TTFT in TaYS is given by

$$\mathrm{TTFT} = T_{\mathrm{encode}(\text{first frame})} + T_{\mathrm{decode}(\text{first token})}$$

with the terms overlapping completely, resulting in minimal latency.
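The overlap of encoding and decoding can be illustrated with two threads that share a visual cache. This is a toy sketch under stated assumptions: the function names, the sleep-based stand-ins for encoder and decoder work, and the lock-protected list are all hypothetical, and the delays are arbitrary rather than measured values.

```python
import threading
import time


def encoder_worker(frame_ids, visual_cache, lock, frame_delay):
    """Append one encoded frame at a time, independent of the decoder's progress."""
    for frame_id in frame_ids:
        time.sleep(frame_delay)  # stand-in for frame arrival plus encoding time
        with lock:
            visual_cache.append(f"v{frame_id}")


def decode_stream(num_steps, visual_cache, lock, step_delay):
    """Emit tokens against whatever visual context is cached at each step."""
    emitted = []
    for step in range(num_steps):
        with lock:
            visible = len(visual_cache)  # snapshot only; the decoder never waits for frames
        emitted.append((step, visible))
        time.sleep(step_delay)  # stand-in for per-token decode time
    return emitted


def run(frame_ids, num_steps, frame_delay=0.01, step_delay=0.005):
    visual_cache, lock = [], threading.Lock()
    encoder = threading.Thread(
        target=encoder_worker, args=(frame_ids, visual_cache, lock, frame_delay)
    )
    encoder.start()
    emitted = decode_stream(num_steps, visual_cache, lock, step_delay)
    encoder.join()
    return visual_cache, emitted


cache, emitted = run([1, 2, 3], num_steps=4)
print(cache)
print(emitted)
```

The number of frames visible at each decode step depends on thread timing, so the second printed list can vary between runs. The point is only that the decoder loop starts immediately instead of blocking on the encoder.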

5. Empirical Evaluation: Benchmarks, Accuracy, and Latency

Evaluation on an extended VideoEspresso benchmark (including event dynamics, causal reasoning, thematic understanding, cooking-process analysis, and traffic analysis) shows TaYS ahead of the batch regime at both model sizes and ahead of the interleaved regime at 7B; at 3B, interleaved fine-tuning scores marginally higher (33.96% vs. 33.45%).

Fine-tuning Qwen2.5-VL on identical streaming CoT trajectories yields:

Model Size   Batch SFT   Interleaved SFT   TaYS SFT
3B           29.18%      33.96%            33.45%
7B           30.38%      34.98%            36.86%
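The gaps between regimes are easier to compare as percentage-point differences. The short calculation below uses only the figures in the table; the dictionary layout and the `gain` helper are illustrative.

```python
# Scores copied from the table above (percent).
results = {
    "3B": {"Batch": 29.18, "Interleaved": 33.96, "TaYS": 33.45},
    "7B": {"Batch": 30.38, "Interleaved": 34.98, "TaYS": 36.86},
}


def gain(size, regime, baseline):
    """Percentage-point difference between two regimes at one model size."""
    return round(results[size][regime] - results[size][baseline], 2)


for size in results:
    print(size, "TaYS vs Batch:", gain(size, "TaYS", "Batch"))
    print(size, "TaYS vs Interleaved:", gain(size, "TaYS", "Interleaved"))
```

At 7B, TaYS leads Batch by 6.48 points and Interleaved by 1.88 points. At 3B, it leads Batch by 4.27 points but trails Interleaved by 0.51 points.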

In subjective GPT-5 rankings, TaYS provides the preferred output in 43.7% of comparisons (vs. 31.4% for Batch, 21.7% for Interleaved).

Latency, evaluated at 1–5 FPS, is as follows:

  • TaYS maintains decoder-level TTFT $\approx 1 \times 10^{-6}$ s and stable end-to-end delay $\approx 12$ s across frame rates.
  • Batch TTFT is fixed at $\approx 10.6$ s.
  • Interleaved delay increases with frame rate (up to 20 s at 5 FPS).

Temporal analyses indicate that TaYS's reasoning segments temporally peak within 0.69 s of annotated keyframes versus 1.52 s for the Interleaved regime. TaYS exhibits smoother semantic transitions, with fewer repeated peaks in reasoning similarity scores.

6. Architectural and Practical Implications

TaYS's streaming attention, modality-decoupled RoPE, and dual KV-cache design collectively enable a “think-while-watching” workflow in LVLMs. The framework achieves higher efficiency by allowing concurrent visual encoding and language token generation, providing reasoning that is both low-latency and temporally synchronized. This directly addresses applications in surveillance, robotic teleoperation, and live video analysis, where real-time responsiveness and temporally fine-grained reasoning are essential (Zhang et al., 3 Mar 2026).

A plausible implication is that further advances in streaming neural architectures will depend critically on cache separation and modality-aware positional encoding to achieve truly concurrent, low-latency multimodal reasoning.

References (1)
