Time-CoT: Streaming Video Reasoning

Updated 9 March 2026

Time-CoT is a streaming paradigm that enables real-time Chain-of-Thought reasoning in LVLMs, facilitating low-latency and temporally aligned multi-modal processing.
It employs a dual KV-cache mechanism to decouple visual encoding from textual decoding, drastically reducing the time-to-first-token compared to conventional batch approaches.
The framework uses streaming attention masks and modality-decoupled positional encodings to enforce temporal causality, enhancing performance in applications like surveillance and robotics.

Time Chain-of-Thought (Time-CoT), and specifically the "Think-as-You-See" (TaYS) framework, constitutes a streaming paradigm for Chain-of-Thought reasoning in Large Vision-Language Models (LVLMs). In contrast to traditional batch processing, where the entire video sequence must be available before any inference is possible, Time-CoT enables temporally aligned, low-latency reasoning by equipping LVLMs to interleave multi-modal perception and natural language reasoning in real time. The framework centers on aligned reasoning units, a dual KV-cache structure, streaming attention masks, modality-decoupled positional encodings, and parallelized CoT generation, which together yield improved accuracy and significantly reduced latency for video reasoning tasks (Zhang et al., 3 Mar 2026).

1. Streaming CoT Problem Formulation and Aligned Reasoning Units

Conventional offline video Chain-of-Thought (CoT) reasoning presents the full video $\mathcal V$ to the model encoder before any language tokens are decoded:

$h_i = \mathrm{Decoder}(y_{<i};\,\mathrm{Enc}(\mathcal V))$

This batch-style approach is unsuitable for real-world streaming scenarios. The streaming CoT regime instead restricts the encoder’s input to the prefix $\mathcal V_{\le t} = \{F_1,\dots,F_t\}$ available at the current time $t$, with reasoning outputs $h_i^t$ interleaved stepwise as frames are received:

$h_i^t = \mathrm{Decoder}(y_{<i}^t;\, \mathrm{Enc}(\mathcal V_{\le t}), C_{<t})$

At each step, sampled frames $F_t$ are associated with $(Q_t, R_t, A_t)$ triplets (Question, Reasoning, Answer), with consecutive units separated by an <EOT> delimiter. This ensures that, at inference, the model emits temporally aligned reasoning units immediately after each frame arrives. The dataset for this streaming trajectory supervision is derived by temporally segmenting keyframe annotations.
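As a rough illustration of how such per-frame units might be laid out as supervision text, the sketch below serializes $(Q_t, R_t, A_t)$ triplets into <EOT>-terminated segments ordered by frame arrival. The `ReasoningUnit` class, the `serialize_units` helper, and the `Q:`/`R:`/`A:` text layout are hypothetical choices made for this example; the source specifies only the triplet structure and the <EOT> delimiter.

```python
from dataclasses import dataclass
from typing import List

EOT = "<EOT>"  # delimiter named in the text


@dataclass
class ReasoningUnit:
    """Hypothetical container for one frame-aligned (Q_t, R_t, A_t) triplet."""
    frame_index: int
    question: str
    reasoning: str
    answer: str


def serialize_units(units: List[ReasoningUnit]) -> List[str]:
    """Return one EOT-terminated segment per unit, ordered by frame arrival.

    The 'Q: ... R: ... A: ...' layout is an assumption for illustration only.
    """
    ordered = sorted(units, key=lambda u: u.frame_index)
    return [
        f"Q: {u.question} R: {u.reasoning} A: {u.answer} {EOT}"
        for u in ordered
    ]


units = [
    ReasoningUnit(2, "What happens next?", "The pan is now on the stove.", "Heating"),
    ReasoningUnit(1, "What is shown?", "A person holds a pan.", "Preparation"),
]
segments = serialize_units(units)
```

Each segment ends with the delimiter, so a model trained on the concatenated segments sees exactly one reasoning unit per arrived frame.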

2. Parallel Dual KV-Cache Mechanism

TaYS achieves true concurrency by decoupling visual encoding from textual decoding via a dual KV-cache:

$\mathcal C_v$: a read-only video (visual) cache storing the encoded visual tokens of all arrived frames,
$\mathcal C_r$: a growing text cache accumulating all previously generated reasoning tokens.

At each new frame $t$:

$\mathcal C_v^{(t)} = \mathcal C_v^{(t-1)} \cup \mathrm{Enc}(F_t)$ (frame encoding is non-blocking and independent of reasoning state),
Context for decoding is constructed as $\mathrm{Merge}(\mathcal C_v^{(t)}, \mathcal C_r^{(t-1)})$,
$R_t \leftarrow \mathrm{Decoder}(\mathrm{Context})$,
$\mathcal C_r^{(t)} = \mathcal C_r^{(t-1)} \cup \mathrm{Dec}(R_t)$.

Merge operations are pointer-based and do not require synchronization, allowing reasoning token generation to proceed without waiting for visual cache updates. In contrast, interleaved single-cache paradigms stall token generation until new visual embeddings are available.
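The per-frame update rules above can be sketched with two plain Python lists standing in for $\mathcal C_v$ and $\mathcal C_r$. This is a minimal, sequential toy and not the TaYS implementation: `DualKVCache`, `run_stream`, `toy_encode`, and `toy_decode` are hypothetical names, tokens are strings rather than key/value tensors, and the actual concurrency between encoder and decoder is not modeled. It shows only the data flow, with the merge returning references rather than copies.

```python
from typing import Callable, List, Sequence, Tuple

Token = str


class DualKVCache:
    """Toy stand-in for the two caches: C_v (visual) and C_r (reasoning text)."""

    def __init__(self) -> None:
        self.visual: List[Token] = []  # C_v: appended to by the encoder only
        self.text: List[Token] = []    # C_r: appended to by the decoder only

    def merge(self) -> Tuple[List[Token], List[Token]]:
        # "Pointer-based" merge: return references to both caches, no copying.
        return self.visual, self.text


def run_stream(
    frames: Sequence[str],
    encode: Callable[[str], List[Token]],
    decode: Callable[[List[Token], List[Token]], List[Token]],
) -> Tuple[List[List[Token]], DualKVCache]:
    cache = DualKVCache()
    segments: List[List[Token]] = []
    for frame in frames:
        cache.visual.extend(encode(frame))  # C_v^(t) = C_v^(t-1) U Enc(F_t)
        visual, text = cache.merge()        # Merge(C_v^(t), C_r^(t-1))
        segment = decode(visual, text)      # R_t
        cache.text.extend(segment)          # C_r^(t) = C_r^(t-1) U Dec(R_t)
        segments.append(segment)
    return segments, cache


def toy_encode(frame: str) -> List[Token]:
    # Hypothetical encoder: one visual token per frame.
    return [f"v:{frame}"]


def toy_decode(visual: List[Token], text: List[Token]) -> List[Token]:
    # Hypothetical decoder: reports how much context it was given.
    return [f"r(seen={len(visual)},prior={len(text)})", "<EOT>"]


segments, cache = run_stream(["f1", "f2", "f3"], toy_encode, toy_decode)
```

Because each cache has a single writer, a concurrent version could let the encoder append to the visual cache while the decoder reads from it, which is the property the dual-cache design relies on.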

3. Streaming Attention Masks and Modality-Decoupled Positional Encodings

Temporal causality is enforced through a two-part streaming attention mask $\widetilde M(i,j)$ over the joint sequence:

For reasoning token $i > N_v$ attending to visual token $j \le N_v$, access to frames is restricted via $j \le i - N_v$:

$\widetilde M(i,j) = \begin{cases} -\infty, & i > N_v,\, j \le N_v,\, j > i-N_v \\ M_{\mathrm{causal}}(i,j), & \text{otherwise} \end{cases}$

where $M_{\mathrm{causal}}$ is the standard autoregressive mask.
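To make the mask concrete, the following sketch builds $\widetilde M$ as an additive mask (0 where attention is allowed, $-\infty$ where it is blocked) over a joint sequence of $N_v$ visual tokens followed by $N_r$ reasoning tokens. It follows the formula literally, using 1-based indices and therefore one visual token per frame and one reasoning token per step; the function names and the list-of-lists representation are assumptions of this example.

```python
from typing import List

NEG_INF = float("-inf")


def causal_mask(i: int, j: int) -> float:
    """Standard autoregressive mask with 1-based indices: attend only to j <= i."""
    return 0.0 if j <= i else NEG_INF


def streaming_mask(i: int, j: int, n_v: int) -> float:
    """Additive streaming mask for query index i and key index j (1-based)."""
    # Reasoning token i > n_v may only see visual tokens with j <= i - n_v.
    if i > n_v and j <= n_v and j > i - n_v:
        return NEG_INF
    return causal_mask(i, j)


def build_mask(n_v: int, n_r: int) -> List[List[float]]:
    """Full (n_v + n_r) x (n_v + n_r) mask; row index is the query position."""
    n = n_v + n_r
    return [
        [streaming_mask(i, j, n_v) for j in range(1, n + 1)]
        for i in range(1, n + 1)
    ]


mask = build_mask(n_v=3, n_r=3)
```

With three visual and three reasoning tokens, the first reasoning token (index 4) can attend only to the first visual token, the second to the first two, and so on, while attention among reasoning tokens stays causal.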

Typical rotary position encoding (RoPE) indexes all tokens along a single axis, causing cross-modal relative offsets to drift as frames accumulate. TaYS instead assigns reasoning and visual tokens to independent positional axes:

$\mathrm{pos}(v_s) = s$
$\mathrm{pos}(r_t) = t$

The effect is a stable cross-modal RoPE: the dot product depends only on the relative time offset $t-s$ between reasoning unit $r_t$ and visual token $v_s$, regardless of prior sequence length.
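The offset-only dependence can be checked numerically with a small pure-Python RoPE. This is a sketch under stated assumptions: `rope` and `attention_score` are hypothetical helpers, the frequency schedule is the common base-10000 one, and the example vectors are arbitrary. Under decoupled axes the query for $r_t$ is rotated by $t$ and the key for $v_s$ by $s$; the single-axis case is modeled here, as an assumption, by shifting the reasoning index by the number of visual tokens $N_v$.

```python
import math
from typing import List


def rope(vec: List[float], pos: float, base: float = 10000.0) -> List[float]:
    """Rotate consecutive pairs of vec by pos * base**(-k/d) (standard RoPE)."""
    d = len(vec)
    out: List[float] = []
    for k in range(0, d, 2):
        theta = pos * base ** (-k / d)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[k], vec[k + 1]
        out += [x * c - y * s, x * s + y * c]
    return out


def attention_score(q: List[float], k: List[float], q_pos: float, k_pos: float) -> float:
    """Dot product of the RoPE-rotated query and key."""
    rq, rk = rope(q, q_pos), rope(k, k_pos)
    return sum(a * b for a, b in zip(rq, rk))


q = [1.0, 0.0, 0.5, 0.5]   # arbitrary query for reasoning unit r_t
k = [0.3, 0.7, 1.0, 0.0]   # arbitrary key for visual token v_s

# Decoupled axes: pos(r_t) = t, pos(v_s) = s. Same offset t - s = 2 both times.
early = attention_score(q, k, q_pos=5, k_pos=3)
late = attention_score(q, k, q_pos=105, k_pos=103)

# Single axis (assumed layout): the reasoning index is shifted by N_v,
# so the effective offset grows as more visual tokens accumulate.
single_axis_10 = attention_score(q, k, q_pos=10 + 2, k_pos=0)
single_axis_20 = attention_score(q, k, q_pos=20 + 2, k_pos=0)
```

Here `early` and `late` agree up to floating-point error, whereas `single_axis_10` and `single_axis_20` differ, which is the drift that the decoupled axes are meant to avoid.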

4. Parallelized CoT Generation: Training and Stream-Parallel Inference

Training uses stream-constrained trajectories: at each frame $t$, the model is provided with all visual tokens up to $F_t$ and the reasoning tokens generated so far, masked by streaming attention. The stream-constrained objective requires the model to emit $R_t$ (the current reasoning segment) followed by <EOT>, ensuring no access to future frames during generation.

At inference, the dual KV-cache design enables near-zero decoder-level time-to-first-token (TTFT). Empirical TTFTs are:

Batch TTFT $\approx 10.6$ s (must encode all frames before decoding)
Interleaved TTFT $\approx 0.03$ s (decodes only after frame encoding)
TaYS TTFT $\approx 10^{-6}$ s (encoding and decoding proceed fully in parallel)

TTFT in TaYS is given by

$\mathrm{TTFT} = T_{\mathrm{encode}(\text{first frame})} + T_{\mathrm{decode}(\text{first token})}$

with the terms overlapping completely, resulting in minimal latency.

5. Empirical Evaluation: Benchmarks, Accuracy, and Latency

Evaluation on an extended VideoEspresso benchmark (including event dynamics, causal reasoning, thematic understanding, cooking-process analysis, and traffic analysis) shows TaYS outperforming the batch regime at both model sizes and the interleaved regime at 7B; at 3B, interleaved fine-tuning is marginally ahead (33.96% vs. 33.45%).

Fine-tuning Qwen2.5-VL on identical streaming CoT trajectories yields:

Model Size	Batch SFT	Interleaved SFT	TaYS SFT
3B	29.18%	33.96%	33.45%
7B	30.38%	34.98%	36.86%
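The gaps between regimes are easier to read as percentage-point differences. The short sketch below recomputes them from the table values; the dictionary layout and the `tays_gain` helper are illustrative only.

```python
from typing import Dict

# Values copied from the table above (percent).
results: Dict[str, Dict[str, float]] = {
    "3B": {"Batch SFT": 29.18, "Interleaved SFT": 33.96, "TaYS SFT": 33.45},
    "7B": {"Batch SFT": 30.38, "Interleaved SFT": 34.98, "TaYS SFT": 36.86},
}


def tays_gain(row: Dict[str, float]) -> Dict[str, float]:
    """Percentage-point difference of TaYS SFT over each other regime."""
    return {
        name: round(row["TaYS SFT"] - value, 2)
        for name, value in row.items()
        if name != "TaYS SFT"
    }


gains = {size: tays_gain(row) for size, row in results.items()}
for size, diff in gains.items():
    print(size, diff)
```

At 7B, TaYS leads the batch regime by 6.48 points and the interleaved regime by 1.88 points; at 3B it leads batch by 4.27 points but trails interleaved by 0.51 points.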

In subjective GPT-5 rankings, TaYS provides the preferred output in 43.7% of comparisons (vs. 31.4% for Batch, 21.7% for Interleaved).

Latency, evaluated at 1–5 FPS, is as follows:

TaYS maintains decoder-level TTFT $\approx 10^{-6}$ s and stable end-to-end delay $\approx 12$ s across frame rates.
Batch TTFT is fixed at $\approx 10.6$ s.
Interleaved delay increases with frame rate (up to 20 s at 5 FPS).

Temporal analyses indicate that TaYS's reasoning segments temporally peak within 0.69 s of annotated keyframes versus 1.52 s for the Interleaved regime. TaYS exhibits smoother semantic transitions, with fewer repeated peaks in reasoning similarity scores.

6. Architectural and Practical Implications

TaYS's streaming attention, modality-decoupled RoPE, and dual KV-cache design collectively enable a “think-while-watching” workflow in LVLMs. The framework gains efficiency by allowing visual encoding and language token generation to run concurrently, providing low-latency, temporally synchronized reasoning. This directly addresses applications in surveillance, robotic teleoperation, and live video analysis, where real-time responsiveness and temporally fine-grained reasoning are essential (Zhang et al., 3 Mar 2026).

A plausible implication is that further advances in streaming neural architectures will depend critically on cache separation and modality-aware positional encoding to achieve truly concurrent, low-latency multimodal reasoning.


References (1)

Think-as-You-See: Streaming Chain-of-Thought Reasoning for Large Vision-Language Models (2026)
