Time-CoT: Streaming Video Reasoning
- Time-CoT is a streaming paradigm that enables real-time Chain-of-Thought reasoning in LVLMs, facilitating low-latency and temporally aligned multi-modal processing.
- It employs a dual KV-cache mechanism to decouple visual encoding from textual decoding, drastically reducing the time-to-first-token compared to conventional batch approaches.
- The framework uses streaming attention masks and modality-decoupled positional encodings to enforce temporal causality, enhancing performance in applications like surveillance and robotics.
Time Chain-of-Thought (Time-CoT), and specifically the "Think-as-You-See" (TaYS) framework, constitutes a streaming paradigm for Chain-of-Thought reasoning in Large Vision-Language Models (LVLMs). In contrast to traditional batch processing, where the entire video sequence must be available before any inference is possible, Time-CoT enables temporally aligned, low-latency reasoning by equipping LVLMs to interleave multi-modal perception and natural language reasoning in real time. The framework centers on aligned reasoning units, a dual KV-cache structure, streaming attention masks, modality-decoupled positional encodings, and parallelized CoT generation, collectively yielding improved accuracy and significantly reduced latency for video reasoning tasks (Zhang et al., 3 Mar 2026).
1. Streaming CoT Problem Formulation and Aligned Reasoning Units
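Conventional offline video Chain-of-Thought (CoT) reasoning presents the full video to the model's encoder before any language tokens are decoded. With frames $v_{1:T}$, question $q$, and output $y$:
$$y = \mathrm{Dec}\big(\mathrm{Enc}(v_{1:T}),\, q\big)$$
This batch-style approach is unsuitable for real-world streaming scenarios. The streaming CoT regime instead restricts the encoder's input to the prefix $v_{1:t}$ available at the current time $t$, with reasoning outputs $r_t$ interleaved stepwise as frames are received:
$$r_t = \mathrm{Dec}\big(\mathrm{Enc}(v_{1:t}),\, q,\, r_{<t}\big), \quad t = 1, \dots, T$$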
At each step, sampled frames are associated with triplets (Question, Reasoning, Answer), minimally segmented by an <EOT> delimiter. This enforces that at inference, the model emits temporally aligned reasoning units immediately following each frame’s arrival. The dataset for such streaming trajectory supervision is derived by temporally segmenting keyframe annotations.
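The temporal segmentation described above can be illustrated with a minimal sketch. It assumes annotations arrive as `(timestamp, text)` pairs and that each one is attached to the first sampled frame at or after its timestamp; the function name `align_units` and the data layout are hypothetical and not taken from the paper.

```python
from bisect import bisect_left

EOT = "<EOT>"  # delimiter named in the text


def align_units(frame_times, annotations):
    """Attach each (timestamp, text) annotation to the first sampled frame
    at or after its timestamp, and close every frame's unit with <EOT>.

    frame_times must be sorted in ascending order (assumption of this sketch).
    """
    per_frame = [[] for _ in frame_times]
    for ts, text in sorted(annotations):
        idx = bisect_left(frame_times, ts)
        if idx < len(frame_times):  # annotations after the last frame are dropped
            per_frame[idx].append(text)
    # A frame with no annotation still emits a bare <EOT>, i.e. an empty unit.
    return [" ".join(texts) + EOT for texts in per_frame]


frame_times = [1.0, 2.0, 3.0]
annotations = [
    (0.4, "A pan is on the stove."),
    (2.0, "Oil is poured in."),
    (2.7, "Onions are added."),
]
units = align_units(frame_times, annotations)
```

Because an annotation is never assigned to a frame earlier than its own timestamp, no reasoning unit in this sketch refers to content the model has not yet seen.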
2. Parallel Dual KV-Cache Mechanism
TaYS achieves true concurrency by decoupling visual encoding from textual decoding via a dual KV-cache:
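- $\mathcal{K}_V$: a read-only video (visual) cache storing the encoded visual tokens of all arrived frames,
- $\mathcal{K}_T$: a growing text cache accumulating all previously generated reasoning tokens.
At each new frame $v_t$:
- $\mathcal{K}_V \leftarrow \mathcal{K}_V \cup \mathrm{Enc}(v_t)$ (frame encoding is non-blocking and independent of reasoning state),
- Context for decoding is constructed as $\mathrm{ctx}_t = \mathrm{Merge}(\mathcal{K}_V, \mathcal{K}_T)$,
- $r_t = \mathrm{Dec}(\mathrm{ctx}_t)$,
- $\mathcal{K}_T \leftarrow \mathcal{K}_T \cup \mathrm{KV}(r_t)$.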
Merge operations are pointer-based and do not require synchronization, allowing reasoning token generation to proceed without waiting for visual cache updates. In contrast, interleaved single-cache paradigms stall token generation until new visual embeddings are available.
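The cache separation above can be sketched with a toy data structure. This is an illustration under stated assumptions, not the TaYS implementation: cache entries are strings rather than key/value tensors, `toy_encode` and `toy_decode` are hypothetical stand-ins for the real encoder and decoder, and the loop runs sequentially, so actual concurrency is not modelled.

```python
EOT = "<EOT>"


class DualKVCache:
    """Toy stand-in for the two caches; entries are strings, not tensors."""

    def __init__(self):
        self.visual = []  # written only by the encoder side
        self.text = []    # written only by the decoder side

    def add_frame(self, frame_tokens):
        # Encoder path: never reads or writes the text cache.
        self.visual.extend(frame_tokens)

    def merged_view(self):
        # "Pointer-based" merge: return references plus the lengths visible
        # at this instant; nothing is copied.
        return (self.visual, len(self.visual)), (self.text, len(self.text))

    def add_reasoning(self, tokens):
        # Decoder path: never writes the visual cache.
        self.text.extend(tokens)


def toy_encode(frame_id, tokens_per_frame=2):
    # Hypothetical encoder: two placeholder visual tokens per frame.
    return [f"v{frame_id}.{k}" for k in range(tokens_per_frame)]


def toy_decode(view):
    # Hypothetical decoder: reports how much context it could see.
    (_, n_visual), (_, n_text) = view
    return [f"r(seen={n_visual},prev={n_text})", EOT]


cache = DualKVCache()
for frame_id in range(3):
    cache.add_frame(toy_encode(frame_id))
    cache.add_reasoning(toy_decode(cache.merged_view()))
```

Each side appends only to its own list, which is what lets a real system run the two paths on separate workers; the merged view is a pair of references with a recorded length, so building it does not block either writer.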
3. Streaming Attention Masks and Modality-Decoupled Positional Encodings
Temporal causality is enforced through a two-part streaming attention mask over the joint sequence:
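- For a reasoning token $r_i$ emitted at step $t_i$ attending to a visual token $v_j$ from frame $\tau_j$, access is restricted to frames that have already arrived ($\tau_j \le t_i$) via the mask $M$:
$$M_{ij} = \begin{cases} 0 & \text{if } j \text{ is a visual token and } \tau_j \le t_i, \\ M^{\text{causal}}_{ij} & \text{if } j \text{ is a reasoning token}, \\ -\infty & \text{otherwise}, \end{cases}$$
where $M^{\text{causal}}$ is the standard autoregressive mask.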
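A minimal sketch of such a mask, written as a boolean "may attend" matrix over the joint sequence `[visual tokens | reasoning tokens]`. Two choices here are assumptions of the sketch rather than details confirmed by the source: visual tokens attend causally over frames, and visual tokens never attend to reasoning tokens. The function name `streaming_mask` is hypothetical.

```python
def streaming_mask(visual_frame_ids, text_step_ids):
    """Return allowed[i][j]: True if query token i may attend to key token j.

    visual_frame_ids[j] is the frame index of visual token j.
    text_step_ids[a] is the frame step at which reasoning token a is emitted.
    """
    nv, nt = len(visual_frame_ids), len(text_step_ids)
    n = nv + nt
    allowed = [[False] * n for _ in range(n)]

    # Visual -> visual: causal over frames (assumption of this sketch).
    for i in range(nv):
        for j in range(nv):
            allowed[i][j] = visual_frame_ids[j] <= visual_frame_ids[i]

    for a in range(nt):
        i = nv + a
        # Reasoning -> visual: only frames that have arrived by this step.
        for j in range(nv):
            allowed[i][j] = visual_frame_ids[j] <= text_step_ids[a]
        # Reasoning -> reasoning: standard autoregressive mask.
        for b in range(a + 1):
            allowed[i][nv + b] = True
    return allowed


# Two frames with two visual tokens each; two reasoning tokens at step 0,
# one at step 1.
mask = streaming_mask([0, 0, 1, 1], [0, 0, 1])
```

In this example the first reasoning token (row 4) can see only the two visual tokens of frame 0, while the reasoning token emitted at step 1 (row 6) can see all four visual tokens and every earlier reasoning token.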
Standard rotary position embedding (RoPE) indexes all tokens along a single axis, causing cross-modal relative offsets to drift as frames accumulate. TaYS instead assigns reasoning and visual tokens to independent positional axes, so a token's index no longer depends on how many tokens of the other modality precede it.
The effect is a stable cross-modal RoPE: the attention dot-product depends only on the relative time offset between a reasoning unit and a visual token, regardless of prior sequence length.
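The relative-offset property can be checked numerically with a small sketch of standard RoPE. The assumption made here, which the source does not spell out, is that under decoupled axes both a reasoning unit and a visual token are indexed by their time step, so their offset is a time difference; `rope_rotate` and `score` are hypothetical helper names.

```python
import math


def rope_rotate(vec, pos, base=10000.0):
    """Apply standard RoPE to an even-length vector at position `pos`."""
    d = len(vec)
    out = []
    for k in range(0, d, 2):
        theta = pos * base ** (-k / d)  # per-pair rotation angle
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[k], vec[k + 1]
        out += [x * c - y * s, x * s + y * c]
    return out


def score(q, q_pos, k, k_pos):
    """Attention logit between a rotated query and a rotated key."""
    rq, rk = rope_rotate(q, q_pos), rope_rotate(k, k_pos)
    return sum(a * b for a, b in zip(rq, rk))


q = [1.0, 0.0, 0.5, -0.5]   # toy reasoning-token query
k = [0.3, 0.7, -0.2, 0.4]   # toy visual-token key

early = score(q, 5, k, 3)       # offset 2, short history
late = score(q, 105, k, 103)    # offset 2, long history
drifted = score(q, 9, k, 3)     # offset 6: what index drift would produce
```

`early` and `late` agree up to floating-point error because RoPE logits depend only on the position difference, whereas `drifted` differs. On a single shared axis, the index gap between a reasoning token and the frame it describes grows with everything emitted in between, which is the drift that separate axes avoid.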
4. Parallelized CoT Generation: Training and Stream-Parallel Inference
Training uses stream-constrained trajectories: at each frame $t$, the model is provided with all visual tokens up to $t$ and the reasoning tokens generated so far, masked by streaming attention. The stream-constrained objective requires the model to emit $r_t$ (the current reasoning segment) followed by <EOT>, ensuring no access to future frames during generation.
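As a rough illustration of stream-constrained supervision, the sketch below unrolls one trajectory into per-step samples. The dictionary layout and the name `stream_constrained_samples` are hypothetical; frames and reasoning segments are plain strings standing in for token sequences.

```python
EOT = "<EOT>"


def stream_constrained_samples(frames, reasoning):
    """Yield one training sample per frame step.

    At step t the model may condition on frames[0..t] and on the reasoning
    segments emitted before t, and must produce reasoning[t] followed by <EOT>.
    """
    assert len(frames) == len(reasoning)
    for t in range(len(frames)):
        yield {
            "visual": frames[: t + 1],     # no future frames
            "prefix": reasoning[:t],       # reasoning so far
            "target": reasoning[t] + EOT,  # current segment, then delimiter
        }


frames = ["f0", "f1", "f2"]
reasoning = ["pan on stove", "oil added", "onions added"]
samples = list(stream_constrained_samples(frames, reasoning))
```

In practice a single forward pass with the streaming attention mask can supervise all steps at once; the explicit unrolling here only makes visible which inputs each target is allowed to depend on.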
At inference, the dual KV-cache design enables near-zero decoder-level time-to-first-token (TTFT). Empirical TTFTs are:
- Batch TTFT s (must encode all frames before decoding)
- Interleaved TTFT s (decodes only after frame encoding)
- TaYS TTFT s (encoding and decoding proceed fully in parallel)
In TaYS, the encoding and decoding terms of the TTFT overlap completely, resulting in minimal latency.
5. Empirical Evaluation: Benchmarks, Accuracy, and Latency
Evaluation on an extended VideoEspresso benchmark (including event dynamics, causal reasoning, thematic understanding, cooking-process analysis, and traffic analysis) shows that TaYS outperforms the batch regime at both model sizes and the interleaved regime at 7B, while trailing the interleaved regime by 0.51 points at 3B.
Fine-tuning Qwen2.5-VL on identical streaming CoT trajectories yields:
| Model Size | Batch SFT | Interleaved SFT | TaYS SFT |
|---|---|---|---|
| 3B | 29.18% | 33.96% | 33.45% |
| 7B | 30.38% | 34.98% | 36.86% |
In subjective GPT-5 rankings, TaYS provides the preferred output in 43.7% of comparisons (vs. 31.4% for Batch, 21.7% for Interleaved).
Latency, evaluated at 1–5 FPS, is as follows:
- TaYS maintains decoder-level TTFT s and stable end-to-end delay s across frame rates.
- Batch TTFT is fixed at s.
- Interleaved delay increases with frame rate (up to 20 s at 5 FPS).
Temporal analyses indicate that TaYS's reasoning segments temporally peak within 0.69 s of annotated keyframes versus 1.52 s for the Interleaved regime. TaYS exhibits smoother semantic transitions, with fewer repeated peaks in reasoning similarity scores.
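One simple way to quantify this kind of temporal alignment is the mean distance from each reasoning segment's timestamp to its nearest annotated keyframe. The sketch below is an illustrative proxy only; the source does not specify how its 0.69 s and 1.52 s figures are computed, and `mean_nearest_offset` and the sample timestamps are hypothetical.

```python
def mean_nearest_offset(segment_times, keyframe_times):
    """Average absolute gap (seconds) from each segment to its nearest keyframe."""
    gaps = [min(abs(s - k) for k in keyframe_times) for s in segment_times]
    return sum(gaps) / len(gaps)


# Made-up timestamps for illustration, not data from the paper.
segment_times = [1.2, 4.9, 9.5]
keyframe_times = [1.0, 5.0, 9.0]
offset = mean_nearest_offset(segment_times, keyframe_times)
```

A smaller value means reasoning is emitted closer in time to the events it describes.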
6. Architectural and Practical Implications
TaYS's streaming attention, modality-decoupled RoPE, and dual KV-cache design collectively enable a “think-while-watching” workflow in LVLMs. The framework achieves higher efficiency by allowing concurrent visual encoding and language token generation, providing both low latency and temporally synchronized reasoning. This directly addresses applications in surveillance, robotic teleoperation, and live video analysis, where real-time responsiveness and temporally fine-grained reasoning are essential (Zhang et al., 3 Mar 2026).
A plausible implication is that further advances in streaming neural architectures will depend critically on cache separation and modality-aware positional encoding to achieve truly concurrent, low-latency multimodal reasoning.