
Temporal CoT Prompting in Streaming LLMs

Updated 22 January 2026
  • Temporal Chain-of-Thought prompting is a dynamic method that incrementally updates reasoning in large language models under streaming and memory constraints.
  • It employs heuristic filtering based on correctness and depth to select and truncate prompt exemplars, ensuring compliance with strict token limits.
  • Experiments reveal that even with shallow or partially incorrect chains, the method achieves competitive performance compared to full-batch static strategies.

Temporal Chain-of-Thought (T-CoT) prompting refers to a formulation of Chain-of-Thought (CoT) prompting for LLMs under streaming or sequential batch conditions, in which data arrives incrementally and prompt construction must adapt as new batches are processed. Rather than assuming the availability of the entire test set for exemplar selection and rationale generation, T-CoT approaches treat the arrival order of batches as a streaming constraint on prompt maintenance and update. This paradigm captures a more realistic operational scenario for LLM deployment, where complete test data is not available a priori, and prompts must be updated dynamically while remaining within strict input length limitations (Tang, 2023).

1. Formalization of Streaming Batch CoT Prompting

The problem comprises a test set $D$ of size $|D|$, partitioned into $m$ sequential batches of $N$ examples each, to be processed by an LLM $M$. Batch $k$ supplies questions $q_1^{(k)}, \ldots, q_N^{(k)}$, and processing starts from an initial fixed prompt $P_1$. At each batch step $k$, the model generates CoT rationales $c_i^{(k)} \leftarrow M(P_k \,\|\, q_i^{(k)})$, forms question–rationale pairs $S_k = \{ (q_i^{(k)} \,\|\, c_i^{(k)}) \}$, and updates the prompt:

$$P_{k+1} = f(P_k \mid S_k)$$

where $f$ is a black-box function subject to the constraint $|P_k| \leq L_{\text{max}}$ (the model input length). The objective is to choose or learn $f$ so as to maximize final test accuracy (or another relevant metric) after all $m$ batches:

$$\max_f ~ \operatorname{Acc}( M(P_{m+1}, D) ) ~~\text{subject to}~~ \forall k,~ |P_k| \leq L_{\text{max}}$$

This formulation grounds T-CoT in an incremental, memory-constrained regime that departs from conventional static or full-dataset CoT prompting (Tang, 2023).
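The recurrence and the length constraint can be sketched in Python as follows. This is a minimal illustration, not the author's implementation: the names `run_stream`, `prompt_len`, and `UpdateFn` are hypothetical, the model is any callable from text to text, and prompt length is assumed to be counted in whitespace-separated tokens rather than by a real tokenizer. It covers only the update rule and the constraint, not the accuracy objective.

```python
from typing import Callable, List, Sequence, Tuple

Pair = Tuple[str, str]  # (question, rationale)
UpdateFn = Callable[[str, List[Pair]], str]  # the black-box update function f


def prompt_len(prompt: str) -> int:
    """Assumption: length is measured in whitespace-separated tokens."""
    return len(prompt.split())


def run_stream(
    model: Callable[[str], str],
    batches: Sequence[Sequence[str]],
    p1: str,
    f: UpdateFn,
    l_max: int,
) -> str:
    """Process batches in arrival order and return the final prompt P_{m+1}."""
    prompt = p1
    for questions in batches:
        # Enforce |P_k| <= L_max before the prompt is used on batch k.
        if prompt_len(prompt) > l_max:
            raise ValueError("prompt exceeds L_max")
        # c_i^(k) <- M(P_k || q_i^(k)) for every question in the batch.
        s_k = [(q, model(prompt + "\n" + q)) for q in questions]
        # P_{k+1} = f(P_k | S_k)
        prompt = f(prompt, s_k)
    # The final prompt P_{m+1} must satisfy the constraint as well.
    if prompt_len(prompt) > l_max:
        raise ValueError("prompt exceeds L_max")
    return prompt
```

Any update strategy discussed later can be passed in as `f`. An `f` that returns a prompt longer than `l_max` causes the run to fail instead of silently overflowing the model input.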

2. Prompt Construction and Update Algorithms

In the streaming-batch setting, prompt construction requires sequential updating. The baseline (Auto-CoT) approach simply concatenates all new $(q \,\|\, c)$ pairs into the prompt at each step:

$$P_{k+1} \leftarrow P_k \,\|\, S_k$$
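To get a feel for how fast pure concatenation grows the prompt, the sketch below counts the update steps until the length limit is first exceeded. It assumes every question–rationale pair costs the same number of tokens. The function name and all numbers in the usage note are hypothetical and not taken from the source.

```python
def first_overflow_step(
    p1_tokens: int, pairs_per_batch: int, tokens_per_pair: int, l_max: int
) -> int:
    """Smallest k such that |P_{k+1}| > l_max when P_{k+1} = P_k || S_k.

    Returns 0 if the initial prompt P_1 is already too long.
    """
    growth = pairs_per_batch * tokens_per_pair
    if growth <= 0:
        raise ValueError("each batch must add at least one token")
    size = p1_tokens
    k = 0
    while size <= l_max:
        k += 1
        size += growth  # every pair from batch k is appended
    return k
```

For example, with a 20-token initial prompt, 60 pairs per batch, an assumed 80 tokens per pair, and an assumed budget of 4000 tokens, `first_overflow_step(20, 60, 80, 4000)` returns 1: the prompt is already too long after the first update.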

However, this method quickly breaches the $L_{\text{max}}$ constraint. Empirical heuristics are therefore employed to select which $(q \,\|\, c)$ pairs are retained. Two principal criteria are investigated:

  • Correctness: Retain only the $c_i^{(k)}$ that yield correct answers ("Correct-CoT"), or alternatively, intentionally retain a set in which more than 50% of the chains are incorrect ("Wrong-CoT").
  • Depth (Rationale Length): Filter on the number of lines in the rationale; "Deep-CoT" keeps rationales with at least $\xi$ lines, "Shallow-CoT" keeps those with fewer than $\xi$ lines.
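The two criteria can be sketched as simple filters. This is an illustrative reading, not the paper's code: the `Pair` record, the `is_correct` flag (which assumes some correctness signal is available for each chain), the default threshold `xi=3`, and the way the Wrong-CoT mix is built are all assumptions.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Pair:
    question: str
    rationale: str
    is_correct: bool  # assumption: a correctness signal exists for each chain


def num_lines(rationale: str) -> int:
    """Depth proxy: number of non-empty lines in the rationale."""
    return sum(1 for line in rationale.splitlines() if line.strip())


def select_subset(pairs: List[Pair], criterion: str, xi: int = 3) -> List[Pair]:
    if criterion == "correct":  # Correct-CoT
        return [p for p in pairs if p.is_correct]
    if criterion == "wrong":  # Wrong-CoT: incorrect chains form a strict majority
        wrong = [p for p in pairs if not p.is_correct]
        right = [p for p in pairs if p.is_correct]
        # Keep strictly fewer correct chains than incorrect ones.
        return wrong + right[: max(len(wrong) - 1, 0)]
    if criterion == "deep":  # Deep-CoT: at least xi lines
        return [p for p in pairs if num_lines(p.rationale) >= xi]
    if criterion == "shallow":  # Shallow-CoT: fewer than xi lines
        return [p for p in pairs if num_lines(p.rationale) < xi]
    raise ValueError(f"unknown criterion: {criterion}")
```

The Wrong-CoT branch is only one way to satisfy "more than 50% incorrect"; other mixes with the same property would fit the description equally well.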

After selection, the prompt is truncated as needed to respect $L_{\text{max}}$. This procedure is summarized in the following pseudo-code:

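for k = 1 to m:
    # generate a rationale for every question in batch k using the current prompt
    for i = 1 to N:
        c_i^(k) = M(P_k || q_i^(k))
    S_k = { (q_i^(k) || c_i^(k)) : i = 1..N }
    # keep only the pairs that pass the chosen heuristic
    S̃_k = select_subset(S_k, criterion = correctness or depth)
    # append the survivors and cut the prompt back to the length limit
    P_{k+1} = truncate_to_max_length(P_k || S̃_k, L_max)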
(Tang, 2023)
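The truncation step is not specified further in the text above. The sketch below shows one possible policy, stated as an assumption: the prompt is kept as a fixed header plus a list of exemplar strings, and the oldest exemplars are dropped first until the token budget is met. The names `count_tokens` and `truncate_exemplars` are hypothetical, and whitespace splitting stands in for a real tokenizer.

```python
from typing import List


def count_tokens(text: str) -> int:
    """Assumption: whitespace tokens approximate the model's token count."""
    return len(text.split())


def truncate_exemplars(header: str, exemplars: List[str], l_max: int) -> List[str]:
    """Drop the oldest exemplars until header + exemplars fit within l_max tokens.

    Assumed policy (oldest first); the source does not say which exemplars go.
    """
    kept = list(exemplars)  # do not mutate the caller's list
    used = count_tokens(header) + sum(count_tokens(e) for e in kept)
    while kept and used > l_max:
        used -= count_tokens(kept.pop(0))  # index 0 holds the oldest exemplar
    return kept
```

Dropping whole exemplars avoids leaving a rationale cut off mid-sentence in the prompt. A policy that drops the longest or the lowest-scoring exemplars first would slot into the same place.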

3. Temporal Structure and the Notion of “Temporal CoT”

Despite the temporal terminology, the approach does not introduce an explicit model of temporal dependency or time-decay across batches. The only modeled temporal aspect is the sequential index $k$, with earlier batches contributing exemplars that persist or are discarded in subsequent prompt updates. There is no inter-batch memory, cross-batch relational modeling, or explicit tracking of temporal drift. The accumulation and pruning of exemplars is governed solely by input length and heuristic selection, rather than by dynamic or learned temporal mechanisms. In effect, “temporal order” is equivalent to the sequential batch index and prompt accumulation under streaming constraints, not a learned or inferentially modeled temporal chain (Tang, 2023).

4. Experimental Setup and Quantitative Findings

Experiments are conducted using OpenAI text-davinci-002 on four datasets, each divided into 10 streaming batches: MultiArith (arithmetic, batch size 60), GSM8K (arithmetic, 64), StrategyQA (commonsense, 32), and Letter (symbolic, 81). Baselines include Zero-Shot-CoT (single “Let’s think step by step” prompt) and bootstrap Auto-CoT. Heuristic variants of $f$ are evaluated: Correct-CoT vs. Wrong-CoT, Deep-CoT vs. Shallow-CoT.
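The stated batch sizes imply how many questions each stream contains, assuming all 10 batches are full. The short computation below makes those totals explicit; the dictionary simply restates the figures given above.

```python
NUM_BATCHES = 10  # each dataset is divided into 10 streaming batches

# Batch sizes as stated in the experimental setup.
BATCH_SIZE = {"MultiArith": 60, "GSM8K": 64, "StrategyQA": 32, "Letter": 81}


def questions_seen(batch_size: int, num_batches: int = NUM_BATCHES) -> int:
    """Total questions processed if every batch is full (an assumption)."""
    return batch_size * num_batches


totals = {name: questions_seen(size) for name, size in BATCH_SIZE.items()}
# totals == {"MultiArith": 600, "GSM8K": 640, "StrategyQA": 320, "Letter": 810}
```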

Main findings:

  • Wrong-CoT: Prompts containing more than half incorrect chain-of-thought examples suffer minimal degradation in performance compared to Correct-CoT.
  • Shallow-CoT: Shorter, shallower rationales outperform Deep-CoT, presumably because they are less redundant and use fewer tokens, which matters under the strict token budget as the prompt grows.
  • Both heuristics yield performance that is competitive with, and in certain cases surpasses, the naive Auto-CoT baseline, while remaining within $L_{\text{max}}$ (Tang, 2023).

5. Limitations and Potential Extensions

Substantial limitations characterize the current formulation:

  • No Learned Selection Strategy: The prompt update heuristic $f$ is hand-crafted based on correctness or rationale length. More general approaches could learn to score and select exemplars, potentially using a policy network or reinforcement learning to optimize final accuracy.
  • Lack of Inter-Batch Memory: Exemplar maintenance is limited to a flat list; there is no mechanism for time-decayed memory, clustering, or retrieval-based pools that could better capture distributional drift of incoming questions.
  • Static Heuristics: The correctness and depth thresholds are fixed and not adapted based on validation data or batch performance.
  • No Rich Temporal Modeling: The streaming framework does not exploit possible inter-chain or cross-batch dependencies that may arise in temporally drifting or context-evolving data. A more fully realized “temporal CoT” method would model such correlations, track latent state, or permit cross-batch reference (Tang, 2023).

A plausible implication is that extending streaming-batch CoT to incorporate learned, adaptive prompt update mechanisms and richer temporal dependencies could address current deficits and improve reasoning performance in dynamic, non-stationary environments.

6. Contextualization within Chain-of-Thought Prompting Research

The streaming batch setting underscores a practical distinction from previous CoT methods, which assume full test set visibility and offline prompt optimization. Prior works such as Auto-CoT employed static, full-batch selection strategies unsuited to incremental or deployment contexts. The streaming approach of Tang et al. foregrounds the challenge of balancing prompt informativeness, redundancy, and length within strict limits, and shows that even minimal heuristic filtering can maintain accuracy or, in the case of shallow rationales, improve it. These findings motivate further research into adaptive, temporally aware prompt maintenance for robust LLM reasoning in real-world continuous data settings (Tang, 2023).

References (1)
