Temporal CoT Prompting in Streaming LLMs
- Temporal Chain-of-Thought prompting is a dynamic method that incrementally updates reasoning in large language models under streaming and memory constraints.
- It employs heuristic filtering based on correctness and depth to select and truncate prompt exemplars, ensuring compliance with strict token limits.
- Experiments reveal that even with shallow or partially incorrect chains, the method achieves competitive performance compared to full-batch static strategies.
Temporal Chain-of-Thought (T-CoT) prompting refers to a formulation of Chain-of-Thought (CoT) prompting for LLMs under streaming or sequential batch conditions, in which data arrives incrementally and prompt construction must adapt as new batches are processed. Rather than assuming the availability of the entire test set for exemplar selection and rationale generation, T-CoT approaches treat the arrival order of batches as a streaming constraint on prompt maintenance and update. This paradigm captures a more realistic operational scenario for LLM deployment, where complete test data is not available a priori, and prompts must be updated dynamically while remaining within strict input length limitations (Tang, 2023).
1. Formalization of Streaming Batch CoT Prompting
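The problem comprises a test set of $m \cdot N$ questions, partitioned into $m$ sequential batches of $N$ examples each, to be processed by an LLM $M$. The $k$-th batch supplies questions $q_1^{(k)}, \dots, q_N^{(k)}$, and processing starts from an initial fixed prompt $P_1$. At each batch step $k$, the model generates CoT rationales $c_i = M(P_k + q_i^{(k)})$, forms question–rationale pairs $S_k = \{(q_i^{(k)}, c_i)\}_{i=1}^{N}$, and updates the prompt:
$$P_{k+1} = f(P_k, S_k),$$
where $f$ is a black-box function subject to the constraint $|P_{k+1}| \le L_{\max}$ (model input length). The objective is to choose or learn $f$ so as to maximize final test accuracy (or another relevant metric) after all batches:
$$f^{*} = \arg\max_{f} \; \mathrm{Acc}(f), \qquad \text{where } \mathrm{Acc}(f) \text{ is the accuracy over all } m \cdot N \text{ questions when prompts are updated by } f.$$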
This formulation grounds T-CoT in an incremental, memory-constrained regime that departs from conventional static or full-dataset CoT prompting (Tang, 2023).
2. Prompt Construction and Update Algorithms
In the streaming-batch setting, prompt construction requires sequential updating. The baseline (Auto-CoT) approach simply concatenates all new pairs into the prompt at each step:
$$P_{k+1} = P_k + S_k$$
However, this method quickly breaches the $L_{\max}$ constraint, so empirical heuristics are employed to select which pairs are retained. Two principal criteria are investigated:
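- Correctness: Retain only the pairs $(q_i^{(k)}, c_i)$ whose rationales yield correct answers ("Correct-CoT"), or alternatively, intentionally retain a set in which more than 50% of the chains are incorrect ("Wrong-CoT").
- Depth (Rationale Length): Filter on the number of lines in the rationale relative to a fixed threshold: "Deep-CoT" keeps rationales with many lines, "Shallow-CoT" keeps rationales with few lines.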
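The two criteria can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration rather than the source's implementation: the `Exemplar` record, the `depth_threshold` value of 3, and in particular the way "Wrong-CoT" is built (all incorrect chains plus just enough correct ones to keep incorrect chains in the majority) are one possible reading of the description above.

```python
from dataclasses import dataclass


@dataclass
class Exemplar:
    question: str
    rationale: str      # generated chain of thought, one step per line
    is_correct: bool    # whether the final answer matched the reference


def depth(ex: Exemplar) -> int:
    """Rationale depth, measured as the number of non-empty lines."""
    return sum(1 for line in ex.rationale.splitlines() if line.strip())


def select_subset(pairs, criterion, depth_threshold=3):
    """Filter question-rationale pairs by one of the four heuristics.

    depth_threshold is a hypothetical value, not taken from the source.
    """
    if criterion == "correct":
        return [p for p in pairs if p.is_correct]
    if criterion == "wrong":
        # Keep every incorrect chain, then add correct ones only while
        # incorrect chains remain a strict majority (> 50%).
        wrong = [p for p in pairs if not p.is_correct]
        right = [p for p in pairs if p.is_correct]
        return wrong + right[: max(len(wrong) - 1, 0)]
    if criterion == "deep":
        return [p for p in pairs if depth(p) >= depth_threshold]
    if criterion == "shallow":
        return [p for p in pairs if depth(p) < depth_threshold]
    raise ValueError(f"unknown criterion: {criterion}")
```

Keeping the four variants behind a single `criterion` argument mirrors the way the heuristics are compared pairwise (Correct vs. Wrong, Deep vs. Shallow) on the same pool of generated pairs.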
After selection, the prompt is truncated as needed to respect $L_{\max}$. This procedure is summarized in the following pseudo-code:
```
for k = 1 to m:
    for i = 1 to N:
        c_i = M(P_k + q_i^(k))
    S_k = { (q_i^(k) + c_i) }
    S̃_k = select_subset(S_k; criterion = correctness or depth)
    P_{k+1} = truncate_to_max_length(P_k + S̃_k, L_max)
```
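The pseudo-code above can be made concrete as a small Python sketch. Several details are assumptions, since the source does not specify them: tokens are approximated by whitespace splitting, truncation drops the oldest exemplars first, the initial prompt holds no exemplars, and `model` and `select` are hypothetical callables supplied by the caller.

```python
from typing import Callable, List, Tuple

Pair = Tuple[str, str]  # (question, rationale)


def count_tokens(text: str) -> int:
    # Assumption: whitespace tokens stand in for the model's real tokenizer.
    return len(text.split())


def render_shots(exemplars: List[Pair]) -> str:
    return "".join(f"Q: {q}\nA: {c}\n\n" for q, c in exemplars)


def truncate_to_max_length(exemplars: List[Pair], l_max: int) -> List[Pair]:
    """Drop the oldest exemplars until the rendered prompt fits in l_max tokens."""
    kept = list(exemplars)
    while kept and count_tokens(render_shots(kept)) > l_max:
        kept.pop(0)
    return kept


def stream_cot(batches: List[List[str]],
               model: Callable[[str], str],
               select: Callable[[List[Pair]], List[Pair]],
               l_max: int):
    prompt: List[Pair] = []   # P_1: assumed to contain no exemplars
    outputs: List[Pair] = []
    for batch in batches:     # k = 1..m
        # S_k: answer each question with the current prompt P_k.
        pairs = [
            (q, model(f"{render_shots(prompt)}Q: {q}\nA: Let's think step by step."))
            for q in batch
        ]
        outputs.extend(pairs)
        # P_{k+1}: append the selected pairs, then enforce the length limit.
        prompt = truncate_to_max_length(prompt + select(pairs), l_max)
    return outputs, prompt
```

Passing `select=lambda pairs: pairs` gives a length-limited version of the concatenation baseline; a correctness or depth filter can be substituted without changing the loop.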
3. Temporal Structure and the Notion of “Temporal CoT”
Despite the temporal terminology, the approach does not introduce an explicit model of temporal dependency or time-decay across batches. The only modeled temporal aspect is the sequential batch index $k$, with earlier batches contributing exemplars that persist or are discarded in subsequent prompt updates. There is no inter-batch memory, cross-batch relational modeling, or explicit tracking of temporal drift. The accumulation and pruning of exemplars is governed solely by input length and heuristic selection, rather than by dynamic or learned temporal mechanisms. In effect, “temporal order” is equivalent to the sequential batch index and prompt accumulation under streaming constraints, not a learned or inferentially modeled temporal chain (Tang, 2023).
4. Experimental Setup and Quantitative Findings
Experiments are conducted using OpenAI text-davinci-002 on four datasets, each divided into 10 streaming batches: MultiArith (arithmetic, batch size 60), GSM8K (arithmetic, 64), StrategyQA (commonsense, 32), and Letter (symbolic, 81). Baselines include Zero-Shot-CoT (single “Let’s think step by step” prompt) and bootstrap Auto-CoT. Heuristic variants of the prompt-update function are evaluated: Correct-CoT vs. Wrong-CoT, and Deep-CoT vs. Shallow-CoT.
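The batching scheme described above could be set up along the following lines. This is a sketch only: the batch sizes and the count of 10 batches come from the text, while the `stream_batches` helper and the decision to leave surplus questions unused are assumptions.

```python
from typing import Iterator, List, Sequence

# Batch sizes as reported in the text; each dataset is streamed in 10 batches.
BATCH_SIZES = {"MultiArith": 60, "GSM8K": 64, "StrategyQA": 32, "Letter": 81}
NUM_BATCHES = 10


def stream_batches(questions: Sequence[str], batch_size: int,
                   num_batches: int = NUM_BATCHES) -> Iterator[List[str]]:
    """Yield consecutive batches in arrival order.

    Assumption: questions beyond num_batches * batch_size are left unused.
    """
    for k in range(num_batches):
        batch = list(questions[k * batch_size:(k + 1) * batch_size])
        if not batch:
            break
        yield batch
```

Because the function is a generator, each batch becomes visible only when it is requested, which matches the constraint that later questions are not available when earlier prompts are built.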
Main findings:
- Wrong-CoT: Prompts containing more than half incorrect chain-of-thought examples suffer minimal degradation in performance compared to Correct-CoT.
- Shallow-CoT: Shorter, shallower rationales outperform Deep-CoT, presumably because they are less redundant and use the token budget more efficiently as the prompt grows.
- Both heuristics achieve performance competitive with the naive Auto-CoT baseline, and in certain cases surpass it, while remaining within the input length limit $L_{\max}$ (Tang, 2023).
5. Limitations and Potential Extensions
Substantial limitations characterize the current formulation:
- No Learned Selection Strategy: The prompt update heuristic is hand-crafted based on correctness or rationale length. More general approaches could learn to score and select exemplars, potentially using a policy network or reinforcement learning to optimize final accuracy.
- Lack of Inter-Batch Memory: Exemplar maintenance is limited to a flat list; there is no mechanism for time-decayed memory, clustering, or retrieval-based pools that could better capture distributional drift of incoming questions.
- Static Heuristics: The correctness and depth thresholds are fixed and not adapted based on validation data or batch performance.
- No Rich Temporal Modeling: The streaming framework does not exploit possible inter-chain or cross-batch dependencies that may arise in temporally drifting or context-evolving data. A more fully realized “temporal CoT” method would model such correlations, track latent state, or permit cross-batch reference (Tang, 2023).
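To make the "time-decayed memory" idea concrete, the sketch below shows one hypothetical form such an extension could take. It is not part of the method described in the source: the `PoolItem` record, the `quality` score, the `decay` hyperparameter, and the greedy budget fill are all assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PoolItem:
    text: str       # rendered question-rationale pair
    batch: int      # index k of the batch that produced it
    tokens: int     # token length of text
    quality: float  # hypothetical score, e.g. 1.0 if the answer was correct


def decayed_select(pool: List[PoolItem], current_batch: int,
                   l_max: int, decay: float = 0.8) -> List[PoolItem]:
    """Greedily fill the token budget with the highest-weight exemplars.

    weight = quality * decay ** age, so an exemplar's influence fades with
    the number of batches since it was generated.
    """
    ranked = sorted(
        pool,
        key=lambda it: it.quality * decay ** (current_batch - it.batch),
        reverse=True,
    )
    chosen, used = [], 0
    for item in ranked:
        if used + item.tokens <= l_max:
            chosen.append(item)
            used += item.tokens
    return sorted(chosen, key=lambda it: it.batch)  # restore arrival order
```

Unlike truncation, which discards exemplars only when the prompt overflows, a weighted pool of this kind would let an older high-quality exemplar outrank a newer low-quality one.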
A plausible implication is that extending streaming-batch CoT to incorporate learned, adaptive prompt update mechanisms and richer temporal dependencies could address current deficits and improve reasoning performance in dynamic, non-stationary environments.
6. Contextualization within Chain-of-Thought Prompting Research
The streaming batch setting underscores a practical distinction from previous CoT methods, which assume full test set visibility and offline prompt optimization. Prior works such as Auto-CoT employed static, full-batch selection strategies unsuited to incremental or deployment contexts. The streaming approach of Tang (2023) foregrounds the challenge of balancing prompt informativeness, redundancy, and length within strict limits, while showing that even minimal heuristic filtering can maintain accuracy or, in the case of shallow rationales, improve it. These findings motivate further research into adaptive, temporally aware prompt maintenance for robust LLM reasoning in real-world continuous data settings (Tang, 2023).