CoT-Collection Dataset Overview

Updated 14 November 2025
  • CoT-Collection is a large-scale, instruction-tuning dataset featuring over 1.84 million examples and 1,060 tasks across 26 diverse families.
  • The dataset utilizes a hybrid human-curated and model-driven pipeline, employing Codex for generating chain-of-thought rationales with rigorous filtering.
  • It enhances zero-shot and few-shot reasoning in language models through explicit CoT supervision and efficient fine-tuning protocols like LoRA.

The CoT-Collection Dataset is a large-scale instruction-tuning resource designed to equip LLMs with explicit chain-of-thought (CoT) reasoning capabilities across a broad spectrum of tasks. It was introduced to address the limitations of smaller (<100B parameter) LLMs in generating multi-step rationales, enabling such models to approach the zero-shot and few-shot reasoning capacities previously exclusive to much larger models. The dataset serves both as a pretraining corpus for intermediate rationale supervision and as a benchmark for research on reasoning transfer, task generalization, and template learning behaviors in natural language processing.

1. Dataset Composition and Structure

The CoT-Collection comprises 1,060 distinct instruction-formatted tasks, covering 1.84 million examples, each annotated with 1–5 CoT rationales. Tasks are distributed among 26 “families,” including multiple-choice QA, extractive QA, arithmetic word problems, commonsense reasoning, natural-language inference, symbolic logic, list-manipulation, dialogue, and code-oriented domains. Unlike the original Flan Collection, which included only nine CoT-augmented tasks, the CoT-Collection offers comprehensive CoT coverage across domains.

Table: Task Family Distribution (Selected)

Family | # Tasks | # Examples
Multi-choice QA | 220 | 420K
Extractive QA | 180 | 300K
Arithmetic Reasoning | 105 | 200K
Commonsense (SNI) | 150 | 250K
NLI (FLAN) | 80 | 150K
Logic & Symbolic | 75 | 120K
Dialogue/Code/Others | 250 | 400K

Each dataset example is a JSONL record with the fields {task_id, instruction, input, chain_of_thought, answer}, stored across ten shards of roughly 184K instances each. An example is constructed from the prompt instruction, an input (possibly empty), a CoT rationale terminated by an explicit answer token, and the ground-truth answer.
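
For concreteness, an invented record following this schema is shown below; the field values are illustrative only and do not reproduce an actual dataset instance.

# Hypothetical example record (all values invented for illustration).
example = {
    "task_id": "arithmetic_word_problems_0001",
    "instruction": "Solve the following word problem.",
    "input": "A shop sells 3 pens for $2. How much do 12 pens cost?",
    "chain_of_thought": "12 pens is 4 groups of 3 pens. Each group costs $2, "
                        "so the total is 4 * 2 = $8. [ANSWER] $8",
    "answer": "$8",
}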

2. Annotation and Generation Pipeline

Annotation follows a unified, partly human-curated and largely model-based process:

  1. Task family grouping: Each task is associated with a demonstration set D_{T_k} (typically 6–8 high-quality rationales hand-written by the authors).
  2. Rationale generation: OpenAI Codex (code-davinci-002) is prompted few-shot, in-context, with the "Let's think step by step" phrase, conditioning on the ground-truth label.
  3. Decoding: Nucleus sampling (top-p=0.8) with no-repeat n-gram enforcement generates five rationale candidates per example.
  4. Filtering: Post-processing retains rationales that contain the gold answer, are shorter than 512 tokens, are unique, and show no code degeneration or repeated sentences (see the sketch below).

Manual A/B testing and filtering are performed to ensure fluency and informativeness of initial demonstration rationales; all further rationale generation is model-driven according to the described criteria.
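
The described filters can be approximated as in the sketch below; this is not the released filtering code, and the token counting and sentence splitting are simplified assumptions.

import re

MAX_TOKENS = 512  # whitespace tokens as a stand-in for the actual tokenizer

def keep_rationale(rationale: str, gold_answer: str, seen: set) -> bool:
    """Approximate the post-generation filters for one candidate rationale."""
    # 1. Must contain the gold answer.
    if gold_answer.strip().lower() not in rationale.lower():
        return False
    # 2. Must respect the length limit.
    if len(rationale.split()) >= MAX_TOKENS:
        return False
    # 3. Must be unique among the candidates kept for this example.
    if rationale in seen:
        return False
    # 4. Must not degenerate into verbatim repeated sentences.
    sentences = [s.strip() for s in re.split(r'(?<=[.!?])\s+', rationale) if s.strip()]
    if len(sentences) != len(set(sentences)):
        return False
    seen.add(rationale)
    return True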

3. Usage and Model Fine-Tuning Protocol

For supervised fine-tuning, a representative preprocessing example is:

from datasets import load_dataset

# Load all JSONL shards into a single training split.
ds = load_dataset('json', data_files='cot_collection/*.jsonl', split='train')

def preprocess(ex):
    # Concatenate instruction, input, and the CoT trigger phrase into a prompt.
    prompt = ex['instruction'] + '\n' + ex['input'] + "\nLet's think step by step."
    return {'prompt': prompt, 'labels': ex['chain_of_thought'] + ' [ANSWER] ' + ex['answer']}
train_ds = ds.map(preprocess, remove_columns=ds.column_names)
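
Before training, the resulting prompt/labels strings must be tokenized for the target seq2seq model. A minimal sketch is given below, assuming a Flan-T5 checkpoint from the Hugging Face Hub; the checkpoint name and maximum lengths are illustrative assumptions rather than values from the dataset release.

from transformers import AutoTokenizer

# Tokenize prompts and targets for encoder-decoder (T5-style) training.
tok = AutoTokenizer.from_pretrained('google/flan-t5-xl')

def tokenize(ex):
    model_inputs = tok(ex['prompt'], truncation=True, max_length=1024)
    # T5-style targets are tokenized with the same tokenizer.
    model_inputs['labels'] = tok(ex['labels'], truncation=True, max_length=512)['input_ids']
    return model_inputs

tokenized_ds = train_ds.map(tokenize, remove_columns=train_ds.column_names)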

Fine-tuning uses the following hyperparameters. Flan-T5-3B: AdamW, batch size 64, learning rate 5e-5, gradient accumulation 8, 1 epoch. Flan-T5-11B: Adafactor, batch size 8, learning rate 1e-4. Mini-batches should be sampled across source collections so that each batch reflects the overall source mixture (e.g., FLAN 23.9%, P3 30.9%, SNI 25.5%, etc.). Efficient few-shot adaptation is possible via LoRA (rank=4), as sketched below.
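
A minimal LoRA configuration sketch using the Hugging Face peft library follows; only the rank comes from the protocol above, while the checkpoint name, alpha, dropout, and target modules are illustrative assumptions.

from transformers import AutoModelForSeq2SeqLM
from peft import LoraConfig, TaskType, get_peft_model

# Wrap a Flan-T5 checkpoint with rank-4 LoRA adapters (hyperparameters other
# than the rank are placeholders, not values from the paper).
base = AutoModelForSeq2SeqLM.from_pretrained('google/flan-t5-xl')
lora_cfg = LoraConfig(
    task_type=TaskType.SEQ_2_SEQ_LM,
    r=4,                        # rank used for few-shot CoT adaptation
    lora_alpha=16,              # scaling factor (assumed)
    lora_dropout=0.05,          # assumed
    target_modules=['q', 'v'],  # T5 attention query/value projections (assumed)
)
model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()  # only the adapter weights are trainable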

4. Evaluation, Metrics, and Results

CoT-Collection enables both zero-shot and few-shot evaluations. The primary metric is accuracy:

\text{Accuracy} = \frac{\#\text{Correct}}{\#\text{Total}} \times 100\%

\Delta \text{Accuracy} = \text{Accuracy}_{\text{CoT-tuned}} - \text{Accuracy}_{\text{base}}

On BIG-Bench-Hard (27 tasks):

Model | Direct | CoT Eval | Δ CoT vs. Flan
Flan-T5-3B | 37.1% | 34.1% | –
CoT-T5-3B (Ours) | 36.2% | 38.4% | +4.34%
Flan-T5-11B | 41.0% | 38.6% | –
CoT-T5-11B (Ours) | 42.6% | 42.2% | +2.60%

Few-shot (64-shot, 4 domains; using LoRA):

Model | # Trainable Params | Avg Acc | Δ vs. Flan
Flan-T5-3B full FT | 2.8B | 61.78% | –
CoT-T5-3B + LoRA CoT-FT | 2.35M | 64.02% | +2.24%
Flan-T5-11B + LoRA FT | 4.72M | 66.59% | –
CoT-T5-11B + LoRA CoT-FT | 4.72M | 68.96% | +2.37%
ChatGPT + ICL (64 demos) | – | 54.98% | –

Models fine-tuned on the CoT-Collection attain better zero-shot generalization and show marked improvements over equivalently sized Flan-tuned and ICL-only baselines.

5. Scientific Insights and Best Practices

Key findings:

  • Explicit CoT supervision is necessary for teaching models how to decompose and solve complex, multi-step problems; pure in-context learning with CoT prompts is insufficient for smaller models.
  • Diversity of task types in training, more so than volume, is essential for robust reasoning transfer; 10,000 diverse CoT examples from 1,060 tasks yield better generalization than 180,000 examples over just nine tasks.
  • Positive transfer is observed across task families, with no visible catastrophic forgetting on reused tasks.
  • LoRA adaptation enables efficient few-shot CoT learning with minimal parameter overhead.
  • Filtering rationales for answer-presence, brevity, and non-degeneration is a practical necessity; code-based filters are included in the release.

6. Limitations and Directions for Extension

CoT-Collection is exclusively English; current multilingual zero-shot performance is near zero for Korean, Chinese, and Japanese. Rationale generation depends on the proprietary Codex model; future iterations may leverage open-source LLMs or more advanced multi-path rationales such as "Tree of Thoughts". Recommended extensions include non-English settings and alternative approaches to CoT rationale generation and evaluation protocols.

Reliance on human-crafted demonstrations for each family is a critical design choice—expanding this to cover more instruction types or to support semi-automatic CoT generation in new domains would be a natural extension.

7. Availability and Impact

The dataset, code, and model checkpoints are publicly accessible. CoT-Collection is, to date, the largest and most diverse open instruction-tuning resource with CoT supervision for LLM reasoning. It serves as a reference corpus for model pretraining, as a basis for research in reasoning and transfer, and as a framework for future advances in medium- and low-parameter LM chain-of-thought generalization (Kim et al., 2023).

References

Kim et al. (2023). The CoT Collection: Improving Zero-shot and Few-shot Learning of Language Models via Chain-of-Thought Fine-Tuning. arXiv:2305.14045.