CoT-Collection Dataset Overview
- CoT-Collection is a large-scale, instruction-tuning dataset featuring over 1.84 million examples and 1,060 tasks across 26 diverse families.
- The dataset utilizes a hybrid human-curated and model-driven pipeline, employing Codex for generating chain-of-thought rationales with rigorous filtering.
- It enhances zero-shot and few-shot reasoning in language models through explicit CoT supervision and efficient fine-tuning protocols like LoRA.
The CoT-Collection Dataset is a large-scale instruction-tuning resource designed to equip LLMs with explicit chain-of-thought (CoT) reasoning capabilities across a broad spectrum of tasks. It was introduced to address the inherent limitations of smaller (<100B parameter) LLMs in generating multi-step rationales, enabling these models to approach the zero-shot and few-shot reasoning capabilities previously exclusive to much larger models. The dataset serves both as a pretraining corpus for intermediate rationale supervision and as a benchmark for research on reasoning transfer, task generalization, and template-learning behaviors in natural language processing.
1. Dataset Composition and Structure
The CoT-Collection comprises 1,060 distinct instruction-formatted tasks, covering 1.84 million examples, each annotated with 1–5 CoT rationales. Tasks are distributed among 26 “families,” including multiple-choice QA, extractive QA, arithmetic word problems, commonsense reasoning, natural-language inference, symbolic logic, list-manipulation, dialogue, and code-oriented domains. Unlike the original Flan Collection, which included only nine CoT-augmented tasks, the CoT-Collection offers comprehensive CoT coverage across domains.
Table: Task Family Distribution (Selected)
| Family | #Tasks | #Examples |
|---|---|---|
| Multi-choice QA | 220 | 420K |
| Extractive QA | 180 | 300K |
| Arithmetic Reasoning | 105 | 200K |
| Commonsense (SNI) | 150 | 250K |
| NLI (FLAN) | 80 | 150K |
| Logic & Symbolic | 75 | 120K |
| Dialogue/Code/Others | 250 | 400K |
Each dataset example is a JSONL record with fields: {task_id, instruction, input, chain_of_thought, answer} stored in ten ≈184K-instance shards. Examples are constructed using prompt instruction, input (possibly empty), CoT rationale (with explicit answer token), and ground-truth answer.
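For illustration, a single record can be pictured as the following Python dictionary. The field values here are invented for the example; actual task identifiers and text come from the released shards.

```python
# Hypothetical record, for illustration only; real task_ids and text
# come from the released CoT-Collection JSONL shards.
example = {
    "task_id": "arithmetic_word_problem_0421",
    "instruction": "Solve the following word problem.",
    "input": "A shop sells 3 pens for $2. How much do 12 pens cost?",
    "chain_of_thought": (
        "12 pens is 4 groups of 3 pens. Each group costs $2, "
        "so the total is 4 * 2 = 8 dollars."
    ),
    "answer": "$8",
}
```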
2. Annotation and Generation Pipeline
Annotation follows a unified, partly human-curated and largely model-based process:
- Task family grouping: Each task is associated with a demonstration set D_{T_k}, typically 6–8 high-quality rationales hand-written by the authors.
- Rationale generation: OpenAI Codex (code-davinci-002) is prompted in a few-shot in-context style with the "Let's think step by step" phrase, conditioned on the ground-truth label.
- Decoding: Nucleus sampling (top-p=0.8) with no-repeat n-gram enforcement, generating five rationale candidates per example.
- Filtering: Post-processing retains rationales that contain the gold answer, are shorter than 512 tokens, are unique, and show no code degeneration or repeated sentences.
Manual A/B testing and filtering are performed to ensure fluency and informativeness of initial demonstration rationales; all further rationale generation is model-driven according to the described criteria.
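A minimal sketch of the filtering stage is shown below. It assumes rationale candidates are plain strings, uses whitespace tokenization as a rough proxy for the 512-token cap, and applies a simple heuristic for code degeneration; the released filtering code may differ in detail.

```python
import re

MAX_LEN = 512  # token cap from the filtering criteria above

def keep_rationale(rationale: str, gold_answer: str, seen: set) -> bool:
    """Return True if a generated rationale passes the post-processing filters."""
    # 1. Must contain the gold answer string.
    if gold_answer.strip().lower() not in rationale.lower():
        return False
    # 2. Length cap (whitespace tokens as a rough proxy for model tokens).
    if len(rationale.split()) >= MAX_LEN:
        return False
    # 3. Keep only unique rationales per example.
    if rationale in seen:
        return False
    # 4. Reject degenerate outputs: repeated sentences or code-like artifacts.
    sentences = [s.strip() for s in re.split(r"[.!?]+\s+", rationale) if s.strip()]
    if len(sentences) != len(set(sentences)):
        return False
    if "def " in rationale or "import " in rationale:  # heuristic, an assumption
        return False
    seen.add(rationale)
    return True
```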
3. Usage and Model Fine-Tuning Protocol
For supervised fine-tuning, a representative preprocessing example is provided:
```python
from datasets import load_dataset

# Load all JSONL shards of the CoT-Collection.
ds = load_dataset('json', data_files='cot_collection/*.jsonl', split='train')

def preprocess(ex):
    # Prompt = instruction + (possibly empty) input + CoT trigger phrase.
    prompt = ex['instruction'] + '\n' + ex['input'] + "\nLet's think step by step."
    # Target = rationale followed by the explicit answer token and gold answer.
    return {'prompt': prompt,
            'labels': ex['chain_of_thought'] + ' [ANSWER] ' + ex['answer']}

train_ds = ds.map(preprocess, remove_columns=ds.column_names)
```
Fine-tuning uses the following hyperparameters: Flan-T5-3B with AdamW, batch size 64, lr=5e-5, gradient accumulation 8, 1 epoch; Flan-T5-11B with Adafactor, batch size 8, lr=1e-4. Mini-batches should be sampled across sources in proportion to each source's share of the corpus (e.g., FLAN 23.9%, P3 30.9%, SNI 25.5%) so that every batch reflects the same mixture. Efficient few-shot adaptation is possible via LoRA (rank=4).
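A minimal sketch of the LoRA setup for the 3B model using the Hugging Face peft library follows. The rank matches the text (r=4); lora_alpha, dropout, and the targeted attention projections are assumptions for illustration, not values reported in the source.

```python
from transformers import AutoModelForSeq2SeqLM
from peft import LoraConfig, TaskType, get_peft_model

# Base checkpoint: Flan-T5-3B ("xl"); swap in a CoT-tuned checkpoint
# when adapting CoT-T5 for few-shot evaluation.
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-xl")

lora_cfg = LoraConfig(
    task_type=TaskType.SEQ_2_SEQ_LM,
    r=4,                        # rank from the text
    lora_alpha=16,              # assumption
    lora_dropout=0.05,          # assumption
    target_modules=["q", "v"],  # T5 attention projections; an assumption
)

model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()  # reports only a few million trainable params
```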
4. Evaluation, Metrics, and Results
CoT-Collection enables both zero-shot and few-shot evaluations. The primary metric is accuracy, with gains reported relative to the corresponding Flan-T5 baseline:

$$\text{Accuracy} = \frac{\#\text{Correct}}{\#\text{Total}} \times 100\%$$

$$\Delta\text{Accuracy} = \text{Accuracy}_{\text{CoT-tuned}} - \text{Accuracy}_{\text{base}}$$
On BIG-Bench-Hard (27 tasks):
| Model | Direct | CoT Eval | ΔCoT vs Flan |
|---|---|---|---|
| Flan-T5-3B | 37.1% | 34.1% | — |
| CoT-T5-3B (Ours) | 36.2% | 38.4% | +4.34% |
| Flan-T5-11B | 41.0% | 38.6% | — |
| CoT-T5-11B (Ours) | 42.6% | 42.2% | +2.60% |
Few-shot (64-shot, 4 domains; using LoRA):
| Model | #Trainable Params | Avg Acc | Δ vs. Flan |
|---|---|---|---|
| Flan-T5-3B full FT | 2.8B | 61.78% | — |
| CoT-T5-3B + LoRA CoT-FT | 2.35M | 64.02% | +2.24% |
| Flan-T5-11B + LoRA FT | 4.72M | 66.59% | — |
| CoT-T5-11B + LoRA CoT-FT | 4.72M | 68.96% | +2.37% |
| ChatGPT+ICL (64 demos) | — | 54.98% | — |
Models fine-tuned on the CoT-Collection attain better zero-shot generalization and show marked improvements over Flan-trained and ICL-only baselines of equivalent size.
5. Scientific Insights and Best Practices
Key findings:
- Explicit CoT supervision is necessary for teaching models how to decompose and solve complex, multi-step problems; pure in-context learning with CoT prompts is insufficient for smaller models.
- Diversity of task types in training, more so than volume, is essential for robust reasoning transfer; 10,000 diverse CoT examples from 1,060 tasks yield better generalization than 180,000 examples over just nine tasks.
- Positive transfer is observed across task families, with no visible catastrophic forgetting on reused tasks.
- LoRA adaptation enables efficient few-shot CoT learning with minimal parameter overhead.
- Filtering rationales for answer-presence, brevity, and non-degeneration is a practical necessity; code-based filters are included in the release.
6. Limitations and Directions for Extension
CoT-Collection is exclusively English; current multilingual zero-shot performance is near zero for Korean, Chinese, and Japanese. Rationale generation depends on the proprietary Codex model; future iterations may leverage open-source LLMs or multi-path rationale methods such as "Tree of Thoughts". Natural next steps include extending the collection to non-English settings and experimenting with alternative approaches to CoT rationalization and evaluation protocols.
Reliance on human-crafted demonstrations for each task family is a critical design choice; expanding these demonstrations to cover more instruction types, or supporting semi-automatic CoT generation in new domains, would be a natural extension.
7. Availability and Impact
The dataset, code, and model checkpoints are publicly accessible. CoT-Collection is, to date, the largest and most diverse open instruction-tuning resource with CoT supervision for LLM reasoning. It serves as a reference corpus for model pretraining, as a basis for research in reasoning and transfer, and as a framework for future advances in medium- and low-parameter LM chain-of-thought generalization (Kim et al., 2023).