CoT-Collection Dataset Overview
- CoT-Collection is a large-scale, instruction-tuning dataset featuring over 1.84 million examples and 1,060 tasks across 26 diverse families.
- The dataset utilizes a hybrid human-curated and model-driven pipeline, employing Codex for generating chain-of-thought rationales with rigorous filtering.
- It enhances zero-shot and few-shot reasoning in language models through explicit CoT supervision and efficient fine-tuning protocols like LoRA.
The CoT-Collection Dataset is a large-scale instruction-tuning resource designed to equip LLMs with explicit chain-of-thought (CoT) reasoning capabilities across a broad spectrum of tasks. It was introduced to address the inherent limitations of smaller (<100B parameter) LLMs in generating multi-step rationales, enabling these models to approach the zero-shot and few-shot reasoning capabilities previously exclusive to much larger models. The dataset serves both as a pretraining corpus for intermediate rationale supervision and as a benchmark for research on reasoning transfer, task generalization, and template-learning behaviors in natural language processing.
1. Dataset Composition and Structure
The CoT-Collection comprises 1,060 distinct instruction-formatted tasks, covering 1.84 million examples, each annotated with 1–5 CoT rationales. Tasks are distributed among 26 “families,” including multiple-choice QA, extractive QA, arithmetic word problems, commonsense reasoning, natural-language inference, symbolic logic, list-manipulation, dialogue, and code-oriented domains. Unlike the original Flan Collection, which included only nine CoT-augmented tasks, the CoT-Collection offers comprehensive CoT coverage across domains.
Table: Task Family Distribution (Selected)
| Family | #Tasks | #Examples |
|---|---|---|
| Multi-choice QA | 220 | 420K |
| Extractive QA | 180 | 300K |
| Arithmetic Reasoning | 105 | 200K |
| Commonsense (SNI) | 150 | 250K |
| NLI (FLAN) | 80 | 150K |
| Logic & Symbolic | 75 | 120K |
| Dialogue/Code/Others | 250 | 400K |
Each dataset example is a JSONL record with fields: {task_id, instruction, input, chain_of_thought, answer} stored in ten ≈184K-instance shards. Examples are constructed using prompt instruction, input (possibly empty), CoT rationale (with explicit answer token), and ground-truth answer.
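For illustration, a single record can be pictured as the following Python dictionary. The field values here are invented for the example; actual task identifiers and text come from the released shards.

```python
# Hypothetical record, for illustration only; real task_ids and text
# come from the released CoT-Collection JSONL shards.
example = {
    "task_id": "arithmetic_word_problem_0421",
    "instruction": "Solve the following word problem.",
    "input": "A shop sells 3 pens for $2. How much do 12 pens cost?",
    "chain_of_thought": (
        "12 pens is 4 groups of 3 pens. Each group costs $2, "
        "so the total is 4 * 2 = 8 dollars."
    ),
    "answer": "$8",
}
```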
2. Annotation and Generation Pipeline
Annotation follows a unified, partly human-curated and largely model-based process:
- Task family grouping: Each task is associated with a demonstration set D_{T_k}, typically 6–8 high-quality rationales hand-written by the authors.
- Rationale generation: OpenAI Codex (code-davinci-002) is prompted in a few-shot in-context style with the "Let's think step by step" phrase, conditioned on the ground-truth label.
- Decoding: Nucleus sampling (top-p=0.8) with no-repeat n-gram enforcement, generating five rationale candidates per example.
- Filtering: Post-processing retains rationales that contain the gold answer, are shorter than 512 tokens, are unique, and show no code degeneration or repeated sentences.
Manual A/B testing and filtering are performed to ensure fluency and informativeness of initial demonstration rationales; all further rationale generation is model-driven according to the described criteria.
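A minimal sketch of the filtering stage is shown below. It assumes rationale candidates are plain strings, uses whitespace tokenization as a rough proxy for the 512-token cap, and applies a simple heuristic for code degeneration; the released filtering code may differ in detail.

```python
import re

MAX_LEN = 512  # token cap from the filtering criteria above

def keep_rationale(rationale: str, gold_answer: str, seen: set) -> bool:
    """Return True if a generated rationale passes the post-processing filters."""
    # 1. Must contain the gold answer string.
    if gold_answer.strip().lower() not in rationale.lower():
        return False
    # 2. Length cap (whitespace tokens as a rough proxy for model tokens).
    if len(rationale.split()) >= MAX_LEN:
        return False
    # 3. Keep only unique rationales per example.
    if rationale in seen:
        return False
    # 4. Reject degenerate outputs: repeated sentences or code-like artifacts.
    sentences = [s.strip() for s in re.split(r"[.!?]+\s+", rationale) if s.strip()]
    if len(sentences) != len(set(sentences)):
        return False
    if "def " in rationale or "import " in rationale:  # heuristic, an assumption
        return False
    seen.add(rationale)
    return True
```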
3. Usage and Model Fine-Tuning Protocol
For supervised fine-tuning, a representative preprocessing example is provided:
```python
from datasets import load_dataset

# Load all JSONL shards of the CoT-Collection.
ds = load_dataset('json', data_files='cot_collection/*.jsonl', split='train')

def preprocess(ex):
    # Prompt = instruction + (possibly empty) input + CoT trigger phrase.
    prompt = ex['instruction'] + '\n' + ex['input'] + "\nLet's think step by step."
    # Target = rationale followed by the explicit answer token and gold answer.
    return {'prompt': prompt,
            'labels': ex['chain_of_thought'] + ' [ANSWER] ' + ex['answer']}

train_ds = ds.map(preprocess, remove_columns=ds.column_names)
```
Fine-tuning uses the following hyperparameters: Flan-T5-3B with AdamW, batch size 64, lr=5e-5, gradient accumulation 8, 1 epoch; Flan-T5-11B with Adafactor, batch size 8, lr=1e-4. Mini-batches should be sampled across sources in proportion to each source's share of the corpus (e.g., FLAN 23.9%, P3 30.9%, SNI 25.5%) so that every batch reflects the same mixture. Efficient few-shot adaptation is possible via LoRA (rank=4).
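A minimal sketch of the LoRA setup for the 3B model using the Hugging Face peft library follows. The rank matches the text (r=4); lora_alpha, dropout, and the targeted attention projections are assumptions for illustration, not values reported in the source.

```python
from transformers import AutoModelForSeq2SeqLM
from peft import LoraConfig, TaskType, get_peft_model

# Base checkpoint: Flan-T5-3B ("xl"); swap in a CoT-tuned checkpoint
# when adapting CoT-T5 for few-shot evaluation.
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-xl")

lora_cfg = LoraConfig(
    task_type=TaskType.SEQ_2_SEQ_LM,
    r=4,                        # rank from the text
    lora_alpha=16,              # assumption
    lora_dropout=0.05,          # assumption
    target_modules=["q", "v"],  # T5 attention projections; an assumption
)

model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()  # reports only a few million trainable params
```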
4. Evaluation, Metrics, and Results
CoT-Collection enables both zero-shot and few-shot evaluations. The primary metric is accuracy, with gains reported relative to the corresponding Flan-T5 baseline:

$$\text{Accuracy} = \frac{\#\text{Correct}}{\#\text{Total}} \times 100\%$$

$$\Delta\text{Accuracy} = \text{Accuracy}_{\text{CoT-tuned}} - \text{Accuracy}_{\text{base}}$$
On BIG-Bench-Hard (27 tasks):
| Model | Direct | CoT Eval | ΔCoT vs Flan |
|---|---|---|---|
| Flan-T5-3B | 37.1% | 34.1% | — |
| CoT-T5-3B (Ours) | 36.2% | 38.4% | +4.34% |
| Flan-T5-11B | 41.0% | 38.6% | — |
| CoT-T5-11B (Ours) | 42.6% | 42.2% | +2.60% |
Few-shot (64-shot, 4 domains; using LoRA):
| Model | #Trainable Params | Avg Acc | Δ vs. Flan |
|---|---|---|---|
| Flan-T5-3B full FT | 2.8B | 61.78% | — |
| CoT-T5-3B + LoRA CoT-FT | 2.35M | 64.02% | +2.24% |
| Flan-T5-11B + LoRA FT | 4.72M | 66.59% | — |
| CoT-T5-11B + LoRA CoT-FT | 4.72M | 68.96% | +2.37% |
| ChatGPT+ICL (64 demos) | — | 54.98% | — |
Models fine-tuned on the CoT-Collection attain better zero-shot generalization and show marked improvements over Flan-trained and ICL-only baselines of equivalent size.
5. Scientific Insights and Best Practices
Key findings:
- Explicit CoT supervision is necessary for teaching models how to decompose and solve complex, multi-step problems; pure in-context learning with CoT prompts is insufficient for smaller models.
- Diversity of task types in training, more so than volume, is essential for robust reasoning transfer; 10,000 diverse CoT examples from 1,060 tasks yield better generalization than 180,000 examples over just nine tasks.
- Positive transfer is observed across task families, with no visible catastrophic forgetting on reused tasks.
- LoRA adaptation enables efficient few-shot CoT learning with minimal parameter overhead.
- Filtering rationales for answer-presence, brevity, and non-degeneration is a practical necessity; code-based filters are included in the release.
6. Limitations and Directions for Extension
CoT-Collection is exclusively English; current multilingual zero-shot performance is near zero for Korean, Chinese, and Japanese. Rationale generation depends on the proprietary Codex model; future iterations may leverage open-source LLMs or multi-path rationale methods such as "Tree of Thoughts". Natural next steps include extending the collection to non-English settings and experimenting with alternative approaches to CoT rationalization and evaluation protocols.
Reliance on human-crafted demonstrations for each task family is a critical design choice; expanding these demonstrations to cover more instruction types, or supporting semi-automatic CoT generation in new domains, would be a natural extension.
7. Availability and Impact
The dataset, code, and model checkpoints are publicly accessible. CoT-Collection is, to date, the largest and most diverse open instruction-tuning resource with CoT supervision for LLM reasoning. It serves as a reference corpus for model pretraining, as a basis for research in reasoning and transfer, and as a framework for future advances in medium- and low-parameter LM chain-of-thought generalization (Kim et al., 2023).