Caco-1.3M Dataset Overview
- Caco-1.3M is a large-scale chain-of-thought dataset comprising over one million high-quality instruction–reasoning–code triples for LLM training.
- It is built by a fully automated pipeline combining code normalization, CodeGen fine-tuning, model sampling, and rigorous execution- and AST-based filtering.
- Models fine-tuned on it show notable gains on benchmarks such as MATH and GSM8K, improving both reasoning accuracy and reliability.
The Caco-1.3M dataset is a large-scale, automatically generated corpus of code-assisted chain-of-thought (CoT) and instruction data, designed to improve the reasoning capabilities of LLMs. Synthesized through a fully automated and verifiable pipeline, Caco-1.3M provides over one million high-quality, structurally diverse instruction–reasoning–code triples suitable for training and evaluating LLMs on mathematical and algorithmic tasks.
1. Data Genesis and Synthesis Workflow
The Caco-1.3M dataset is produced through an automated closed-loop pipeline that begins with the collection and normalization of high-quality code-based chain-of-thought (Code CoT) demonstrations. Source problems span mathematical and algorithmic domains, with initial samples drawn from datasets such as MATH, DeepScaleR, and BigMath. Each problem and solution is converted into a standardized Python template, imposing a uniform structure—typically involving the definition of an input dictionary, invocation of a solution function, and explicit output printing.
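A minimal sketch of the normalized template described above, with hypothetical field and function names (the actual Caco templates may differ in detail):

```python
# Sketch of a unified Code CoT: an input dictionary, a solution function,
# and explicit output printing, mirroring the normalization scheme above.
inputs = {"a": 3, "b": 4}  # problem-specific parameters

def solution(a: int, b: int) -> int:
    # Step-by-step reasoning expressed as executable code.
    hypotenuse_sq = a ** 2 + b ** 2
    return int(hypotenuse_sq ** 0.5)

print(solution(**inputs))  # printed output enables execution-based checking
```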
Following normalization, a dedicated CodeGen model is fine-tuned on these unified Code CoTs. The fine-tuned model is then employed in a generative sampling mode, producing millions of candidate reasoning traces in executable code form. Every candidate is subjected to rigorous automated filtering (a minimal sketch of the execution check follows this list), including:
- Execution-based filtering: Ensures syntactic validity, runtime boundedness, and output correctness by executing the code within time limits and confirming the produced result.
- AST-based semantic validation: Confirms the use of declared input variables and logical consistency in code structure.
- Rule-based checks: Applies heuristic or domain-specific rules to enforce quality and variation.
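The execution check can be sketched as follows; the timeout value and output normalization are illustrative assumptions rather than the pipeline's exact settings:

```python
# Hedged sketch of execution-based filtering: run a candidate Code CoT in a
# subprocess under a time limit and compare its printed output to the
# expected answer.
import subprocess
import sys

def passes_execution_filter(code: str, expected: str, timeout_s: float = 5.0) -> bool:
    try:
        result = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True, text=True, timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return False  # runtime boundedness: non-terminating code is rejected
    if result.returncode != 0:
        return False  # syntax or runtime errors are rejected
    return result.stdout.strip() == expected.strip()  # output correctness
```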
Only code samples passing all filters proceed to the next stage, where they are reverse-engineered into natural language problem statements paired with step-by-step language CoTs. The language solution's final answer is required to match the result of executing the code, ensuring logical alignment between modalities. Multiple natural language variants of the same logical template are also generated to introduce instruction diversity.
2. Framework Architecture and Methodological Principles
At the core of the Caco-1.3M framework, termed Caco (Code-Assisted Chain-of-Thought), is an unconditional CodeGen model trained via maximum likelihood estimation over the collected Code CoT corpus. The training objective is given as:

$$\mathcal{L}(\theta) = -\sum_{c \in \mathcal{D}_{\text{seed}}} \sum_{t=1}^{|c|} \log p_{\theta}(c_t \mid c_{<t})$$

where $\mathcal{D}_{\text{seed}}$ denotes the set of unified seed code samples and $c_t$ is the $t$-th token in a code sequence $c$.
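This objective is the standard causal language-modeling loss; a minimal PyTorch sketch, assuming an autoregressive model that returns logits (names are illustrative):

```python
# Minimal sketch of the next-token maximum-likelihood objective over a
# batch of tokenized Code CoTs.
import torch.nn.functional as F

def code_cot_mle_loss(model, input_ids):
    """input_ids: (batch, seq_len) token IDs of unified seed code samples."""
    logits = model(input_ids).logits              # (batch, seq_len, vocab)
    # Shift so that position t predicts token t+1, as in causal LM training.
    shift_logits = logits[:, :-1, :].contiguous()
    shift_labels = input_ids[:, 1:].contiguous()
    return F.cross_entropy(
        shift_logits.view(-1, shift_logits.size(-1)),
        shift_labels.view(-1),
    )
```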
After training, model samples are collected with temperature-controlled decoding to encourage structural and procedural diversity. Each sample is validated using the aforementioned automated checks. The resultant filtered set comprises approximately 4.6 million verified Code CoTs.
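Temperature-controlled sampling of this kind can be sketched with the Hugging Face transformers API; the checkpoint name and decoding parameters below are assumptions, not the paper's reported configuration:

```python
# Hedged sketch of temperature-controlled sampling from the fine-tuned
# CodeGen model.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "caco-codegen"  # hypothetical checkpoint name
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Unconditional generation can start from the BOS token alone.
inputs = tokenizer(tokenizer.bos_token, return_tensors="pt")
outputs = model.generate(
    **inputs,
    do_sample=True,          # stochastic decoding
    temperature=0.8,         # temperature > 0 promotes structural diversity
    top_p=0.95,
    max_new_tokens=512,
    num_return_sequences=4,  # several candidate Code CoTs per draw
)
candidates = [tokenizer.decode(o, skip_special_tokens=True) for o in outputs]
```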
The process then employs a two-stage back-translation:
- Stage one: Representative input–output pairs from the code snippet are used to synthesize a corresponding natural language instruction.
- Stage two: The new instruction undergoes solution generation, yielding a natural language chain-of-thought. Correctness is enforced by comparing the inferred language answer $\hat{a}$ to the code execution result $a^{*} = \mathrm{exec}(c)$, requiring consistency $\hat{a} = a^{*}$.
The formal definition of the dataset is:

$$\mathcal{D}_{\text{Caco}} = \left\{ (q, r, c) \;\middle|\; (q, r) \sim \mathcal{M}(\cdot \mid c),\ \mathrm{ans}(r) = \mathrm{exec}(c) \right\}$$

where $\mathcal{M}$ denotes the LLM used for translation and answer synthesis, $q$ the synthesized instruction, $r$ the language chain-of-thought, and $c$ a verified Code CoT.
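In code, the consistency constraint $\mathrm{ans}(r) = \mathrm{exec}(c)$ reduces to an answer comparison; the answer-extraction heuristic below is illustrative:

```python
# Hedged sketch of the cross-modal consistency check: the final answer in
# the language CoT must match the code's executed output.
def answers_consistent(language_cot: str, code_output: str) -> bool:
    # Assume the language CoT ends with a line such as "Answer: 5".
    last_line = language_cot.strip().splitlines()[-1]
    language_answer = last_line.split(":")[-1].strip()
    return language_answer == code_output.strip()
```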
3. Structural Correctness and Data Diversity
Logical correctness in Caco-1.3M is strictly enforced via program execution: only code traces whose output matches the expected answer proceed to the dataset. This design minimizes the accumulation of spurious or illogical reasoning paths prevalent in purely language-based datasets.
Structural and instruction diversity are achieved through:
- Problem-level augmentation: Rephrasing single logical templates into multiple unique natural language instructions (a sketch follows this list).
- Pattern-level augmentation: Sampling the CodeGen model to explore novel reasoning strategies, decompositions, and solution approaches not explicitly present in the original seeds.
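Problem-level augmentation can be sketched as repeated LLM rephrasing; `llm` below is a hypothetical text-generation callable, not a specific library API:

```python
# Hedged sketch of problem-level augmentation: one logical template is
# rephrased into several distinct natural-language instructions.
def rephrase_instruction(llm, instruction: str, n_variants: int = 3) -> list[str]:
    prompt = (
        "Rewrite the following math problem in a different surface form "
        "while preserving its exact logical content:\n" + instruction
    )
    return [llm(prompt) for _ in range(n_variants)]
```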
Table: Verification and Diversity Mechanisms in Caco-1.3M

| Mechanism | Purpose | Stage |
|---|---|---|
| Code execution | Guarantees logical soundness | Post-generation |
| AST-based validation | Enforces semantic integrity | Filtering |
| Language rephrasing | Enhances instruction variety | Reverse-engineering |
| Model sampling (temp > 0) | Adds structural diversity | Code generation |
This combination of verifiable correctness and structural/instructional diversity underpins the dataset's adaptability and reliability for downstream LLM training.
4. Empirical Evaluation and Benchmark Performance
Evaluation of models fine-tuned with Caco-1.3M demonstrates competitive and, in several cases, superior performance across standard mathematical reasoning benchmarks in both zero-shot and direct evaluation settings. Notable results include:
- On the MATH and GSM8K benchmarks, Caco-trained models achieve significant gains compared to baseline models trained solely on the 109K seed samples.
- For example, a LLaMA3-8B model's average accuracy increases from 46.7% to 57.3% after fine-tuning on Caco-1.3M.
- The Qwen2.5-Math-7B model reaches an average accuracy of 67.7%.
- Pass@1 accuracy improvements are reported across benchmarks including MATH, GSM8K, CollegeMath, DeepMind-Mathematics, OlympiadBench-Math, and TheoremQA.
Performance on OlympiadBench and TheoremQA indicates that the dataset raises model competence on complex, Olympiad-level problems, not merely on routine question types.
5. Generalization, Trustworthiness, and Application Scope
The signature feature of Caco-1.3M is code-anchored verifiability: every chain-of-thought is executable and verified for output consistency, reducing the risk of propagation of faulty reasoning typical in language-only corpora. This property, combined with systematic augmentation of instructions and solution strategies, yields improved cross-domain generalization.
Empirical evidence suggests enhanced performance not only on mathematical benchmarks but also in scientific QA, logic puzzles, and programming tasks. Caco-1.3M thus provides a basis for building self-sustaining reasoning systems that enforce correctness guarantees with minimal human intervention. Typical application domains include automated tutoring systems, AI assistants for STEM education, and reinforcement learning agents relying on verifiable computational rewards.
6. Technical Implementation and Quality Assurance
The implementation of Caco-1.3M leverages unified Python-based templates, explicitly defining input dictionaries, solution functions, and output statements, to ensure homogeneity and direct executability across all generated samples. Quality assurance is achieved through multiple filtering criteria (a sketch of the static checks follows this list):
- Runtime limits on code execution to avoid pathological or non-terminating solutions.
- Minimum code length thresholds to enforce sufficient reasoning complexity.
- AST-based parsing for semantic checks.
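The static checks can be sketched as follows; the minimum-length threshold and the `inputs` variable convention are illustrative assumptions:

```python
# Hedged sketch of rule- and AST-based quality checks: the candidate must
# exceed a minimum length, parse cleanly, and actually use its declared
# input dictionary.
import ast

def passes_static_checks(code: str, min_lines: int = 5) -> bool:
    if len(code.strip().splitlines()) < min_lines:
        return False  # enforce sufficient reasoning complexity
    try:
        tree = ast.parse(code)  # syntactic validity via AST parsing
    except SyntaxError:
        return False
    names = {node.id for node in ast.walk(tree) if isinstance(node, ast.Name)}
    return "inputs" in names  # declared input variables must be used
```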
The pipeline's closed-loop nature (generation, verification, reverse translation) ensures that each instruction and its reasoning chain remain aligned, correct, and adaptable.
7. Significance in Automated Reasoning Research
Caco-1.3M exemplifies a scalable, code-anchored paradigm for reasoning dataset synthesis, facilitating the creation of trustworthy and robust LLMs. By combining rigorous execution-based validation with structured augmentation and reverse translation, it enables high-fidelity instruction–CoT–code data at unprecedented scale, providing strong empirical improvements across a range of mathematical and logical reasoning tasks. The dataset sets a methodological precedent for future research on autonomous LLM reasoning and verifiable dataset construction, and serves as a blueprint for self-sustaining, automated systems in the broader domain of trustworthy artificial intelligence.