Thinking Augmented Pre-training (2509.20186v1)
Abstract: This paper introduces a simple and scalable approach to improve the data efficiency of LLM training by augmenting existing text data with thinking trajectories. The compute for pre-training LLMs has been growing at an unprecedented rate, while the availability of high-quality data remains limited. Consequently, maximizing the utility of available data constitutes a significant research challenge. A primary impediment is that certain high-quality tokens are difficult to learn given a fixed model capacity, as the underlying rationale for a single token can be exceptionally complex and deep. To address this issue, we propose Thinking augmented Pre-Training (TPT), a universal methodology that augments text with automatically generated thinking trajectories. Such augmentation effectively increases the volume of the training data and makes high-quality tokens more learnable through step-by-step reasoning and decomposition. We apply TPT across diverse training configurations up to $100$B tokens, encompassing pre-training with both constrained and abundant data, as well as mid-training from strong open-source checkpoints. Experimental results indicate that our method substantially improves the performance of LLMs across various model sizes and families. Notably, TPT enhances the data efficiency of LLM pre-training by a factor of $3$. For a $3$B parameter model, it improves the post-training performance by over $10\%$ on several challenging reasoning benchmarks.
Explain it Like I'm 14
A simple explanation of “Thinking Augmented Pre-training (TPT)”
1. What is this paper about?
This paper is about teaching AI language models (LLMs) to learn better from the data they already have. The authors add “thinking steps” (like showing your work in math) to the text the model reads during training. This helps the model understand hard ideas more easily and become smarter without needing tons more data.
2. What were the main questions?
The researchers asked:
- Can we make LLMs learn more from the same data by adding step-by-step explanations?
- Will this help models handle difficult problems (like math and reasoning) better?
- Does this method still work when there isn’t much high-quality data?
- Can we improve existing models by giving them this kind of extra “thinking practice”?
3. How did they do it? (Methods explained simply)
First, a few quick translations:
- Pre-training: Like a model reading a huge library to learn general patterns in language.
- Token: A small chunk of text (like a word or piece of a word) the model reads or predicts.
- Thinking trajectory: A step-by-step explanation that shows how to reach a conclusion (like a worked-out solution in a textbook).
- Mid-training: Giving an already trained model extra practice on specific kinds of data.
- Supervised fine-tuning (SFT): Polishing a model to follow instructions and answer in a helpful way.
What they did:
- They took normal text from the web and other sources.
- They used an existing open-source AI model to write an “expert’s thought process” about that text. This is the thinking trajectory.
- They glued the original text and the thinking together into one long training sample.
- They trained new models to predict the next token on this combined text. In everyday terms: the model doesn’t just see the final answer; it also sees the steps that led there, so it can learn how to think, not just what to say.
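To make this concrete, here is a minimal sketch of the augmentation step in Python. It is not the authors' exact pipeline: the generator checkpoint, decoding settings, and most of the prompt wording are illustrative assumptions (only the Feynman-technique instruction is quoted from the paper's prompt).

```python
# Illustrative sketch of thinking-augmented data construction.
# Assumptions: a Hugging Face text-generation pipeline and a DeepSeek-R1-Distill
# generator; the prompt is paraphrased, not the paper's exact prompt.
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B",  # assumed open generator
    device_map="auto",
)

THINKING_PROMPT = (
    "Here is a document:\n\n{document}\n\n"
    "Analyze it step by step as an expert would, explaining the reasoning "
    "behind its key claims. Use the Feynman technique whenever possible "
    "to ensure a deep understanding."
)

def augment_document(document: str, max_new_tokens: int = 2048) -> str:
    """Return one training sample: the document followed by a generated thinking trajectory."""
    prompt = THINKING_PROMPT.format(document=document)
    out = generator(prompt, max_new_tokens=max_new_tokens, do_sample=True,
                    temperature=0.7, return_full_text=False)
    thinking = out[0]["generated_text"]
    # Concatenate document + thinking; the new model is then trained on this
    # combined text with the ordinary next-token prediction objective.
    return document + "\n\n" + thinking
```

The key point is that the “thinking” is produced offline by an existing model and simply appended, so no human labeling is needed.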
Why this helps:
- Hard answers often hide long reasoning chains. If the model only sees the final answer (like “890”), it may not learn the logic behind it. Showing the steps makes learning easier.
- Longer thinking for tougher text naturally gives more training effort to the most valuable, difficult parts—similar to how you spend more time on hard homework problems.
Where they tested it:
- Abundant data: Lots of training text (100 billion tokens).
- Constrained data: Limited unique text (simulating running out of good web data).
- Mid-training: Start with popular open models (like Qwen and LLaMA), give them TPT practice, then do SFT for helpful chat behavior.
4. What did they find, and why is it important?
Main results (in plain language):
- Much better data efficiency: With TPT, models got roughly 3× more value out of the same training data, which means similar or better results with less data and compute.
- Stronger reasoning: An 8B-parameter model trained with TPT on 100B tokens did far better than the same model trained normally, especially on math and logic. For example, on the GSM8K math benchmark, scores jumped from about 19% to about 50%; on MATH, from about 9% to about 22%.
- After a short polishing step (SFT), the TPT-trained models did extremely well on tough tests like AIME, GPQA, and MMLU-Pro, often beating strong open baselines.
- Works when data is limited: When the amount of unique text was capped, normal training leveled off, but TPT kept improving because the added thinking steps extracted more value from the same sources.
- Upgrades existing models: Adding TPT mid-training to models like Qwen2.5 and LLaMA-3 improved performance across 10 challenging benchmarks (math, code, and general reasoning). For example, one 3B model’s AIME24 score roughly tripled after TPT mid-training.
- Auto-focus on valuable material: The generated thinking was naturally longer in math and science and for harder texts, meaning the training spent more time on the most important parts—without hand-tuning.
- Simple and scalable: No human labeling is needed; a standard prompt generates the thinking steps. Surprisingly, even a smaller model generating the thinking sometimes worked best.
Why this matters:
- The internet’s best text is limited. Getting more learning out of it is crucial for future AI progress.
- Better reasoning means AI can handle complex problems more reliably, not just memorize answers.
5. What’s the bigger picture?
This work suggests a practical path to smarter, more thoughtful AI without endlessly collecting more data. By adding clear, step-by-step “teacher notes” to training text:
- Companies and researchers can train useful models with fewer resources.
- Smaller models can reach stronger reasoning skills.
- Future systems could be better at math, coding, and careful problem-solving.
- This method can combine with other data-cleaning and rewriting tricks to push performance even further.
In short: Teaching AI to “show its work” during training helps it learn how to think, not just what to say—and that makes it both cheaper and better.
Knowledge Gaps
Knowledge gaps, limitations, and open questions
Below is a single, consolidated list of what remains missing, uncertain, or unexplored in the paper, phrased to be actionable for future research.
- Quantify end-to-end costs: Report the compute, time, and storage required to generate thinking trajectories at 10–100B-token scale, and compare total training+generation cost vs vanilla and alternative data-engineering baselines.
- Data contamination checks: Establish rigorous benchmark leakage analysis (e.g., GSM8k/MATH/GPQA/MMLU-Pro/LCB) against both raw and augmented corpora; document deduplication policies and contamination detection methodology.
- Correctness and faithfulness of “thinking”: Measure the factual/step-level correctness of generated trajectories (especially for math/code) and study how incorrect or spurious steps affect downstream performance; integrate automatic verifiers and filtering.
- Mechanism disentanglement: Determine whether gains come from longer sequences, domain upsampling, or structured reasoning by controlled experiments (e.g., same-length paraphrase augmentation, chunked explanations, shuffled/garbled “thinking,” length-matched non-thinking augmentations).
- Prompt sensitivity: Systematically evaluate alternative prompts (domain-specific, tag-scaffolded, multi-turn, self-ask) and automated prompt optimization to understand sensitivity and reliability across domains.
- Generator choice and scaling: Expand beyond 1.5B and 7B generators to a systematic sweep of sizes, families, training styles (RL vs SFT), ensembles, and domain-specialized generators; identify which properties of the generator best predict downstream gains.
- Quality control pipeline: Define and validate filtering metrics for thinking trajectories (e.g., length, perplexity, factuality, solver-backed correctness, redundancy, entropy) and their impact on performance and efficiency.
- Adaptive compute allocation: Move beyond “natural” length allocation to controlled budgets—design policies that modulate thinking length or sample weighting by estimated difficulty/utility, and measure trade-offs.
- Objective design: Compare next-token prediction on doc+think vs alternatives (e.g., hierarchical losses, segment-aware objectives, prefix-LM on doc with separate decoder for “think,” contrastive or verifier-informed losses).
- Curriculum and mixing ratios: Explore schedules (doc-only → think-only → mixed), mixing ratios over training (and mid-training), and dynamic sampling strategies that gate when/how much “thinking” to include.
- Inference-time behavior: Assess whether TPT reduces the need for long chain-of-thought at inference (latency/throughput gains) or increases the tendency to produce unnecessary “thinking”; measure accuracy vs generated tokens with test-time scaling curves.
- Robustness and calibration: Evaluate effects on hallucination rates, confidence calibration, and robustness to adversarial/noisy inputs—especially when thinking trajectories are wrong or ambiguous.
- Domain coverage: Extend evaluation beyond math/code/general QA to knowledge-heavy, multimodal, retrieval-augmented, biomedical/legal, symbolic reasoning, and planning tasks; include multilingual settings and cross-lingual transfer.
- Non-reasoning tasks: Measure impact on summarization, translation, factual QA, and instruction-following where verbose “thinking” may be harmful; test whether TPT degrades brevity or adherence to format constraints.
- Safety and ethics: Audit whether thinking augmentation amplifies harmful content or unsafe step-by-step instructions; design and quantify safety filters for generated trajectories.
- Format drift and user experience: Determine if models trained on doc+think default to emitting “thinking” when not desired; quantify format adherence and controllability via instruction-tuning or tags.
- Sequence length constraints: Analyze how longer augmented samples interact with context window limits, memory footprint, throughput, optimizer stability, and gradient noise—especially at larger scales.
- Scaling beyond 100B tokens and 8B models: Test TPT at trillion-token budgets and frontier parameter counts; identify scaling laws and diminishing returns.
- Fair baseline comparisons: Compare against strong data-engineering alternatives (e.g., rephrasing/rewriting, active reading, BoLT, RPT) under matched compute and data distributions; include ablations combining TPT with these methods.
- Distributional confounds: Control for domain balance changes introduced by augmentation (e.g., math upsampling) to isolate the effect of “thinking” from data mixture differences.
- Statistical rigor: Provide variance estimates, confidence intervals, and multiple seeds across tasks to substantiate claims such as “3× data efficiency.”
- Code and data release: Publish augmentation pipelines, prompts, and subsets to enable reproducibility, and document licensing/usage restrictions for synthetic trajectories.
- Verification-integrated training: Evaluate training that incorporates solver/verifier feedback to flag and fix faulty steps during learning (e.g., proof-checkers, unit tests for code) and measure gains over pure next-token loss.
- Synergy with RL and post-training: Study combinations with RL for reasoning (e.g., R1/o1-style methods), verifiers, and debate/self-consistency; characterize complementarity and best integration patterns.
- Long-term generalization and compositionality: Test whether TPT improves compositional generalization (novel combinations of skills) and cross-task transfer, not just benchmark-specific scores.
- Economic framing of “data efficiency”: Redefine efficiency to include augmentation generation cost, training tokens, inference tokens, and wall-clock time; report holistic cost-to-quality metrics.
Practical Applications
Practical Applications of “Thinking Augmented Pre-training (TPT)”
Below are actionable, real-world applications derived from the paper’s findings and method, grouped by deployment horizon. Each item lists sectors, the kinds of tools/products/workflows likely to emerge, and key assumptions or dependencies that affect feasibility.
Immediate Applications
- Model pre-training data-efficiency upgrade (AI/Software)
- What: Integrate TPT into existing pre-training pipelines to achieve up to ~3× better data efficiency, especially for reasoning-intensive domains (math, code, scientific text).
- Tools/products/workflows: a “Thinking Augmentation Service” that batches raw documents, generates trajectories with an open LLM, and concatenates and packs them into 8k-token samples (see the packing sketch after this list); plug-ins for existing data curation stacks (e.g., dedup, filtering, domain balancing).
- Assumptions/dependencies: Availability of capable, licensed open LLMs for trajectory generation (e.g., DS-Distill-Qwen variants); compute budget for offline generation; robust dedup/safety filters to manage synthetic content; long-context training config.
- Continual pre-training (mid-training) to upgrade existing checkpoints (Industry, Startups, Academia)
- What: Apply TPT as a mid-training stage to open-source checkpoints (1.5B–7B) before SFT, yielding large gains on math/code/general benchmarks.
- Tools/products/workflows: “Reasoning Boost” mid-training recipes; templated configs for Qwen/LLaMA families; evaluation harnesses for GSM8k/MATH/HumanEval/MMLU-Pro.
- Assumptions/dependencies: Access to curated corpora (e.g., FineWeb-Edu, MegaMath); SFT dataset (e.g., Mixture-of-Thoughts) for alignment; training infra for 40–100B token mid-training.
- Domain-aware up-sampling via trajectory-length signals (Data Engineering)
- What: Use the observed correlation between thinking length and difficulty/value to reweight sampling (more compute to longer-trajectory samples, e.g., math/physics).
- Tools/products/workflows: Data schedulers that prioritize samples by generated trajectory length (see the reweighting sketch after this list); dashboards monitoring “compute per valuable token.”
- Assumptions/dependencies: Correlation holds across new domains; trajectory generation quality remains stable; safeguards against gaming via artificially lengthened trajectories.
- Math and code assistant enhancements (Software, EdTech)
- What: Build assistants trained with TPT-augmented corpora to improve step-by-step reasoning, code synthesis, and debugging.
- Tools/products/workflows: IDE copilots with improved reasoning traces; code-review bots that explain fixes; math solvers for STEM coursework.
- Assumptions/dependencies: High-quality math/code corpora; inference configs that allow enough “thinking space” (longer max tokens); evaluation gates for correctness.
- Enterprise knowledge base “rationale layer” (Cross-industry)
- What: Augment internal documents with automatically generated expert-style analyses to improve model learnability and retrieval grounding.
- Tools/products/workflows: Document ingestors that append thinking sections; RAG pipelines that retrieve both source and rationale; policy controls to use thinking for training but suppress in user-facing outputs if required.
- Assumptions/dependencies: Data privacy and access controls; hallucination mitigation; compatibility with enterprise compliance policies.
- Education: step-by-step study note generation (Education/Publishing)
- What: Convert textbooks, lecture notes, and problem sets into explanation-rich materials using TPT prompts (Feynman-style).
- Tools/products/workflows: LMS plug-ins that attach “thinking trajectories” to lessons; auto-generated worked examples; study companions for AIME/MATH/GSM8k-style problems.
- Assumptions/dependencies: Copyright/licensing for source materials; human-in-the-loop review for accuracy; age-appropriate style control.
- Benchmarking and data curation improvements (Research/MLOps)
- What: Use trajectory length and reasoning-intensity tags as light-weight quality signals for dataset selection and curriculum design.
- Tools/products/workflows: Length-based filters; difficulty-aware curricula; A/B testing frameworks to correlate length distribution with downstream scores.
- Assumptions/dependencies: Avoid overfitting to “length as a proxy”; maintain domain diversity; reproducible metadata pipelines.
- Cost-optimized scaling for constrained-data settings (Model Developers)
- What: Replace multi-epoch replay of limited corpora with one-pass TPT augmentation that continues to yield improvements as training progresses.
- Tools/products/workflows: Token budgeting tools estimating effective “compute per unique raw token”; replay vs. TPT trade-off analyzers.
- Assumptions/dependencies: Accurate deduplication; careful mixture balancing to avoid domain drift; monitoring for synthetic style dominance.
- Safety red-teaming and explainability data generation (Safety/Policy/Trust)
- What: Generate “thinking” around risky or ambiguous cases to surface failure modes and to train verifiers/critics.
- Tools/products/workflows: Red-team datasets with rationales; verifier models trained on thinking-augmented negatives; audit logging of reasoning during eval.
- Assumptions/dependencies: Strict safety filters; human oversight; clear policies on storing and handling sensitive synthetic rationales.
- Lightweight trajectory generators for data-scale production (Engineering)
- What: Use smaller models (e.g., 1.5B) for thinking generation where shown effective, reducing cost without sacrificing training gains.
- Tools/products/workflows: Auto-selector that chooses generator size per domain; batch inference pipelines; caching for repeated contexts.
- Assumptions/dependencies: Validated quality of smaller generators; monitoring to catch low-quality or repetitive trajectories; guard against distribution shift.
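Below is a minimal sketch of the 8k-token sample packing mentioned in the data-efficiency item above. The tokenizer checkpoint and the EOS-based boundary marking are assumptions, not the paper's exact implementation.

```python
# Illustrative packing of thinking-augmented documents into fixed-length
# training samples (8k tokens, matching the paper's setup).
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B")  # assumed tokenizer

def pack_samples(augmented_docs, seq_len=8192):
    """Greedily concatenate tokenized documents into seq_len-token samples."""
    buffer, samples = [], []
    for doc in augmented_docs:
        buffer.extend(tokenizer.encode(doc) + [tokenizer.eos_token_id])  # mark boundary
        while len(buffer) >= seq_len:
            samples.append(buffer[:seq_len])
            buffer = buffer[seq_len:]
    return samples  # any leftover tokens in `buffer` are dropped or carried forward
```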
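And a minimal sketch of the trajectory-length reweighting described in the domain-aware up-sampling item above. The proportional-to-length policy is an illustrative choice, not a recipe from the paper.

```python
# Illustrative up-sampling policy: draw training samples with probability
# proportional to the length of their generated thinking trajectory,
# on the intuition that longer thinking tracks harder, more valuable text.
import random

def sample_batch(samples, thinking_token_counts, batch_size=32, seed=0):
    """Weighted draw (with replacement) favoring longer-trajectory samples."""
    rng = random.Random(seed)
    weights = [max(float(n), 1.0) for n in thinking_token_counts]  # floor avoids zero weight
    return rng.choices(samples, weights=weights, k=batch_size)
```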
Long-Term Applications
- Clinical reasoning support with auditable chains (Healthcare)
- What: TPT-trained domain LLMs that produce internal reasoning for guideline adherence, differential diagnosis, and care pathways.
- Tools/products/workflows: Med-specific TPT corpora; clinician-facing viewers for internal chains (for QA); verifier models checking consistency with guidelines.
- Assumptions/dependencies: Rigorous validation, regulatory approvals (e.g., FDA/CE), privacy/PHI controls, domain-grounded datasets.
- Contract analysis and legal argumentation (Legal)
- What: Reasoning-first models that surface structured arguments, issue spotting, and clause interactions with internal thought for review.
- Tools/products/workflows: “Argument maps” derived from trajectories; negotiation support tools; discovery triage assistants.
- Assumptions/dependencies: Liability frameworks; correctness guarantees; curated legal corpora; jurisdiction-specific nuances.
- Risk, compliance, and audit assistants (Finance)
- What: Systems that can produce traceable, step-wise rationale for risk assessments, controls testing, and regulatory filings.
- Tools/products/workflows: Reasoning logs bound to data lineage; scenario-analysis with explicit assumptions; audit dashboards.
- Assumptions/dependencies: Data privacy (e.g., GLBA), reproducibility requirements, human audit sign-off, model monitoring.
- Policy decision-support with transparency (Public Sector)
- What: Evidence synthesis and option appraisal with documented internal reasoning to increase accountability and auditability.
- Tools/products/workflows: Policy briefs with hidden/internal chains exposed to reviewers; M&E pipelines scoring consistency of rationale across sources.
- Assumptions/dependencies: Governance on disclosure of chain-of-thought; procurement standards; bias and fairness assessments.
- Language-driven robotic planning and process automation (Robotics/Manufacturing)
- What: Use TPT-style trajectories as priors for multi-step plans (task decomposition, retries, backtracking) bridging LLM planning and control stacks.
- Tools/products/workflows: Plan2Act stacks consuming “thinking trajectories” as candidate plans; simulators that score plan robustness; execution monitors.
- Assumptions/dependencies: Reliable grounding to perception/control; sim-to-real transfer; multi-modal integration.
- Multimodal thinking augmentation (Vision, Tables, Code, Speech)
- What: Extend TPT to images/diagrams/tables to create cross-modal, step-wise explanations improving general reasoning.
- Tools/products/workflows: Multimodal generators for trajectories; sequence-packing for long multimodal contexts; cross-modal verifiers.
- Assumptions/dependencies: Long-context attention efficiency; robust OCR/structure extraction; alignment across modalities.
- Standardized “thinking-augmented datasets” ecosystem
- What: Shared formats, licenses, and quality metrics for publishing and consuming thought-augmented corpora.
- Tools/products/workflows: Dataset schemas with provenance; quality dashboards (consistency, correctness, length distributions); marketplaces.
- Assumptions/dependencies: Community consensus on standards; IP/licensing clarity for synthetic content; governance to curb low-quality proliferation.
- Systems/hardware co-design for long-sequence training/inference
- What: Architectures and memory systems optimized for longer sequences and higher token budgets created by trajectory concatenation.
- Tools/products/workflows: Efficient attention variants; IO-aware dataloaders; memory-optimized KV caches; sequence-parallel training.
- Assumptions/dependencies: Vendor support; cost-performance gains vs. complexity; stability at scale.
- Watermarking, provenance, and safety for synthetic thinking
- What: Techniques to tag, trace, and filter synthetic trajectories; monitors to prevent contamination or misuse.
- Tools/products/workflows: Watermarking at generation; provenance fields in dataset metadata; detectors for overuse of synthetic style.
- Assumptions/dependencies: Reliable watermarking; adoption by data platforms; low false positive/negative rates.
- Automated prompt optimization and back-thinking loops in data curation
- What: AutoML pipelines that evolve prompts/generators to improve trajectory quality and training outcomes.
- Tools/products/workflows: Closed-loop evaluators that score downstream performance and adjust prompts/generator size; targeted “random focus point” variants where helpful.
- Assumptions/dependencies: Stable online/offline metrics; prevention of overfitting to benchmarks; compute budget for exploration.
- Regulation-aligned chain-of-thought handling policies
- What: Enterprise controls over when internal reasoning is generated, stored, or displayed; separation between training-only thought vs. user-facing outputs.
- Tools/products/workflows: Policy engines; configurable inference templates; redaction tools for thought content.
- Assumptions/dependencies: Evolving regulatory guidance; sector-specific norms; user consent and logging requirements.
- Predictive maintenance and troubleshooting with structured reasoning (Energy/Industrial IoT)
- What: Step-wise diagnostic models that explain fault trees and mitigation steps from manuals and logs.
- Tools/products/workflows: TPT-augmented corpora built from maintenance records; operator copilots; incident retrospectives with rationale.
- Assumptions/dependencies: Access to proprietary maintenance data; domain adaptation and validation; integration with existing CMMS/SCADA.
These applications leverage the core strengths demonstrated by TPT: scalable, annotation-free augmentation; dynamic compute allocation to difficult content; and strong gains on reasoning benchmarks. Feasibility depends on the availability and licensing of trajectory generators, careful safety and quality controls for synthetic content, and integration into existing data engineering and training stacks.
Glossary
- Abundant data: A training regime where data is plentiful so each sample is used at most once. "For pre-training under abundant data, each data sample is utilized at most once, assuming the dataset has been deduplicated."
- AIME24: A competitive math benchmark (American Invitational Mathematics Examination, 2024) used to evaluate reasoning. "AIME24 (from … to …, a … increase)"
- AIME25: A competitive math benchmark (American Invitational Mathematics Examination, 2025) used to evaluate reasoning. "MATH-500, AIME24, AIME25, GSM8k, HMMT"
- BoolQ: A yes/no reading comprehension benchmark assessing general understanding. "BoolQ (0-shot)"
- Chain-of-Thought (CoT): A prompting method that elicits step-by-step reasoning from LLMs. "MMLU (2-shot with CoT)"
- Continual pre-training: Further pre-training of an already trained model on additional data to improve capabilities. "Mid-training, alternatively referred to as continual pre-training, enhances the capabilities of existing LLMs"
- Data engineering: The process of collecting, cleaning, and transforming data for large-scale training. "Modern data engineering pipelines"
- Deduplication: Removing duplicate data to avoid overfitting and data leakage. "then performs deduplication"
- Document-level training: Training techniques and operations that are applied at the granularity of entire documents. "operates at the document level"
- Document packing: Concatenating or grouping documents into fixed-length samples for efficient training. "Documents were packed into samples of $8$k tokens each"
- Domain balancing: Adjusting dataset composition so different domains are represented fairly. "parsing, deduplication, filtering, domain balancing, rewriting"
- Feynman technique: A method of learning by explaining concepts simply to ensure deep understanding. "Use Feynman technique whenever possible to ensure a deep understanding."
- Few-shot: Evaluation or training with only a handful of labeled examples. "The average few-shot accuracy scores on the GSM8k and MATH datasets"
- FineWeb-Edu: An education-focused subset of the FineWeb corpus used for LLM training. "FineWeb-Edu"
- Frontier models: The most advanced, cutting-edge LLMs trained on vast high-quality data. "existing frontier models"
- GPQA-Diamond: A difficult subset of the Graduate-Level Google-Proof Question Answering benchmark for general reasoning. "GPQA-Diamond, MMLU, JEEBench"
- GSM8k: A grade school math word problem benchmark used to test reasoning. "GSM8k (5-shot)"
- HMMT: A math competition dataset (Harvard-MIT Math Tournament) used for evaluating mathematical reasoning. "MATH-500, AIME24, AIME25, GSM8k, HMMT"
- HumanEval: A code generation benchmark assessing functional correctness of generated programs. "HumanEval and LiveCodeBench v4_v5"
- Inference compute: The computational resources allocated during model inference. "more difficult samples benefit from increased inference compute."
- JEEBench: A challenging benchmark based on Joint Entrance Examination problems for advanced reasoning. "JEEBench"
- LCB (LiveCodeBench): A coding benchmark evaluating code generation and execution correctness. "LiveCodeBench (LCB)"
- LLaMA: A family of open-source LLM architectures used as training baselines. "following the LLaMA-3-8B architecture"
- LLM-friendly format: Data formatting that makes content easier for LLMs to learn from. "transform raw text into a more LLM-friendly format"
- MATH: A math benchmark for complex problem-solving and reasoning. "MATH (4-shot)"
- MegaMath-Web-Pro-Max: A large-scale math-heavy corpus for pre-training and mid-training. "MegaMath-Web-Pro-Max"
- Mid-training: Continual pre-training applied to an existing model to improve capabilities. "Thinking Augmented Mid-training"
- Mixture-of-Thoughts: A public dataset of reasoning-rich samples used for supervised fine-tuning. "Mixture-of-Thoughts dataset"
- MMLU: A broad multi-task language understanding benchmark. "MMLU (2-shot with CoT)"
- MMLU-Pro: A more challenging and robust version of the MMLU benchmark. "MMLU"
- Next-token prediction loss: The standard autoregressive training objective minimizing log-likelihood of the next token. "minimize the standard next-token prediction loss"
- Online rollouts: Generating trajectories during training (often for RL), which increases compute cost. "does not require online rollouts"
- Perplexity: A measure of how well an LLM predicts a sample; lower is better. "the lower perplexity alone does not guarantee superior performance on downstream tasks."
- Polynomial division: A mathematical operation dividing one polynomial by another, relevant to reasoning tasks. "necessitate an understanding of polynomial division, the Remainder Theorem, and the properties of divisors."
- Remainder Theorem: A theorem relating polynomial evaluation to remainders, used in math reasoning. "necessitate an understanding of polynomial division, the Remainder Theorem, and the properties of divisors."
- Reinforcement learning: An optimization paradigm using rewards and rollouts to train models, often for reasoning. "propose to fine-tune LLMs with reinforcement learning to explicitly encourage the generation of long thinking trajectories."
- Reinforcement Pre-Training (RPT): A method applying reinforcement learning during pre-training to improve token prediction. "Compared to RPT~\citep{dong2025reinforcement}, our method does not require online rollouts"
- Sample weights: Weights assigned to training samples to balance sources or domains. "sample weights were adjusted to balance the different data sources."
- Scaling law: The empirical relationship showing performance improves with larger models and more data. "A foundational principle underpinning this success is the scaling law"
- SFT (Supervised Fine-Tuning): Post-training on labeled datasets to align models with desired behaviors. "This SFT dataset comprises $350$k samples"
- Synthetic data generation: Creating artificial training data via models or procedures to augment corpora. "synthetic data generation"
- Test-time scaling: Increasing inference compute or generation length to improve performance on harder inputs. "test-time scaling"
- Thinking augmented pre-training (TPT): A method that appends generated reasoning trajectories to data to improve learnability. "Thinking augmented Pre-Training (TPT)"
- Thinking pattern analysis: Examining properties of generated reasoning trajectories across domains and difficulty. "Thinking pattern analysis reveals that our method naturally up-samples high-quality data"
- Thinking trajectory: The sequence of intermediate reasoning steps appended to documents. "augmenting existing text data with thinking trajectories."
- Token budget: The allocated number of training tokens for a training phase. "with respect to the mid-training token budget."
- Up-sampling: Increasing the relative presence or training compute for certain data to emphasize it. "functions as a natural up-sampling mechanism."
- Vanilla next-token prediction: Standard pre-training without special augmentation or reasoning steps. "vanilla next-token prediction objective"
- Web-crawled corpora: Large text datasets gathered automatically from the web for training. "primarily derived from web-crawled corpora."
- Zero-shot: Evaluation without providing any labeled examples or demonstrations. "BoolQ (0-shot)"