- The paper introduces BOLT, a novel three-stage method (bootstrapping, SFT, online training) that enables ShortCoT models to acquire Long Chain-of-Thought capabilities without distillation or manual data.
- BOLT generates initial LongCoT data using in-context learning with short-chain models, then refines the model through supervised fine-tuning and online training via algorithms like DPO.
- Applying BOLT improves performance across various benchmarks (MT-Bench, Arena-Hard, etc.) for different model scales, demonstrating effective reasoning skill development with minimal initial LongCoT examples.
The paper introduces Bootstrapping Long Chain-of-Thought (BOLT), a novel methodology that equips LLMs with Long Chain-of-Thought (LongCoT) capabilities without relying on knowledge distillation from existing LongCoT models or extensive manual annotation. The method bootstraps LongCoT from a ShortCoT LLM through three stages: LongCoT data bootstrapping, LongCoT supervised fine-tuning, and LongCoT online training.
The introduction posits that while almost all modern LLMs can reason via chain-of-thought prompting techniques [Wei et al., 2022], regular LLMs exhibit simpler behavior compared to models such as o1 from OpenAI. The paper defines o1-like models, which generate long chain-of-thoughts with reasoning behavior, as LongCoT models and regular LLMs as ShortCoT models. The paper asserts that previous attempts to replicate LongCoT rely primarily on knowledge distillation using data from existing LongCoT models, which leaves gaps in understanding how to systematically develop such reasoning skills.
The related works section discusses OpenAI's o1 model [Jaech et al., 2024], which employs LongCoTs to leverage reasoning actions. This enhances model performance in areas such as mathematics, coding, and scientific problems. The section also references a concurrent work by DeepSeek [Guo et al., 2025] demonstrating that reinforcement learning applied to a 671B parameter model can yield LongCoT capabilities. The paper states that existing Reinforcement Learning from Human Feedback (RLHF) methods focus on single-stage response generation and lack mechanisms for models to revise, backtrack, or critique their own internal thought processes.
BOLT comprises three stages. First, LongCoT bootstrapping synthesizes LongCoT data. Second, LongCoT supervised fine-tuning trains a ShortCoT model to adapt to the LongCoT format, incorporating reasoning elements and producing extended chains of thought before arriving at an external solution. Third, LongCoT online training refines the LongCoT SFT model through online exploration and on-policy refinement.
Key notations are introduced, where x represents a query, z denotes internal thoughts, and y indicates an external solution. M is used to denote off-the-shelf LLMs, and T represents models or policies trained in the experiments.
$$(y, z) \sim M_{\text{bootstrapping}}\big(y, z \mid f_{\text{formatting}}(x, D_{\text{ICL}})\big)$$

where (a code sketch of this sampling step is given after the list):

- $y$: external solution
- $z$: internal thoughts
- $M_{\text{bootstrapping}}$: ShortCoT LLM used to generate LongCoT data
- $f_{\text{formatting}}$: template that wraps $x$ and $D_{\text{ICL}}$ into an LLM input
- $x$: query
- $D_{\text{ICL}}$: collection of in-context examples
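A minimal sketch of this bootstrapping step, assuming a hypothetical `shortcot_generate` text-generation helper and a plain-text prompt layout with `Internal thoughts:` / `Solution:` markers; the actual template and parsing rules used in the paper may differ:

```python
def f_formatting(query: str, icl_examples: list[dict]) -> str:
    """Build the prompt: each in-context example shows internal thoughts, then a solution."""
    blocks = []
    for ex in icl_examples:
        blocks.append(
            f"Query: {ex['query']}\n"
            f"Internal thoughts: {ex['thoughts']}\n"
            f"Solution: {ex['solution']}\n"
        )
    blocks.append(f"Query: {query}\nInternal thoughts:")
    return "\n".join(blocks)


def bootstrap_sample(query: str, icl_examples: list[dict], shortcot_generate):
    """Sample (z, y) from the ShortCoT model M_bootstrapping given the formatted prompt."""
    prompt = f_formatting(query, icl_examples)
    completion = shortcot_generate(prompt)        # hypothetical text-generation call
    thoughts, _, solution = completion.partition("Solution:")
    return thoughts.strip(), solution.strip()     # (z, y)
```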
In the LongCoT-with-in-context-learning stage, in-context LongCoT examples are used to prompt ShortCoT models. Each example contains a long-form chain of thought and the corresponding solution derived from that reasoning; the LongCoT incorporates problem analysis, planning, branching, and reflection. For query mixture curation, the query distribution is designed to cover a wide range of topics, and the curation pipeline consists of query collection, difficulty scoring and filtering, and topic tagging with sub-sampling. Seven criteria are considered: specificity, domain knowledge, complexity, problem-solving, creativity, technical accuracy, and real-world application. Each query receives a binary (0/1) label on each criterion, and its quality or difficulty level is the sum over the seven criteria.
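A sketch of the difficulty-scoring step under the assumption of a hypothetical `judge(query, criterion)` labeler returning 0 or 1 per criterion; the criterion names come from the paper, while the judge interface and the filtering threshold are illustrative:

```python
# Score a query on seven binary criteria and keep it if the total
# meets a difficulty threshold. `judge(query, criterion)` is a
# hypothetical LLM-based labeler returning 0 or 1.
CRITERIA = [
    "specificity", "domain knowledge", "complexity", "problem-solving",
    "creativity", "technical accuracy", "real-world application",
]

def difficulty_score(query: str, judge) -> int:
    return sum(judge(query, criterion) for criterion in CRITERIA)

def filter_queries(queries, judge, threshold: int = 4):
    # The threshold value is an illustrative choice, not from the paper.
    return [q for q in queries if difficulty_score(q, judge) >= threshold]
```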
In response filtering, heuristics and rules filter out data whose responses (y, z) do not follow the format demonstrated in the in-context examples. Each response consists of z (internal thoughts) and y (an external solution). An outcome reward model (ORM) is then used to assess the quality of y, and data instances are filtered based on this quality score. After these filtering steps, the high-quality LongCoT dataset $D_{\text{bootstrapping}} = \{(x, y, z)\}$ is obtained.
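A sketch of this filtering step, assuming the responses use illustrative `<thought>`/`<solution>` tags and a hypothetical `orm_score(x, y)` function; the tag names and the score threshold are assumptions, not the paper's specification:

```python
import re

# Keep only responses that follow the demonstrated format and whose
# external solution y scores above an ORM threshold.
PATTERN = re.compile(r"<thought>(.*?)</thought>\s*<solution>(.*?)</solution>", re.S)

def filter_responses(records, orm_score, min_score: float = 0.5):
    kept = []
    for x, response in records:          # records: iterable of (query, raw response)
        match = PATTERN.search(response)
        if match is None:                 # rule-based format check
            continue
        z, y = match.group(1), match.group(2)
        if orm_score(x, y) >= min_score:  # ORM judges the external solution only
            kept.append({"query": x, "thoughts": z, "solution": y})
    return kept                           # D_bootstrapping = {(x, y, z)}
```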
With $D_{\text{bootstrapping}}$, supervised fine-tuning is conducted on a ShortCoT model so that it learns long-form chain-of-thought, the reasoning elements involved in it, and the format of first producing internal thoughts and then an external response, yielding an initial LongCoT model, $T_0$.
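One possible way to serialize $D_{\text{bootstrapping}}$ into SFT targets that place internal thoughts before the external solution, reusing the illustrative tags above (the paper's exact serialization is not specified here):

```python
def to_sft_example(record: dict) -> dict:
    """Target text: internal thoughts first, then the external solution."""
    target = (
        f"<thought>{record['thoughts']}</thought>\n"
        f"<solution>{record['solution']}</solution>"
    )
    return {"prompt": record["query"], "completion": target}
```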
With the SFT model $T_0$ as initialization, online training is conducted to further improve the policy $T_\theta(y, z \mid x)$ by maximizing reward subject to a conservative (KL) constraint:
$$\max_{T_\theta} \; \mathbb{E}_{x \sim D_{\text{online}},\, (y,z) \sim T_\theta(y, z \mid x)}\big[ r_\theta(x, z, y) \big] \;-\; \beta\, D_{\mathrm{KL}}\big[ T_\theta(y, z \mid x) \,\|\, T_0(y, z \mid x) \big],$$
- $x$: query
- $y$: external solution
- $z$: internal thoughts
- $T_\theta$: policy model
- $D_{\text{online}}$: online dataset
- $r_\theta$: reward model
- $\beta$: regularization coefficient
- $D_{\mathrm{KL}}$: Kullback-Leibler divergence
The objective can be instantiated with variants such as DPO, REINFORCE, RLOO, and PPO. The reward model assigns a score to y given x, that is, $r_\theta: \mathcal{X} \times \mathcal{Y} \rightarrow \mathbb{R}$. In practice, a rule-based format reward is also included to encourage model responses to follow the defined format.
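A sketch of how an ORM score and a rule-based format reward could be combined, again assuming the illustrative `<thought>`/`<solution>` tags and a hypothetical `orm_score` function; the additive weighting is an arbitrary choice:

```python
import re

THOUGHT_SOLUTION = re.compile(r"<thought>(.*?)</thought>\s*<solution>(.*?)</solution>", re.S)

def format_reward(response: str) -> float:
    """Rule-based reward: 1.0 if the response follows the thought/solution format, else 0.0."""
    return 1.0 if THOUGHT_SOLUTION.search(response) else 0.0

def total_reward(x: str, response: str, orm_score) -> float:
    """ORM score of the external solution y given x, plus the format reward."""
    match = THOUGHT_SOLUTION.search(response)
    y = match.group(2) if match else response
    return orm_score(x, y) + format_reward(response)
```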
The experiment section demonstrates that BOLT effectively develops LongCoT capacities in ShortCoT LLMs. The setup focuses on evaluating models' reasoning capabilities across diverse domains, with an emphasis on real-world queries. The benchmarks include MT-Bench [Zheng et al., 2023], Arena-Hard [Li et al., 2024a], WildBench [Lin et al., 2024b], ZebraLogic [Lin et al., 2024a], and MATH500 [Lightman et al., 2023].
BOLT is applied to Mistral-7B-Instruct-v0.3 [Jiang et al., 2023], Meta-Llama-3.1-8B-Instruct [Grattafiori et al., 2024], and Meta-Llama-3.1-70B-Instruct [Grattafiori et al., 2024] to test the method's effectiveness across different model scales. In the first stage of BOLT, LongCoT bootstrapping generates a dataset of 220k instances. LongCoT supervised fine-tuning is performed on this dataset for 4 epochs.
In LongCoT online training, eight responses are sampled for each query. ArmoRM-Llama3-8B [Wang et al., 2024] is used as the reward model. DPO training hyperparameters include a regularization coefficient of $\beta = 0.1$, a learning rate of $5 \times 10^{-7}$ with a cosine scheduler and a warm-up ratio of 0.1, a batch size of 128, and the AdamW optimizer. Online training runs for 3 iterations of 2 epochs each.
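A sketch tying together the reported DPO hyperparameters and the eight-samples-per-query setup; the `sample_fn`/`reward_fn` helpers are hypothetical, and pairing the best against the worst sample as chosen/rejected is an assumption about the pairing rule:

```python
# Build DPO preference pairs: sample 8 responses per query, score the
# external solutions with the reward model, and pair best vs. worst.
def build_dpo_pairs(queries, sample_fn, reward_fn, n_samples: int = 8):
    pairs = []
    for x in queries:
        responses = [sample_fn(x) for _ in range(n_samples)]
        scored = sorted(responses, key=lambda r: reward_fn(x, r))
        pairs.append({"prompt": x, "chosen": scored[-1], "rejected": scored[0]})
    return pairs

# Hyperparameters reported for DPO online training.
DPO_CONFIG = {
    "beta": 0.1,               # KL regularization coefficient
    "learning_rate": 5e-7,     # cosine scheduler
    "warmup_ratio": 0.1,
    "batch_size": 128,
    "optimizer": "AdamW",
    "epochs_per_iteration": 2,
    "num_iterations": 3,
}
```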
The results demonstrate performance improvements on diverse benchmarks. The benchmarks feature challenging real-user queries and assess models on math, coding, logical problem-solving, and general capabilities. A performance trajectory during the BOLT training process shows that after Bootstrapping SFT, performance gains are significant compared to the initial model. Also, LongCoT online training via DPO consistently boosts performance.
An ablation on reward models investigates the impact of the reward model in the online DPO training process by comparing ArmoRM-Llama3-8B and Skywork-Reward-Llama-3.1-8B [Liu et al., 2024]. In an ablation on initial models, BOLT-Llama-3.1-8B-Base, while not performing as well as BOLT-Llama-3.1-8B-Instruct, surpasses Meta-Llama-3.1-8B-Instruct. In an ablation on online training algorithms, DPO outperforms REINFORCE, RLOO, and PPO.
In conclusion, the paper presents BOLT, a three-stage approach that bootstraps LongCoT capabilities from ShortCoT models. A finding is that the bootstrapping stage requires only 10 examples to initiate the process.