BOLT: Bootstrap Long Chain-of-Thought in Language Models without Distillation (2502.03860v1)

Published 6 Feb 2025 in cs.CL

Abstract: LLMs, such as o1 from OpenAI, have demonstrated remarkable reasoning capabilities. o1 generates a long chain-of-thought (LongCoT) before answering a question. LongCoT allows LLMs to analyze problems, devise plans, reflect, and backtrack effectively. These actions empower LLMs to solve complex problems. After the release of o1, many teams have attempted to replicate its LongCoT and reasoning capabilities. In terms of methods, they primarily rely on knowledge distillation with data from existing models with LongCoT capacities (e.g., OpenAI-o1, Qwen-QwQ, DeepSeek-R1-Preview), leaving significant uncertainties on systematically developing such reasoning abilities. In terms of data domains, these works focus narrowly on math while a few others include coding, limiting their generalizability. This paper introduces a novel approach to enable LLMs' LongCoT capacity without distillation from o1-like models or expensive human annotations, where we bootstrap LongCoT (BOLT) from a standard instruct model. BOLT involves three stages: 1) LongCoT data bootstrapping with in-context learning on a standard instruct model; 2) LongCoT supervised finetuning; 3) online training to further refine LongCoT capacities. In BOLT, only a few in-context examples need to be constructed during the bootstrapping stage; in our experiments, we created 10 examples, demonstrating the feasibility of this approach. We use Llama-3.1-70B-Instruct to bootstrap LongCoT and apply our method to various model scales (7B, 8B, 70B). We achieve impressive performance on a variety of benchmarks (Arena-Hard, MT-Bench, WildBench, ZebraLogic, and MATH500), which evaluate diverse task-solving and reasoning capabilities.

Summary

  • The paper introduces BOLT, a novel three-stage method (bootstrapping, SFT, online training) that enables ShortCoT models to acquire Long Chain-of-Thought capabilities without distillation or manual data.
  • BOLT generates initial LongCoT data using in-context learning with short-chain models, then refines the model through supervised fine-tuning and online training via algorithms like DPO.
  • Applying BOLT improves performance across various benchmarks (MT-Bench, Arena-Hard, etc.) for different model scales, demonstrating effective reasoning skill development with minimal initial LongCoT examples.

The paper introduces Bootstrapping Long Chain-of-Thought (BOLT), a novel methodology designed to imbue LLMs with Long Chain-of-Thought (LongCoT) capabilities without relying on knowledge distillation from existing LongCoT models or extensive manual annotation. The method bootstraps LongCoT from a ShortCoT LLM through three stages: LongCoT data bootstrapping, LongCoT supervised fine-tuning, and LongCoT online training.

The introduction posits that while almost all modern LLMs can reason via chain-of-thought prompting techniques [Wei et al., 2022], regular LLMs exhibit simpler behavior compared to models such as o1 from OpenAI. The paper defines o1-like models, which generate long chain-of-thoughts with reasoning behavior, as LongCoT models and regular LLMs as ShortCoT models. The paper asserts that previous attempts to replicate LongCoT rely primarily on knowledge distillation using data from existing LongCoT models, which leaves gaps in understanding how to systematically develop such reasoning skills.

The related works section discusses OpenAI's o1 model [Jaech et al., 2024], which employs LongCoTs to leverage reasoning actions. This enhances model performance in areas such as mathematics, coding, and scientific problems. The section also references a concurrent work by DeepSeek [Guo et al., 2025] demonstrating that reinforcement learning applied to a 671B parameter model can yield LongCoT capabilities. The paper states that existing Reinforcement Learning from Human Feedback (RLHF) methods focus on single-stage response generation and lack mechanisms for models to revise, backtrack, or critique their own internal thought processes.

BOLT comprises three stages. First, LongCoT bootstrapping synthesizes LongCoT data. Second, LongCoT supervised finetuning trains a ShortCoT model to adapt to the LongCoT format, incorporating reasoning elements and practicing extended chains of thought before arriving at an external solution. Third, LongCoT online training refines the LongCoT SFT model through online exploration and on-policy refinement.

Key notations are introduced, where $x$ represents a query, $z$ denotes internal thoughts, and $y$ indicates an external solution. $M$ is used to denote off-the-shelf LLMs, and $T$ represents models or policies trained in the experiments.

(y, z) \sim M_\text{bootstrapping}(y, z \mid f_\text{formatting}(x, D_\text{ICL}))

  • $y$: external solution
  • $z$: internal thoughts
  • $M_\text{bootstrapping}$: ShortCoT LLM used to generate LongCoT data
  • $f_\text{formatting}$: template that wraps $x$ and $D_\text{ICL}$ as an LLM input
  • $x$: query
  • $D_\text{ICL}$: collection of in-context examples
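
A minimal sketch of this sampling step is given below. The prompt template, the `<thoughts>`/`<solution>` tags, and the `generate()` interface are illustrative assumptions, not the paper's exact implementation; the point is that a ShortCoT instruct model is prompted with the in-context LongCoT examples and its output is parsed into internal thoughts and an external solution.

```python
# Hypothetical sketch of LongCoT data bootstrapping via in-context learning.
# Tags, template, and generate() are assumptions for illustration only.

def format_prompt(query: str, icl_examples: list[dict]) -> str:
    """f_formatting: wrap the query and in-context LongCoT examples into one prompt."""
    blocks = []
    for ex in icl_examples:
        blocks.append(
            f"Question: {ex['query']}\n"
            f"<thoughts>{ex['internal_thoughts']}</thoughts>\n"
            f"<solution>{ex['external_solution']}</solution>"
        )
    blocks.append(f"Question: {query}")  # the model continues with thoughts + solution
    return "\n\n".join(blocks)


def bootstrap_one(query: str, icl_examples: list[dict], generate) -> dict | None:
    """Sample (y, z) ~ M_bootstrapping(. | f_formatting(x, D_ICL)) and parse it."""
    raw = generate(format_prompt(query, icl_examples))  # generate() wraps the ShortCoT LLM
    if "<thoughts>" not in raw or "<solution>" not in raw:
        return None  # dropped later by format-based response filtering
    z = raw.split("<thoughts>")[1].split("</thoughts>")[0].strip()
    y = raw.split("<solution>")[1].split("</solution>")[0].strip()
    return {"query": query, "internal_thoughts": z, "external_solution": y}
```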

In the LongCoT with in-context learning stage, in-context examples of LongCoT are used to prompt ShortCoT models. Each example includes a long-form chain-of-thought and the corresponding solution derived from that reasoning process. The LongCoT incorporates problem analysis, planning, branching, and reflection. In query mixture curation, a query distribution is assembled to cover a wide range of topics; the curation pipeline involves query collection, difficulty scoring and filtering, and topic tagging and sub-sampling. Seven criteria are considered: specificity, domain knowledge, complexity, problem-solving, creativity, technical accuracy, and real-world application. A binary (0/1) label is assigned to each query on each criterion, and the quality or difficulty level of a query is the total over the seven criteria.
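
A sketch of the difficulty-scoring idea, assuming an LLM judge that returns a 0/1 label per criterion; the judge prompt, parsing, and threshold value are assumptions, not the paper's exact pipeline.

```python
# Hypothetical sketch of query difficulty scoring: sum of seven binary labels.

CRITERIA = [
    "specificity", "domain knowledge", "complexity", "problem-solving",
    "creativity", "technical accuracy", "real-world application",
]

def score_query(query: str, judge) -> int:
    """Return a difficulty score in [0, 7] as the sum of binary criterion labels."""
    labels = []
    for criterion in CRITERIA:
        prompt = (
            f"Does the following query require {criterion}? Answer 1 for yes, 0 for no.\n"
            f"Query: {query}\nAnswer:"
        )
        labels.append(1 if judge(prompt).strip().startswith("1") else 0)
    return sum(labels)

def filter_by_difficulty(queries: list[str], judge, threshold: int = 4) -> list[str]:
    """Keep queries whose total score meets a threshold (threshold value is illustrative)."""
    return [q for q in queries if score_query(q, judge) >= threshold]
```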

In response filtering, heuristics and rules filter out data where the responses $(y, z)$ do not follow the format demonstrated in the in-context examples. Each response consists of $z$ (internal thoughts) and $y$ (an external solution). An outcome reward model (ORM) is used to assess the quality of $y$ and filter data instances based on its quality score. After these filtering steps, the high-quality LongCoT dataset $D_\text{bootstrapping} = \{(x, y, z)\}$ is obtained.
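
The filtering step could look roughly as follows; the ORM interface and the score threshold are assumptions for illustration.

```python
# Hypothetical sketch of response filtering: a format check on (z, y) plus an
# outcome reward model (ORM) score threshold on y.

def keep_instance(instance: dict, orm_score, min_score: float = 0.5) -> bool:
    """Keep (x, z, y) only if it parsed into both parts and the ORM rates y highly enough."""
    if not instance.get("internal_thoughts") or not instance.get("external_solution"):
        return False  # format heuristics: both internal thoughts and external solution must exist
    score = orm_score(instance["query"], instance["external_solution"])
    return score >= min_score

def build_bootstrapping_set(candidates: list[dict], orm_score) -> list[dict]:
    """D_bootstrapping = {(x, y, z)} after format- and quality-based filtering."""
    return [inst for inst in candidates if keep_instance(inst, orm_score)]
```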

With $D_\text{bootstrapping}$, supervised finetuning is conducted on a ShortCoT model so that it learns long-form chain-of-thought, the reasoning elements involved in it, and the format of first producing internal thoughts and then an external response, yielding an initial LongCoT model, $T_0$.
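
A minimal sketch of turning one bootstrapped instance into an SFT example under this thoughts-then-solution format, reusing the illustrative tags from the bootstrapping sketch above (the tags and field names are assumptions):

```python
# Hypothetical sketch: map (x, z, y) to a prompt/target pair for LongCoT SFT,
# where the target first produces internal thoughts z and then the solution y.

def to_sft_example(instance: dict) -> dict:
    target = (
        f"<thoughts>{instance['internal_thoughts']}</thoughts>\n"
        f"<solution>{instance['external_solution']}</solution>"
    )
    return {"prompt": instance["query"], "target": target}
```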

With the SFT model $T_0$ as an initialization, online training is conducted to further improve the policy $T_\theta(y, z \mid x)$ by maximizing the reward under a conservative, KL-constrained objective:

\max_{T_\theta} \mathbb{E}_{x \sim D_\text{online},\, (y, z) \sim T_\theta(y, z \mid x)} \left[ r_\theta(x, z, y) \right] - \beta\, D_\text{KL}\left[ T_\theta(y, z \mid x) \,\|\, T_0(y, z \mid x) \right],

  • $x$: query
  • $y$: external solution
  • $z$: internal thoughts
  • $T_\theta$: policy model
  • $D_\text{online}$: online dataset
  • $r_\theta$: reward model
  • $\beta$: regularization coefficient
  • $D_\text{KL}$: Kullback-Leibler divergence

The objective can be instantiated with variants such as DPO, REINFORCE, RLOO, and PPO. The reward model assigns a score to $y$ given $x$, that is, $r_\theta : \mathcal{X} \times \mathcal{Y} \rightarrow \mathbb{R}$. In practice, a rule-based format reward is included to encourage model responses to follow the defined format.
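
As an example of the DPO instantiation, one can sample several responses per query from the current policy, rank them with the reward model to form (chosen, rejected) pairs, and apply the standard DPO loss. The pairing rule and the log-probability interface below are assumptions; only the loss itself is the standard DPO formulation.

```python
# Sketch of one DPO instantiation of the KL-regularized objective.
import math

def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def dpo_loss(logp_chosen: float, logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """Per-pair DPO loss: -log sigmoid(beta * implicit-reward margin vs. the reference T_0)."""
    margin = beta * ((logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected))
    return softplus(-margin)  # equals -log(sigmoid(margin))

def make_preference_pair(samples: list[dict], reward) -> tuple[dict, dict]:
    """Rank sampled (z, y) responses for one query by the reward on y; pair best vs. worst."""
    ranked = sorted(samples, key=lambda s: reward(s["query"], s["external_solution"]))
    return ranked[-1], ranked[0]  # (chosen, rejected)
```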

The experiment section demonstrates that BOLT effectively develops LongCoT capacities in ShortCoT LLMs. The setup focuses on evaluating models' reasoning capabilities across diverse domains, with an emphasis on real-world queries. The benchmarks include MT-Bench [Zheng et al., 2023], Arena-Hard [Li et al., 2024a], WildBench [Lin et al., 2024b], ZebraLogic [Lin et al., 2024a], and MATH500 [Lightman et al., 2023].

BOLT is applied to Mistral-7B-Instruct-v0.3 [Jiang et al., 2023], Meta-Llama-3.1-8B-Instruct [Grattafiori et al., 2024], and Meta-Llama-3.1-70B-Instruct [Grattafiori et al., 2024] to test the effectiveness of the method across different model scales. In the first stage of BOLT, LongCoT bootstrapping generates a dataset of 220k instances. LongCoT supervised finetuning is performed on this dataset for 4 epochs.

In LongCoT online training, eight responses are sampled for each query. ArmoRM-Llama3-8B [Wang et al., 2024] is used as the reward model. DPO training hyperparameters include a regularization coefficient of $\beta = 0.1$, a learning rate of 5e-7 with a cosine scheduler and a warm-up ratio of 0.1, a batch size of 128, and AdamW as the optimizer. Online training is conducted over 3 iterations, and each iteration consists of 2 epochs.
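
Collected into a single configuration, the reported settings look roughly as follows (the field names are illustrative; the values are those stated above).

```python
# Illustrative configuration for the online DPO training stage.
dpo_config = {
    "samples_per_query": 8,
    "reward_model": "ArmoRM-Llama3-8B",
    "beta": 0.1,                  # KL regularization coefficient
    "learning_rate": 5e-7,
    "lr_scheduler": "cosine",
    "warmup_ratio": 0.1,
    "batch_size": 128,
    "optimizer": "AdamW",
    "iterations": 3,
    "epochs_per_iteration": 2,
}
```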

The results demonstrate performance improvements on diverse benchmarks. The benchmarks feature challenging real-user queries and assess models on math, coding, logical problem-solving, and general capabilities. A performance trajectory during the BOLT training process shows that after Bootstrapping SFT, performance gains are significant compared to the initial model. Also, LongCoT online training via DPO consistently boosts performance.

An ablation on reward models investigates the impact of the reward model in the online DPO training process by comparing ArmoRM-Llama3-8B and Skywork-Reward-Llama-3.1-8B [Liu et al., 2024]. In an ablation on initial models, BOLT-Llama-3.1-8B-Base, while not performing as well as BOLT-Llama-3.1-8B-Instruct, surpasses Meta-Llama-3.1-8B-Instruct. In an ablation on online training algorithms, DPO outperforms REINFORCE, RLOO, and PPO.

In conclusion, the paper presents BOLT, a three-stage approach that bootstraps LongCoT capabilities from ShortCoT models. A finding is that the bootstrapping stage requires only 10 examples to initiate the process.
