CodePMP: Scalable Preference Model Pretraining for Large Language Model Reasoning (2410.02229v2)

Published 3 Oct 2024 in cs.AI and cs.CL

Abstract: LLMs have made significant progress in natural language understanding and generation, driven by scalable pretraining and advanced finetuning. However, enhancing reasoning abilities in LLMs, particularly via reinforcement learning from human feedback (RLHF), remains challenging due to the scarcity of high-quality preference data, which is labor-intensive to annotate and crucial for reward model (RM) finetuning. To alleviate this issue, we introduce CodePMP, a scalable preference model pretraining (PMP) pipeline that utilizes a large corpus of synthesized code-preference pairs from publicly available high-quality source code. CodePMP improves RM finetuning efficiency by pretraining preference models on large-scale synthesized code-preference pairs. We evaluate CodePMP on mathematical reasoning tasks (GSM8K, MATH) and logical reasoning tasks (ReClor, LogiQA2.0), consistently showing significant improvements in reasoning performance of LLMs and highlighting the importance of scalable preference model pretraining for efficient reward modeling.

Summary

  • The paper introduces CodePMP, a novel pipeline that pretrains preference models using synthesized code-preference pairs to enhance Large Language Model reasoning.
  • CodePMP addresses data scarcity in RLHF by automatically generating large-scale preference datasets from public source code, pairing responses from stronger and weaker CodeLLMs.
  • Experimental results show CodePMP significantly improves performance and sample efficiency on mathematical and logical reasoning tasks, demonstrating generalizability across different LLMs.

This paper introduces CodePMP (Preference Model Pretraining), a novel pipeline designed to improve the reasoning capabilities of LLMs by pretraining preference models on synthesized code-preference pairs. The core idea is to leverage the abundance of high-quality source code available publicly to generate large-scale preference datasets, which can then be used to initialize reward models effectively.

The authors address the limitations of current Reinforcement Learning from Human Feedback (RLHF) methods, which often suffer from the scarcity and high cost of annotating high-quality preference data, particularly for complex reasoning tasks. CodePMP circumvents these issues by automatically generating preference pairs from source code, capitalizing on the logical and structured nature of code to enhance LLM reasoning.

Here's a breakdown of the key components and findings:

  • CodePMP Methodology:
    • The process begins with collecting and cleaning raw code from GitHub.
    • A description summarizer, typically an instruction-tuned CodeLLM, generates prompts that describe the code's functionality.
    • Two CodeLLMs with different capabilities are then used to generate code snippets based on these prompts: a stronger CodeLLM generates a "chosen" response, while a weaker model produces a "rejected" response.
    • These <chosen, rejected> pairs are accumulated to form a large-scale synthesized preference dataset (this pair-construction step is sketched in the first code example after this list).
    • The preference model is pretrained using pairwise ranking objectives on this dataset, providing a strong initialization for downstream RM finetuning.
  • Model Design:
    • CodePMP training combines two objectives: reward modeling (RM) and language modeling (LM).
    • In RM, the model learns to assign higher scores to the chosen code through a pairwise ranking loss:

      \mathcal{L}_{RM} = -\log\big(\sigma(s_{chosen} - s_{rejected})\big)

      • \mathcal{L}_{RM}: reward modeling (pairwise ranking) loss
      • \sigma: sigmoid function
      • s_{chosen}: reward score for the chosen response
      • s_{rejected}: reward score for the rejected response
    • In LM, only the chosen code is used for autoregressive training to maintain general language capabilities.
    • The overall loss \mathcal{L}_{PMP} is the sum of the RM and LM losses (see the training-loss sketch after this list):

      \mathcal{L}_{PMP} = \mathcal{L}_{RM} + \mathcal{L}_{LM}

      • \mathcal{L}_{PMP}: code preference model pretraining loss
      • \mathcal{L}_{RM}: the pairwise ranking loss defined above
      • \mathcal{L}_{LM}: language modeling loss
  • Experimental Evaluation:
    • The effectiveness of CodePMP was evaluated on mathematical reasoning tasks (GSM8K and MATH) and logical reasoning tasks (ReClor and LogiQA2.0).
    • The models were evaluated using RM accuracy and Best-of-N (BoN) accuracy (see the Best-of-N sketch after this list).
    • The paper found that CodePMP significantly improves RM finetuning accuracy and BoN performance across all reasoning tasks. For example, reward models finetuned from a CodePMP initialization achieved higher accuracies on the reasoning holdout test sets than reward models finetuned without PMP initialization.
    • The results also indicate that CodePMP enhances the sample efficiency of RM finetuning.
  • Key Findings and Contributions:
    • CodePMP improves sample efficiency and robustness for downstream RM finetuning by using code-derived preference pairs to pretrain preference models.
    • The approach significantly enhances performance on reasoning tasks, demonstrating the positive impact of a scalable PMP process on LLM reasoning abilities.
    • The paper offers a detailed analysis of key design elements in CodePMP, providing insights for future research.
    • Increasing the number of code-preference pairs consistently improves BoN accuracy in both mathematical and logical reasoning tasks across model sizes, with no sign of diminishing returns.
  • Ablation Studies:
    • The paper includes ablation studies that compare different pair construction methods, such as using different models to generate chosen and rejected responses. Results showed that pairing positive samples from a 7B parameter model with negative samples from a 1.5B parameter model consistently delivered the best performance.
    • The paper also compares GitHub-sourced code with web-crawled data. GitHub-sourced pairs consistently outperform those from web platforms, particularly as the number of solutions (N) increases.
    • Additionally, the effects of including an end-of-context (EOC) token and of different learning rate schedulers were examined.
  • Generalizability:
    • CodePMP was validated on the Gemma-2B model, resulting in significant performance gains in both mathematical and logical reasoning, which highlights its broad applicability across diverse LLM architectures.
    • Performance gains on coding RM and general RM (RMBench) evaluations show that CodePMP not only improves reasoning tasks but also generalizes well across various tasks.
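
To make the pair-construction step concrete, here is a minimal sketch under stated assumptions: `summarizer`, `strong_model`, `weak_model`, and the `generate` helper are hypothetical stand-ins for whatever summarization and code-generation interfaces are actually used; this is not the authors' implementation.

```python
# Hypothetical sketch of CodePMP pair construction (assumed interfaces, not the
# authors' code). A summarizer turns a source file into a prompt describing its
# functionality; a stronger CodeLLM answers the prompt to produce the "chosen"
# response and a weaker CodeLLM produces the "rejected" one.

def build_preference_pair(source_code, summarizer, strong_model, weak_model, generate):
    """generate(model, prompt) -> str is an assumed inference helper."""
    # 1. Describe what the code does; the description becomes the prompt.
    prompt = generate(summarizer,
                      f"Describe the functionality of this code:\n{source_code}")
    # 2. Stronger model -> chosen response; weaker model -> rejected response.
    chosen = generate(strong_model, prompt)
    rejected = generate(weak_model, prompt)
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}
```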
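
The combined pretraining objective can be written out in the same spirit. The PyTorch sketch below implements the two losses as defined above, assuming the model exposes scalar reward scores for the chosen and rejected responses and next-token logits over the chosen code; the tensor shapes and function name are assumptions, not the paper's code.

```python
import torch
import torch.nn.functional as F

def pmp_loss(s_chosen, s_rejected, lm_logits, lm_targets):
    """Sketch of L_PMP = L_RM + L_LM (shapes are assumptions).

    s_chosen, s_rejected: (batch,) scalar reward scores from the RM head.
    lm_logits: (batch, seq_len, vocab) next-token logits over the chosen code.
    lm_targets: (batch, seq_len) shifted token ids of the chosen code.
    """
    # L_RM = -log(sigmoid(s_chosen - s_rejected)): score chosen above rejected.
    rank_loss = -F.logsigmoid(s_chosen - s_rejected).mean()
    # L_LM: autoregressive cross-entropy on the chosen response only.
    lm_loss = F.cross_entropy(lm_logits.reshape(-1, lm_logits.size(-1)),
                              lm_targets.reshape(-1))
    return rank_loss + lm_loss

# Toy check with random tensors.
batch, seq_len, vocab = 4, 16, 100
loss = pmp_loss(torch.randn(batch), torch.randn(batch),
                torch.randn(batch, seq_len, vocab),
                torch.randint(0, vocab, (batch, seq_len)))
print(loss.item())
```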
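
Best-of-N evaluation is equally simple to sketch: the finetuned reward model scores N sampled solutions per problem, the highest-scoring one is kept, and BoN accuracy is the fraction of problems whose selected solution is correct. The `score` callable below is an assumed interface, not the paper's code.

```python
def best_of_n(prompt, solutions, score):
    """Return the candidate the reward model ranks highest.

    solutions: list of N sampled solutions for the prompt.
    score(prompt, solution) -> float is an assumed RM scoring interface.
    """
    return max(solutions, key=lambda s: score(prompt, s))
```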

Overall, the CodePMP pipeline offers a scalable and cost-effective solution for enhancing LLM reasoning capabilities. By leveraging the wealth of publicly available source code, CodePMP addresses the data scarcity issue in RLHF and provides a pathway toward more efficient and robust reward modeling for reasoning tasks.

The authors outline two key directions for future work: CodePrMP, which will focus on utilizing compiler and interpreter verifiability to provide low-cost process supervision signals, and GenPMP, which will explore how to improve sample efficiency and the performance of generative reward models by integrating code data.