
General-Reasoner: Advancing LLM Reasoning Across All Domains (2505.14652v5)

Published 20 May 2025 in cs.CL

Abstract: Reinforcement learning (RL) has recently demonstrated strong potential in enhancing the reasoning capabilities of LLMs. Particularly, the "Zero" reinforcement learning introduced by Deepseek-R1-Zero, enables direct RL training of base LLMs without relying on an intermediate supervised fine-tuning stage. Despite these advancements, current works for LLM reasoning mainly focus on mathematical and coding domains, largely due to data abundance and the ease of answer verification. This limits the applicability and generalization of such models to broader domains, where questions often have diverse answer representations, and data is more scarce. In this paper, we propose General-Reasoner, a novel training paradigm designed to enhance LLM reasoning capabilities across diverse domains. Our key contributions include: (1) constructing a large-scale, high-quality dataset of questions with verifiable answers curated by web crawling, covering a wide range of disciplines; and (2) developing a generative model-based answer verifier, which replaces traditional rule-based verification with the capability of chain-of-thought and context-awareness. We train a series of models and evaluate them on a wide range of datasets covering wide domains like physics, chemistry, finance, electronics etc. Our comprehensive evaluation across these 12 benchmarks (e.g. MMLU-Pro, GPQA, SuperGPQA, TheoremQA, BBEH and MATH AMC) demonstrates that General-Reasoner outperforms existing baseline methods, achieving robust and generalizable reasoning performance while maintaining superior effectiveness in mathematical reasoning tasks.

Summary

  • The paper introduces General-Reasoner, a novel paradigm that improves LLM reasoning by leveraging a diverse, verifiable dataset and a model-based verifier.
  • It details a generative verifier that reasons through a chain-of-thought to score model responses and provide reward signals for reinforcement learning.
  • Results show that General-Reasoner achieves state-of-the-art performance on reasoning benchmarks, demonstrating robust, multi-domain applicability.

This paper introduces General-Reasoner, a novel training paradigm aimed at improving the reasoning capabilities of LLMs across a wide variety of domains, moving beyond the traditional focus on mathematics and coding. The core contributions are the creation of a diverse, large-scale dataset with verifiable answers and the development of a generative model-based verifier to facilitate effective reinforcement learning (RL) training.

The challenge addressed is that existing RL methods for LLM reasoning are limited by the scarcity of high-quality, verifiable data outside of math and coding, and by rule-based answer verification, which struggles with the diverse answer formats found in other disciplines. General-Reasoner tackles these issues by:

  1. Constructing a Diverse Verifiable Reasoning Dataset (WebInstruct-verified): The dataset starts from the large WebInstruct dataset [yue2024mammoth2]; the original web sources are re-crawled to extract explicit question-answer pairs verified by humans. State-of-the-art LLMs (Gemini-1.5-Pro and Gemini-2.0-Flash) are then used to filter for questions with clearly verifiable short answers, categorize them by subject and answer type, and filter out unsolvable or trivially easy questions based on generated candidate solutions (a sketch of this filtering step appears after this list). The final dataset contains approximately 230,000 high-quality reasoning questions spanning diverse domains such as physics, chemistry, finance, and electronics, with varied answer formats including multiple-choice, numerical expressions, matrices, and strings. This dataset is highlighted as a key resource for training generalizable reasoning.
  2. Developing a Generative Model-Based Verifier (General-Verifier): Recognizing the limitations of rigid rule-based verifiers on diverse answer types (semantic insensitivity, lack of generality), the authors propose a compact generative model-based verifier. This verifier is trained on the candidate solutions and verification annotations produced by Gemini-2.0-Flash during dataset creation. It is a 1.5B-parameter model initialized from Qwen2.5-Math-1.5B [yang2024qwen25mathtechnicalreportmathematical]. Given a question, a ground-truth answer, and a student-generated answer, it generates a chain-of-thought and a binary prediction (y_label) indicating whether the student answer is equivalent to the ground truth in context. This approach provides a more flexible and robust way of obtaining accurate reward signals for RL training across diverse domains, especially for non-mathematical tasks where answers can have varied representations.
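
The sketch below illustrates the difficulty-filtering step from item 1. The keep/drop criterion and the `generate_candidates` / `answers_match` helpers are assumptions standing in for calls to a strong teacher model (Gemini-2.0-Flash in the paper's pipeline); this is not the paper's exact procedure.

```python
# Hypothetical sketch of the solvability/difficulty filter from step 1.
# `generate_candidates` and `answers_match` stand in for teacher-model calls
# and are not the paper's actual API.

def keep_question(question: str, gold_answer: str,
                  generate_candidates, answers_match,
                  n_candidates: int = 8) -> bool:
    """Keep a question only if it is neither unsolvable nor trivially easy."""
    candidates = generate_candidates(question, n=n_candidates)
    n_correct = sum(answers_match(c, gold_answer, question) for c in candidates)
    if n_correct == 0:             # no candidate solves it -> likely unsolvable or mislabeled
        return False
    if n_correct == n_candidates:  # every candidate solves it -> trivially easy
        return False
    return True
```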
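
To make the verifier's role in item 2 concrete, here is a minimal sketch of calling such a generative verifier with Hugging Face Transformers. The model id, prompt template, and the "Final Decision:" parsing convention are assumptions for illustration, not the released model's exact interface.

```python
# Minimal sketch of querying a generative verifier such as General-Verifier.
from transformers import AutoModelForCausalLM, AutoTokenizer

VERIFIER = "TIGER-Lab/general-verifier"  # hypothetical model id

tokenizer = AutoTokenizer.from_pretrained(VERIFIER)
model = AutoModelForCausalLM.from_pretrained(VERIFIER, device_map="auto")

def verify(question: str, gold_answer: str, student_answer: str) -> bool:
    # Assumed prompt layout: question, ground truth, and the student's answer.
    prompt = (
        f"Question: {question}\n"
        f"Ground truth answer: {gold_answer}\n"
        f"Student answer: {student_answer}\n"
        "Reason step by step about whether the student answer is equivalent to "
        "the ground truth in the context of the question, then end with "
        "'Final Decision: Yes' or 'Final Decision: No'."
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=512, do_sample=False)
    text = tokenizer.decode(output[0][inputs["input_ids"].shape[1]:],
                            skip_special_tokens=True)
    # The binary verdict (y_label) is read off the generated chain-of-thought.
    return "final decision: yes" in text.lower()
```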

The General-Reasoner training paradigm utilizes the Zero RL setting, directly applying reinforcement learning (specifically, the GRPO algorithm [grpo]) to base LLMs without an initial supervised fine-tuning step. Models were initialized from Qwen2.5 and Qwen3 base models of various sizes (4B, 7B, 14B). The model-based verifier is integrated into the GRPO training loop to provide reward signals. The reward structure assigns a positive reward (1.0 with a length penalty) for answers verified as correct by the General-Verifier and a negative reward (-0.5) if solution extraction fails.
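
A minimal sketch of this reward scheme is shown below. The zero reward for extracted-but-incorrect answers and the linear form of the length penalty are assumptions; the paper only specifies +1.0 (with a length penalty) for verified answers and -0.5 when extraction fails. The `extract_final_answer` and `verify` callables are placeholders (the latter could be the verifier sketch above).

```python
# Sketch of the reward described above; the handling of extracted-but-wrong
# answers (0.0) and the linear length penalty are assumptions.

def compute_reward(response: str, question: str, gold_answer: str,
                   extract_final_answer, verify,
                   max_len: int = 4096, penalty_scale: float = 0.1) -> float:
    student_answer = extract_final_answer(response)   # e.g. parse a boxed final answer
    if student_answer is None:
        return -0.5                                    # could not extract a final answer
    if not verify(question, gold_answer, student_answer):
        return 0.0                                     # extracted but judged incorrect (assumed)
    length_penalty = penalty_scale * min(len(response) / max_len, 1.0)
    return 1.0 - length_penalty                        # verified correct, mildly penalize length
```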

Practical Implementation Details:

  • Dataset: The paper constructs and releases the WebInstruct-verified dataset; replicating or applying the method requires access to it.
  • Verifier: The Generative Model-Based Verifier is a separate, smaller LLM (1.5B parameters). It needs to be trained separately using data derived from a powerful teacher model (like Gemini-2.0-Flash in the paper's pipeline). Once trained, this verifier model is used during the RL training phase to score the generated responses. Implementing the verifier involves:
    • Choosing a base model for the verifier (e.g., a small Qwen model).
    • Gathering or generating training data for the verifier (question, ground truth, student answer, verification label + optional CoT).
    • Fine-tuning the verifier model on this data (see the data-formatting sketch after this list).
    • Integrating the trained verifier into the RL training loop to provide rewards.
    • A typical input/output example for the verifier shows it reasoning about the equivalence of different mathematical expressions for the same solution.
  • RL Training: The implementation is based on the verl repository. This involves:
    • Loading the base LLM (e.g., Qwen3-14B-Base).
    • Loading the trained General-Verifier model.
    • Setting up the GRPO algorithm.
    • Processing batches of questions from the WebInstruct-verified dataset.
    • Generating multiple responses (rollouts) for each question using the current policy model.
    • Using the General-Verifier to evaluate the generated responses and assign rewards (a sketch of this step follows this list).
    • Calculating the GRPO objective and performing policy updates.
    • Hyperparameters for training are detailed in the appendix (e.g., learning rate 5e-7, batch sizes, clipping ratios, KL coefficients, temperature, rollout number). Training requires significant computational resources (multiple nodes with H100 GPUs).
  • Inference: The trained General-Reasoner model is a standard LLM and can be used for inference like any other. The verifier is primarily needed during training. While the trained models can generate chain-of-thought outputs, the evaluation shows that even when instructed to provide a final answer directly (non-think mode), their performance is strong, suggesting robust underlying reasoning.
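
As referenced in the verifier bullet above, the following sketch shows one hypothetical way to format the teacher-derived annotations into (prompt, completion) pairs for fine-tuning the 1.5B verifier. Field names and the layout are illustrative, not the paper's exact schema.

```python
# Hypothetical formatting of verifier training examples derived from the
# teacher model during dataset construction.

def format_verifier_example(ex: dict) -> dict:
    """Turn one annotation into a (prompt, completion) pair for causal-LM SFT."""
    prompt = (
        f"Question: {ex['question']}\n"
        f"Ground truth answer: {ex['gold_answer']}\n"
        f"Student answer: {ex['student_answer']}\n"
        "Is the student answer equivalent to the ground truth? "
        "Reason step by step, then answer Yes or No.\n"
    )
    # Target = teacher chain-of-thought followed by the binary label (y_label).
    completion = f"{ex['teacher_cot']}\nFinal Decision: {'Yes' if ex['label'] else 'No'}"
    return {"prompt": prompt, "completion": completion}

# The resulting pairs can be fed to any standard SFT pipeline to fine-tune a
# small base model (e.g. Qwen2.5-Math-1.5B) into the verifier.
```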
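
And here is a framework-agnostic sketch of the rollout-and-scoring step inside GRPO training, reusing a verifier-based reward function like the one above. A real verl-based implementation additionally handles token-level advantages, clipping, KL regularization, and distributed rollouts; `policy.generate` and `policy.update` are placeholder interfaces, not verl's API.

```python
# Sketch of one GRPO rollout-and-scoring step. `policy` is a placeholder with
# generate/update methods; `reward_fn` wraps the verifier-based reward above.
import statistics

def grpo_step(policy, questions, gold_answers, reward_fn, group_size=8):
    batch = []
    for q, gold in zip(questions, gold_answers):
        # Generate a group of rollouts for this question with the current policy.
        responses = [policy.generate(q) for _ in range(group_size)]
        # Score each rollout with the verifier-based reward.
        rewards = [reward_fn(r, q, gold) for r in responses]
        # GRPO uses group statistics as the baseline: normalize within the group.
        mean_r = statistics.mean(rewards)
        std_r = statistics.pstdev(rewards) or 1.0   # guard against zero variance
        advantages = [(r - mean_r) / std_r for r in rewards]
        batch.append({"question": q, "responses": responses, "advantages": advantages})
    # Clipped policy-gradient update driven by the per-group advantages.
    policy.update(batch)
    return batch
```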

Real-World Applications:

The General-Reasoner approach expands the applicability of RL-trained LLMs beyond specialized domains like math or coding. This is particularly useful for applications requiring reasoning across various scientific, technical, or even humanities fields, such as:

  • Expert Q&A Systems: Building models that can answer complex questions in specific domains (e.g., physics, chemistry, finance, engineering) with high accuracy and verifiable solutions.
  • Educational Tools: Creating AI tutors or assistants that can help students solve problems and understand concepts across a broad curriculum.
  • Knowledge Assistants: Developing systems that can perform analysis and provide justified answers in professional settings like finance or legal research.
  • Complex Problem Solving: Empowering AI to tackle multi-disciplinary problems that require combining knowledge from different fields.

Implementation Considerations:

  • Computational Resources: Training General-Reasoner models, especially larger ones, is computationally expensive, requiring multi-GPU setups.
  • Data Quality: The quality and diversity of the WebInstruct-verified dataset are critical. The rigorous filtering process described in the paper is essential for effective training.
  • Verifier Accuracy: The performance of the General-Reasoner model is directly tied to the accuracy and reliability of the General-Verifier. Training a high-quality verifier is a prerequisite. The paper shows the model-based verifier significantly outperforms rule-based ones, especially on diverse domains and answer types.
  • Generalization vs. Specialization: While General-Reasoner aims for broad generalization, fine-tuning on domain-specific datasets after this general training could further enhance performance on highly specialized tasks.
  • Response Length and Efficiency: The paper notes that their models do not suffer from excessive output length, leading to faster inference compared to some other reasoning methods. This is a practical benefit for deployment.

The empirical evaluation demonstrates that General-Reasoner models, trained with diverse data and the model-based verifier, achieve state-of-the-art performance among open-source models on general reasoning benchmarks (MMLU-Pro, GPQA, SuperGPQA, TheoremQA, BBEH) and maintain strong performance on math benchmarks, often surpassing models specifically trained on math data. The best model, General-Reasoner-Qw3-14B, shows performance competitive with or exceeding GPT-4o on some general benchmarks like GPQA and TheoremQA. Ablation studies confirm the importance of both the diverse dataset and the model-based verifier for achieving these results.