
ProtoReasoning: Prototypes as the Foundation for Generalizable Reasoning in LLMs (2506.15211v1)

Published 18 Jun 2025 in cs.CL

Abstract: Recent advances in Large Reasoning Models (LRMs) trained with Long Chain-of-Thought (Long CoT) reasoning have demonstrated remarkable cross-domain generalization capabilities. However, the underlying mechanisms supporting such transfer remain poorly understood. We hypothesize that cross-domain generalization arises from shared abstract reasoning prototypes -- fundamental reasoning patterns that capture the essence of problems across domains. These prototypes minimize the nuances of the representation, revealing that seemingly diverse tasks are grounded in shared reasoning structures. Based on this hypothesis, we propose ProtoReasoning, a framework that enhances the reasoning ability of LLMs by leveraging scalable and verifiable prototypical representations (Prolog for logical reasoning, PDDL for planning). ProtoReasoning features: (1) an automated prototype construction pipeline that transforms problems into corresponding prototype representations; (2) a comprehensive verification system providing reliable feedback through Prolog/PDDL interpreters; (3) the scalability to synthesize problems arbitrarily within prototype space while ensuring correctness. Extensive experiments show that ProtoReasoning achieves a 4.7% improvement over baseline models on logical reasoning (Enigmata-Eval), a 6.3% improvement on planning tasks, a 4.0% improvement on general reasoning (MMLU), and a 1.0% improvement on mathematics (AIME24). Significantly, our ablation studies confirm that learning in prototype space also demonstrates enhanced generalization to structurally similar problems compared to training solely on natural language representations, validating our hypothesis that reasoning prototypes serve as the foundation for generalizable reasoning in LLMs.

Summary

  • The paper introduces ProtoReasoning, a framework that uses abstract reasoning prototypes to improve LLM performance on diverse tasks.
  • It employs a dual-module approach, pairing a Prototype Constructor that standardizes problem representations with a Verification System that validates model outputs.
  • Experimental results demonstrate significant gains in logical reasoning, planning, and generalization benchmarks, confirming the framework's efficacy.

This paper introduces "ProtoReasoning," a framework designed to enhance the reasoning capabilities of LLMs by leveraging abstract reasoning prototypes. The central hypothesis is that cross-domain generalization in LLMs arises from shared, fundamental reasoning patterns, or "prototypes," which capture the essence of problems across different domains. By training models on these prototypes, the framework aims to improve their performance and generalization on tasks requiring similar underlying reasoning structures.

The ProtoReasoning framework has two main components:

  1. Prototype Constructor: This module transforms problems from natural language or other formats into their corresponding prototype representations.
  2. Verification System: This module evaluates the correctness of the model's outputs within the prototype representation space, providing reliable feedback.
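
To make the division of labor concrete, here is a minimal, hypothetical Python sketch of how the two modules might compose into a data loop. The interfaces and names below are ours for illustration; the paper does not specify this API.

```python
from typing import Protocol

class PrototypeConstructor(Protocol):
    def construct(self, problem_nl: str) -> str:
        """Translate a natural-language problem into a prototype
        representation (e.g., a Prolog program or a PDDL task)."""
        ...

class VerificationSystem(Protocol):
    def verify(self, prototype: str, model_output: str) -> bool:
        """Check a model answer against ground truth obtained by
        executing the prototype in an interpreter."""
        ...

def accept_sample(constructor: PrototypeConstructor,
                  verifier: VerificationSystem,
                  problem_nl: str,
                  model_output: str) -> bool:
    # Keep a training sample only if the model's answer verifies
    # against the interpreter-derived ground truth.
    prototype = constructor.construct(problem_nl)
    return verifier.verify(prototype, model_output)
```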

The paper focuses on two specific domains for demonstrating ProtoReasoning:

  • Logical Reasoning using Prolog: Prolog is chosen for its declarative nature, expressiveness in first-order predicate logic, and verifiability.
    • Prolog-based Logic Prototype Constructor: A four-stage, model-driven pipeline is used:
    • 1. Data Initialization: Collects reasoning problems from web sources.
    • 2. Prototype Transformation: Uses LLMs and prompt engineering to convert natural language problems into Prolog, with outputs standardized to JSON.
    • 3. Data Evolution: Employs prompt engineering to increase problem complexity while maintaining JSON output.
    • 4. Answer Derivation: Uses the SWI-Prolog interpreter to derive ground-truth answers, eliminating the need for pre-existing answer pairs.
    • The resulting dataset is $\mathcal{D}_\mathrm{Prolog} = \left\{ \langle \mathcal{Q}_{\mathrm{Prolog}}, \mathcal{A} \rangle_{i} \right\}$.
    • Prolog-based Verification System: Model predictions are also generated in JSON format, allowing direct comparison with the interpreter's JSON output for verification (a sketch of this derive-and-compare check follows this list). A specific prompt template guides the model to simulate SWI-Prolog execution.
  • Planning using PDDL (Planning Domain Definition Language): PDDL is used for its standard representation of automated planning problems, modeling state transitions, actions, preconditions, and effects.
    • PDDL-based Planning Prototype Constructor: Uses PDDL-Generator and FastDownward to create problems and solutions across domains like BlocksWorld and Logistics. Three task types are formulated:
    • 1. Plan Generation: Generate a complete action sequence.
    • 2. Plan Completion: Fill in missing steps in a partial action sequence.
    • 3. Plan Reordering: Determine a valid execution sequence from unordered actions.
    • The resulting dataset is $\mathcal{D}_\mathrm{PDDL} = \left\{ \langle \mathcal{Q}_{\mathrm{PDDL}}, \mathcal{P}_{\mathrm{ref}} \rangle_{i} \right\}$.
    • PDDL-based Verification System: Uses VAL (the PDDL plan validator) with custom checks for each task type (sketched after this list):
    • 1. Plan Generation: Accepts any VAL-verified plan. For optimization tasks, plans must also meet optimality criteria determined by FastDownward.
    • 2. Plan Completion: Model output must pass VAL and include all actions from the partial plan in their original positions.
    • 3. Plan Reordering: Model output must pass VAL and contain the same set of actions as the unordered input.
    • Prompt templates are used to guide model generation in verifiable formats.
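
The Prolog-side check can be pictured as a derive-and-compare loop: execute the prototype with SWI-Prolog to obtain a ground-truth answer in JSON, then compare the model's JSON answer against it. Below is a minimal sketch assuming `swipl` is installed; the example program, the `main` goal, and the helper names are illustrative, not the paper's actual pipeline.

```python
import json
import subprocess
import tempfile

# Illustrative Prolog prototype: facts, one rule, and a main/0 goal that
# prints the derived answer as JSON (the paper standardizes outputs to JSON).
PROGRAM = """
:- use_module(library(http/json)).

parent(alice, bob).
parent(bob, carol).

grandparent(X, Z) :- parent(X, Y), parent(Y, Z).

main :-
    findall([G, C], grandparent(G, C), Answers),
    json_write(current_output, Answers), nl.
"""

def derive_answer(program: str):
    """Run SWI-Prolog on the program and parse the JSON it prints."""
    with tempfile.NamedTemporaryFile("w", suffix=".pl", delete=False) as f:
        f.write(program)
        path = f.name
    result = subprocess.run(
        ["swipl", "-q", "-g", "main", "-t", "halt", path],
        capture_output=True, text=True, check=True,
    )
    return json.loads(result.stdout)

def verify_prolog(model_output_json: str, program: str) -> bool:
    """Accept the model's answer iff its JSON equals the interpreter's."""
    return json.loads(model_output_json) == derive_answer(program)

print(verify_prolog('[["alice", "carol"]]', PROGRAM))  # True
```

On the PDDL side, executability is delegated to VAL, while the task-specific constraints reduce to simple structural checks. Here is a sketch assuming VAL's `Validate` binary is on the PATH; the gap-marker convention in the completion check is our assumption, since the paper does not specify how partial plans encode missing steps.

```python
import subprocess
from collections import Counter

def val_accepts(domain_file: str, problem_file: str, plan_file: str) -> bool:
    """Delegate plan executability to VAL's Validate binary."""
    result = subprocess.run(
        ["Validate", domain_file, problem_file, plan_file],
        capture_output=True, text=True,
    )
    return result.returncode == 0

def check_reordering(unordered_actions, candidate_plan) -> bool:
    """Plan Reordering: the candidate must contain exactly the same
    multiset of actions as the unordered input (and also pass VAL)."""
    return Counter(unordered_actions) == Counter(candidate_plan)

def check_completion(partial_plan, candidate_plan, gap="<missing>") -> bool:
    """Plan Completion: every given action of the partial plan must stay
    at its original position; gap markers may be filled freely."""
    if len(candidate_plan) != len(partial_plan):
        return False
    return all(p == gap or p == c
               for p, c in zip(partial_plan, candidate_plan))
```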

Model Training Recipe

A three-phase Supervised Fine-Tuning (SFT) process is employed:

  1. Teacher Model Distillation: DeepSeek-R1 is used to generate explicit Chain-of-Thought (CoT) reasoning paths for the initial Prolog and PDDL datasets, creating augmented datasets $\mathcal{D}_\mathrm{Prolog}^{\mathrm{Aug}}$ and $\mathcal{D}_\mathrm{PDDL}^{\mathrm{Aug}}$. The base model is fine-tuned on these.
  2. Difficulty Stratification: The model from phase 1 is used with rejection sampling (evaluating each problem 10 times) to classify instances by difficulty (Challenging, Intermediate, Elementary). Perfectly solved or completely failed instances are excluded. An enhanced model is trained on this stratified dataset (a sketch of the bucketing logic follows this list).
  3. Quality Filtration: The model from phase 2 undergoes a final round of rejection sampling to create the definitive training dataset.
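
A sketch of the phase-2 bucketing logic follows. The 10-sample rejection sampling and the exclusion of 0/10 and 10/10 instances are from the paper; the `solve` callable (one model sample plus verification) and the pass-rate cut-offs are our illustrative assumptions.

```python
def stratify(problems, solve, n_samples=10):
    """Rejection-sample each problem n_samples times and bucket it by
    pass rate; perfectly solved and completely failed items are dropped."""
    buckets = {"Challenging": [], "Intermediate": [], "Elementary": []}
    for problem in problems:
        passes = sum(bool(solve(problem)) for _ in range(n_samples))
        if passes in (0, n_samples):
            continue  # excluded from further training, per the paper
        rate = passes / n_samples
        if rate <= 0.3:        # assumed cut-off
            buckets["Challenging"].append(problem)
        elif rate <= 0.7:      # assumed cut-off
            buckets["Intermediate"].append(problem)
        else:
            buckets["Elementary"].append(problem)
    return buckets
```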

Experimental Setup

  • Dataset: The baseline dataset includes 100K samples from the Seed Project. The ProtoReasoning training used 4,196 high-quality Prolog problems and 2,424 PDDL tasks after the three-phase processing.
  • Evaluation:
    • Logical Reasoning: Enigmata-Eval benchmark (excluding 4 datasets due to instruction-following issues).
    • Planning: Internal planning test set (Task Planning) and the Nexus-Hard benchmark (function calling, used as an indirect measure of planning).
    • Out-of-Domain Generalization: MMLU (general reasoning) and AIME24 (mathematical reasoning).
    • All benchmarks are evaluated zero-shot, with results averaged over 3 runs (10 for AIME24).
  • Model: A Mixture-of-Experts (MoE) architecture with 15B activated parameters (150B total). Key hyperparameters include a learning rate of 2e-5, batch size of 6, AdamW optimizer, and a maximum sequence length of 32768.

Experimental Results

ProtoReasoning showed significant improvements:

  • Enigmata-Eval (Logical Reasoning): 42.0% (up 4.7 points from the 37.3% baseline).
  • Nexus-Hard (Planning via Function Calling): 59.5% (up 6.4 points from the 53.1% baseline).
  • Task Planning (Direct Planning): 53.0% (up 6.3 points from the 46.7% baseline).
  • MMLU (General Reasoning): 86.7% (up 4.0 points from the 82.7% baseline).
  • AIME24 (Mathematical Reasoning): 73.0% (up 1.0 point from the 72.0% baseline).

Ablation Study

  • Setup: The Enigmata-Eval benchmark was split into a prototype-transfer set and a development set. A matched training corpus of 453 samples was created, with each problem available in both Prolog and natural-language formats.
    • Method 1 (Baseline): Trained on standard dataset.
    • Method 2 (+ Prolog): Standard dataset + Prolog versions of Enigmata problems.
    • Method 3 (+ Natural Language): Standard dataset + natural language versions of the same Enigmata problems.
  • Evaluation:
    • Prototype Transfer Set: Original natural language versions of Enigmata problems used for Prolog training.
    • Development Set: Remaining Enigmata-Eval problems not in the transfer set.
  • Results:
    • Training with Prolog representations (Method 2) significantly outperformed the baseline (Method 1) on both transfer and development sets, showing effective generalization to natural language problems.
    • Method 2 achieved performance comparable to Method 3 (training on natural language), supporting the hypothesis that prototype training captures generalizable reasoning structures.
    • An experiment training on Prolog without CoT showed dramatically reduced performance, highlighting the importance of explicit reasoning processes for generalization through prototypes.
    • Category-wise analysis showed that with sufficient samples, Prolog prototype training matched or exceeded natural language training performance.

Conclusion and Future Work

The paper concludes that ProtoReasoning validates the hypothesis that abstract reasoning prototypes form a foundation for cross-domain generalization. Training on Prolog and PDDL representations improved logical reasoning and planning, with effective transfer to structurally similar problems. The authors suggest the framework could be generalized to other LLM capabilities. Future work includes:

  • Developing more rigorous mathematical frameworks for "reasoning prototypes."
  • Investigating the underlying mechanisms of cross-domain transfer more deeply.
  • Open-sourcing the curated Prolog and PDDL datasets.
  • Reproducing results on open-source LLMs for broader validation.