- The paper introduces ProtoReasoning, a framework that uses abstract reasoning prototypes to improve LLM performance on diverse tasks.
- It employs a dual-module approach with a Prototype Constructor for standardizing problem representations and a Verification System to validate outputs.
- Experimental results show significant gains on logical reasoning, planning, and out-of-domain generalization benchmarks, supporting the framework's efficacy.
This paper introduces "ProtoReasoning," a framework designed to enhance the reasoning capabilities of LLMs by leveraging abstract reasoning prototypes. The central hypothesis is that cross-domain generalization in LLMs arises from shared, fundamental reasoning patterns, or "prototypes," which capture the essence of problems across different domains. By training models on these prototypes, the framework aims to improve their performance and generalization on tasks requiring similar underlying reasoning structures.
The ProtoReasoning framework has two main components:
- Prototype Constructor: This module transforms problems from natural language or other formats into their corresponding prototype representations.
- Verification System: This module evaluates the correctness of the model's outputs within the prototype representation space, providing reliable feedback.
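A minimal interface sketch of how these two modules could fit together; the class and method names below are illustrative, not taken from the paper.

```python
from abc import ABC, abstractmethod

class PrototypeConstructor(ABC):
    """Transforms a problem (e.g., stated in natural language) into a prototype
    representation such as a Prolog program or a PDDL problem."""
    @abstractmethod
    def to_prototype(self, problem: str) -> str:
        ...

class VerificationSystem(ABC):
    """Judges a model output against ground truth inside the prototype
    representation space, yielding reliable binary feedback."""
    @abstractmethod
    def verify(self, prototype: str, model_output: str) -> bool:
        ...
```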
The paper focuses on two specific domains for demonstrating ProtoReasoning:
- Logical Reasoning using Prolog: Prolog is chosen for its declarative nature, expressiveness in first-order predicate logic, and verifiability.
- Prolog-based Logic Prototype Constructor: A four-stage, model-driven pipeline is used:
- 1. Data Initialization: Collects reasoning problems from web sources.
- 2. Prototype Transformation: Uses LLMs and prompt engineering to convert natural language problems into Prolog, with outputs standardized to JSON.
- 3. Data Evolution: Employs prompt engineering to increase problem complexity while maintaining JSON output.
- 4. Answer Derivation: Uses the SWI-Prolog interpreter to derive ground-truth answers, eliminating the need for pre-existing answer pairs.
- The resulting dataset is $\mathcal{D}_{\mathrm{Prolog}} = \left\{ \langle \mathcal{Q}_{\mathrm{Prolog}}, \mathcal{A} \rangle_{i} \right\}$.
- Prolog-based Verification System: Model predictions are also generated in JSON format, allowing for direct comparison with the interpreter's JSON output for verification. A specific prompt template guides the model to simulate SWI-Prolog execution.
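A minimal sketch of answer derivation and verification for the Prolog domain, assuming SWI-Prolog is available as the `swipl` binary and that each generated program defines a `main/0` goal that prints its answer as JSON; the goal name, file layout, and exact JSON convention are assumptions for illustration.

```python
import json
import subprocess

def derive_answer(prolog_program_path: str) -> dict:
    """Run SWI-Prolog on a generated program and parse the JSON it prints.
    Assumes the program's `main/0` goal writes its answer to stdout."""
    result = subprocess.run(
        ["swipl", "-q", "-g", "main", "-t", "halt", prolog_program_path],
        capture_output=True, text=True, check=True,
    )
    return json.loads(result.stdout)

def verify(model_output: str, ground_truth: dict) -> bool:
    """Compare the model's JSON-formatted prediction with the interpreter-derived answer."""
    try:
        return json.loads(model_output) == ground_truth
    except json.JSONDecodeError:
        return False
```

Because both the interpreter output and the model prediction are standardized to JSON, verification reduces to parsing and an equality check.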
- Planning using PDDL (Planning Domain Definition Language): PDDL is used for its standard representation of automated planning problems, modeling state transitions, actions, preconditions, and effects.
- PDDL-based Planning Prototype Constructor: Uses PDDL-Generator and the Fast Downward planner to create problems and reference solutions across domains like BlocksWorld and Logistics. Three task types are formulated:
- 1. Plan Generation: Generate a complete action sequence.
- 2. Plan Completion: Fill in missing steps in a partial action sequence.
- 3. Plan Reordering: Determine a valid execution sequence from unordered actions.
- The resulting dataset is $\mathcal{D}_{\mathrm{PDDL}} = \left\{ \langle \mathcal{Q}_{\mathrm{PDDL}}, \mathcal{P}_{\mathrm{ref}} \rangle_{i} \right\}$.
- PDDL-based Verification System: Uses VAL (the PDDL plan validator) together with task-specific checks for each task type (a sketch of these checks follows this list):
- 1. Plan Generation: Accepts any VAL-verified plan. For optimization tasks, plans must also meet optimality criteria determined by Fast Downward.
- 2. Plan Completion: Model output must pass VAL and include all actions from the partial plan in their original positions.
- 3. Plan Reordering: Model output must pass VAL and contain the same set of actions as the unordered input.
- Prompt templates are used to guide model generation in verifiable formats.
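The task-specific checks layered on top of VAL can be expressed as simple sequence and multiset tests. The sketch below assumes plans are already parsed into lists of action strings and that gaps in a partial plan are marked with `None`; these conventions and the function names are assumptions for illustration, and VAL itself is still run separately on the candidate plan.

```python
from collections import Counter

def completion_preserves_positions(partial, completed):
    """Plan completion: every given action of the partial plan must appear
    unchanged at its original position; `None` marks a step the model fills in."""
    if len(partial) != len(completed):
        return False
    return all(p is None or p == c for p, c in zip(partial, completed))

def reordering_preserves_actions(unordered, plan):
    """Plan reordering: the proposed plan must contain exactly the actions
    given in the unordered input (compared as a multiset)."""
    return Counter(unordered) == Counter(plan)
```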
Model Training Recipe
A three-phase Supervised Fine-Tuning (SFT) process is employed:
- Teacher Model Distillation: DeepSeek-R1 is used to generate explicit Chain-of-Thought (CoT) reasoning paths for the initial Prolog and PDDL datasets, producing augmented datasets $\mathcal{D}_{\mathrm{Prolog}}^{\mathrm{Aug}}$ and $\mathcal{D}_{\mathrm{PDDL}}^{\mathrm{Aug}}$. The base model is fine-tuned on these.
- Difficulty Stratification: The model from phase 1 is used with rejection sampling (evaluating each problem 10 times) to classify instances by difficulty (Challenging, Intermediate, Elementary). Perfectly solved or completely failed instances are excluded. An enhanced model is trained on this stratified dataset.
- Quality Filtration: The model from phase 2 undergoes a final round of rejection sampling to create the definitive training dataset.
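A minimal sketch of the difficulty-stratification step, assuming each problem is attempted 10 times and bucketed by pass count; the specific cutoffs are illustrative, since the summary above only names the three tiers and the exclusion rule.

```python
def stratify_by_difficulty(pass_counts, attempts=10):
    """Label problems by how often the phase-1 model solved them under
    rejection sampling; drop instances solved always or never."""
    tiers = {}
    for problem_id, passes in pass_counts.items():
        if passes == 0 or passes == attempts:
            continue  # exclude completely failed / perfectly solved instances
        rate = passes / attempts
        if rate < 0.3:        # illustrative cutoff
            tiers[problem_id] = "Challenging"
        elif rate < 0.7:      # illustrative cutoff
            tiers[problem_id] = "Intermediate"
        else:
            tiers[problem_id] = "Elementary"
    return tiers
```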
Experimental Setup
- Dataset: The baseline dataset includes 100K samples from the Seed Project. The ProtoReasoning training used 4,196 high-quality Prolog problems and 2,424 PDDL tasks after the three-phase processing.
- Evaluation:
- Logical Reasoning: Enigmata-Eval benchmark (excluding 4 datasets due to instruction-following issues).
- Planning: Internal planning test set (Task Planning) and the Nexus-Hard benchmark (function calling, used as an indirect measure of planning ability).
- Out-of-Domain Generalization: MMLU (general reasoning) and AIME24 (mathematical reasoning).
- All benchmarks are evaluated zero-shot, with results averaged over 3 runs (10 runs for AIME24).
- Model: A Mixture-of-Experts (MoE) architecture with 15B activated parameters (150B total). Key hyperparameters include a learning rate of 2e-5, batch size of 6, AdamW optimizer, and a maximum sequence length of 32768.
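For concreteness, the reported training hyperparameters can be grouped into a single config; the field names and structure below are assumptions, only the values stated above come from the paper.

```python
from dataclasses import dataclass

@dataclass
class SFTConfig:
    learning_rate: float = 2e-5
    batch_size: int = 6
    optimizer: str = "AdamW"
    max_seq_length: int = 32768
    activated_params_b: int = 15   # MoE: 15B activated parameters
    total_params_b: int = 150      # roughly 150B parameters in total
```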
Experimental Results
ProtoReasoning showed significant improvements:
- Enigmata-Eval (Logical Reasoning): 42.0%, up 4.7 percentage points from the 37.3% baseline.
- Nexus-Hard (Planning via Function Calling): 59.5%, up 6.4 points from the 53.1% baseline.
- Task Planning (Direct Planning): 53.0%, up 6.3 points from the 46.7% baseline.
- MMLU (General Reasoning): 86.7%, up 4.0 points from the 82.7% baseline.
- AIME24 (Mathematical Reasoning): 73.0%, up 1.0 point from the 72.0% baseline.
Ablation Study
- Setup: The Enigmata-Eval benchmark was split into a prototype transfer set and a development set (the split is sketched after this list). A matched training corpus of 453 samples was created, with each problem available in both Prolog and natural-language formats.
- Method 1 (Baseline): Trained on standard dataset.
- Method 2 (+ Prolog): Standard dataset + Prolog versions of Enigmata problems.
- Method 3 (+ Natural Language): Standard dataset + natural language versions of the same Enigmata problems.
- Evaluation:
- Prototype Transfer Set: Original natural language versions of Enigmata problems used for Prolog training.
- Development Set: Remaining Enigmata-Eval problems not in the transfer set.
- Results:
- Training with Prolog representations (Method 2) significantly outperformed the baseline (Method 1) on both transfer and development sets, showing effective generalization to natural language problems.
- Method 2 achieved performance comparable to Method 3 (training on natural language), supporting the hypothesis that prototype training captures generalizable reasoning structures.
- An experiment training on Prolog without CoT showed dramatically reduced performance, highlighting the importance of explicit reasoning processes for generalization through prototypes.
- Category-wise analysis showed that with sufficient samples, Prolog prototype training matched or exceeded natural language training performance.
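The ablation's evaluation split can be stated as a simple partition of Enigmata-Eval; the identifiers and function name below are illustrative, not from the paper.

```python
def make_ablation_splits(enigmata_eval_ids, prolog_trained_ids):
    """Prototype transfer set: natural-language originals of the problems whose
    Prolog versions were used in Method 2's training corpus.
    Development set: the remaining Enigmata-Eval problems."""
    transfer_set = enigmata_eval_ids & prolog_trained_ids
    development_set = enigmata_eval_ids - prolog_trained_ids
    return transfer_set, development_set
```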
Conclusion and Future Work
The paper concludes that ProtoReasoning validates the hypothesis that abstract reasoning prototypes form a foundation for cross-domain generalization. Training on Prolog and PDDL representations improved logical reasoning and planning, with effective transfer to structurally similar problems. The authors suggest the framework could be generalized to other LLM capabilities.
Future work includes:
- Developing more rigorous mathematical frameworks for "reasoning prototypes."
- Investigating the underlying mechanisms of cross-domain transfer more deeply.
- Open-sourcing the curated Prolog and PDDL datasets.
- Reproducing results on open-source LLMs for broader validation.