
Audited Reasoning Refinement: Fine-Tuning Language Models via LLM-Guided Step-Wise Evaluation and Correction (2509.12476v1)

Published 15 Sep 2025 in cs.CL

Abstract: Training a task-specific small reasoning model is challenging when direct human supervision or high-quality labels are scarce. However, LLMs with reasoning capabilities produce abundant intermediate reasoning traces that can be systematically refined to create effective supervision signals. We propose Reason-Refine-then-Align (R2tA), which turns refined model rationales into supervision for training task-specific reasoning models. Our method generates initial reasoning and responses from an open-source base model on task-specific inputs, then refines these traces, fixing hallucinations and inconsistencies, to form a high-fidelity dataset. We perform a two-stage alignment, supervised fine-tuning (SFT), followed by direct preference optimization (DPO) to calibrate the model's intermediate reasoning with human-validated conceptual preferences and then condition the final output on that aligned reasoning. As a case study, we apply R2tA to evaluate extended entity relationship diagrams (EERDs) in database system design, a structurally complex task where prompt-only methods miss or hallucinate errors. We curated a dataset of 600 EERD variants (train/test split of 450/150, respectively) with induced mistakes spanning 11 categories. Empirical evaluation suggests R2tA provides a practical, cost-effective path to scalable LLM adaptation in data-scarce domains, enabling reproducible AI tools for education and beyond.

Summary

  • The paper introduces the R²tA method that refines LLM-generated rationales into high-fidelity supervision signals.
  • It employs a four-stage process (reasoning refinement, reasoning alignment via SFT and DPO, feedback refinement, and feedback alignment) to address hallucinations and inconsistencies.
  • Experiments on an EERD evaluation task demonstrate superior F1 scores, underscoring the method’s potential in high-stakes, label-scarce environments.

Audited Reasoning Refinement: Fine-Tuning LLMs via LLM-Guided Step-Wise Evaluation and Correction

Introduction

The paper "Audited Reasoning Refinement: Fine-Tuning LLMs via LLM-Guided Step-Wise Evaluation and Correction" presents a novel approach to enhancing the accuracy and reliability of task-specific reasoning models in data-scarce domains. The challenges addressed concern generating effective reasoning models where high-quality labels and direct human supervision are limited, focusing on leveraging the inherent capabilities of LLMs in generating reasoning traces.

Methodology

The proposed method, termed Reason-Refine-then-Align (R²tA), refines model-generated rationales into supervision signals for training task-specific reasoning models. It generates initial reasoning and responses from an open-source base model on task-specific inputs, then refines these traces to correct errors such as hallucinations and inconsistencies. The refined, high-fidelity dataset serves as the basis for a two-stage alignment process: supervised fine-tuning (SFT) followed by direct preference optimization (DPO) to align intermediate reasoning with human-validated conceptual preferences.
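
The paper does not include implementation code for this stage; as a rough illustration, the PyTorch sketch below computes the standard DPO preference loss over pairs of refined (chosen) versus unrefined (rejected) rationales, assuming per-sequence log-probabilities have already been computed under the policy and the frozen SFT reference model. All names are placeholders, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Standard DPO objective: prefer the refined (chosen) rationale over the
    unrefined (rejected) one, measured relative to a frozen SFT reference model."""
    # Log-ratio of policy vs. reference for each completion (summed over tokens).
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Bradley-Terry preference loss: maximize the chosen-vs-rejected margin.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Toy usage with dummy per-sequence log-probabilities (batch of 2 preference pairs).
loss = dpo_loss(torch.tensor([-12.3, -9.8]), torch.tensor([-15.1, -11.0]),
                torch.tensor([-13.0, -10.2]), torch.tensor([-14.7, -10.9]))
print(loss.item())
```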

To illustrate the method's application, the authors evaluate R²tA on an Extended Entity Relationship Diagram (EERD) evaluation task in database system education, a domain that benefits greatly from enhanced reasoning given its graph-like, constraint-heavy nature.
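
The paper reports a dataset of 600 EERD variants with induced mistakes spanning 11 categories, but does not prescribe a data format; the dataclasses below are an illustrative assumption of how a variant and its gold error annotations might be organized for evaluation, not the paper's actual schema.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class InducedError:
    # Hypothetical annotation: one of the 11 error categories
    # (e.g., a ternary-relationship or specialization/union mistake).
    category: str
    description: str

@dataclass
class EERDVariant:
    diagram_id: str
    eerd_text: str                               # textual serialization of the diagram
    errors: List[InducedError] = field(default_factory=list)

# A toy variant with one induced specialization error (illustrative only).
variant = EERDVariant(
    diagram_id="eerd_042",
    eerd_text="ENTITY Student ISA Person; ...",
    errors=[InducedError("Specialization/Union", "Missing disjointness constraint")],
)
```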

Implementation Details

The implementation of R²tA comprises four stages (a schematic code sketch follows the list):

  1. Reasoning Refinement: The process begins by obtaining reasoning and responses from the open-source base model and refining these traces iteratively. The refinement pass improves precision and recall while maintaining coherence and readability.
  2. Reasoning Alignment: SFT calibrates reasoning coherence, and DPO aligns intermediate steps with human-validated conceptual preferences, ensuring outputs are grounded in these steps.
  3. Feedback Refinement: Analogous to reasoning refinement, this stage polishes the feedback generated by the model using guided LLM audits for factual alignment.
  4. Feedback Alignment: The refined feedback is aligned with the reasoning steps through further SFT and DPO tuning, without resorting to separate reward models, ensuring that final outputs reflect well-structured reasoning paths.
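
Read as a pipeline, the four stages chain together as sketched below. This is a schematic paraphrase under the assumption that refinement is an LLM-guided audit pass and that each alignment stage runs SFT followed by DPO; `generate`, `audit_and_refine`, `run_sft`, and `run_dpo` are hypothetical stand-ins, not the authors' code.

```python
from typing import Callable, Sequence

def r2ta_pipeline(
    base_model,
    task_inputs: Sequence[str],
    generate: Callable,          # (model, task_input) -> (reasoning, response)
    audit_and_refine: Callable,  # (trace) -> corrected trace via an LLM audit
    run_sft: Callable,           # (model, targets) -> fine-tuned model
    run_dpo: Callable,           # (model, chosen, rejected) -> preference-aligned model
):
    """Schematic R2tA flow: refine reasoning, align it, then refine and align feedback.
    Every callable is a hypothetical placeholder for a component described in the paper."""
    # Stage 1: Reasoning refinement - audit raw rationales and correct
    # hallucinations/inconsistencies to build a high-fidelity dataset.
    traces = [generate(base_model, x) for x in task_inputs]
    raw_reasoning = [t[0] for t in traces]
    raw_feedback = [t[1] for t in traces]
    refined_reasoning = [audit_and_refine(r) for r in raw_reasoning]

    # Stage 2: Reasoning alignment - SFT on refined rationales, then DPO with
    # (refined, unrefined) pairs as (chosen, rejected) preferences.
    model = run_sft(base_model, refined_reasoning)
    model = run_dpo(model, chosen=refined_reasoning, rejected=raw_reasoning)

    # Stage 3: Feedback refinement - the same audit pass applied to the
    # model's final feedback.
    refined_feedback = [audit_and_refine(f) for f in raw_feedback]

    # Stage 4: Feedback alignment - condition the final output on the aligned
    # reasoning, without training a separate reward model.
    model = run_sft(model, refined_feedback)
    model = run_dpo(model, chosen=refined_feedback, rejected=raw_feedback)
    return model
```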

Figures

Figure 1: The schema used for rubric development.

Experimental Results

Empirical results compare R²tA against six baselines, including variants without feedback DPO alignment. The R²tA framework demonstrates superior performance, achieving the highest F1 scores across complex reasoning task categories like Ternary Relationships and Specialization/Union, validating the efficacy of combining reasoning refinement and structured task feedback.
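
For context on how such scores can be computed, the sketch below derives per-category precision, recall, and F1 for error detection, assuming gold and predicted errors are represented as (diagram_id, category) pairs; this granularity is an illustrative choice, not the paper's evaluation code.

```python
from typing import Dict, Iterable, Tuple

def per_category_f1(gold: Iterable[Tuple[str, str]],
                    pred: Iterable[Tuple[str, str]]) -> Dict[str, float]:
    """F1 per error category, treating each (diagram_id, category) pair as one
    detection target (an assumed evaluation granularity)."""
    gold, pred = set(gold), set(pred)
    categories = {cat for _, cat in gold | pred}
    scores = {}
    for cat in categories:
        g = {x for x in gold if x[1] == cat}
        p = {x for x in pred if x[1] == cat}
        tp = len(g & p)
        precision = tp / len(p) if p else 0.0
        recall = tp / len(g) if g else 0.0
        scores[cat] = (2 * precision * recall / (precision + recall)
                       if precision + recall else 0.0)
    return scores

# Toy example: one correct detection, one missed, one spurious.
print(per_category_f1(
    gold=[("eerd_001", "Ternary Relationship"), ("eerd_002", "Specialization/Union")],
    pred=[("eerd_001", "Ternary Relationship"), ("eerd_003", "Specialization/Union")],
))
```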

Ablation Studies

Ablation studies elucidate the importance of the reasoning and feedback alignment steps. Incorporating DPO on top of SFT significantly improves precision by reducing false positives, underlining the necessity of preference-based alignment.

Discussion

R²tA's approach aligns with prior observations that traditional fine-tuning methods can degrade reasoning fidelity. Its iterative refinement strategy integrates conceptual validation with performance-oriented adjustments, yielding models adept at managing complex, nuanced reasoning tasks without heavy reliance on abundant labels.

Implications and Future Work

This research presents a pivotal methodology for extending LLM adaptation to domains requiring precise and reliable reasoning in high-stakes environments like education and healthcare. Its separation of reasoning refinement from output alignment presents a promising direction for constructing scalable AI tools in label-scarce yet structured domains. Future work should explore broader applications across educational contexts and expand interpretative capabilities with explicit causal modeling. Applying structured auditing to other graph-structured domains and enhancing evaluation fidelity with programmatic checkers represent logical next steps in this ongoing development.

Conclusion

By decoupling refinement from alignment and systematically leveraging improvements at each stage, R²tA represents a significant advancement in LLM adaptation strategies. It demonstrates enhanced robustness on tasks demanding rigorous reasoning accuracy and reliability, and it points toward generalizable solutions across structured, high-stakes problem domains.
