AutoRule: Reasoning Chain-of-thought Extracted Rule-based Rewards Improve Preference Learning (2506.15651v1)

Published 18 Jun 2025 in cs.LG, cs.AI, and cs.CL

Abstract: Rule-based rewards offer a promising strategy for improving reinforcement learning from human feedback (RLHF), but current approaches often rely on manual rule engineering. We present AutoRule, a fully automated method for extracting rules from preference feedback and formulating them into rule-based rewards. AutoRule extraction operates in three stages: it leverages a reasoning model to interpret user preferences, identifies candidate rules from the reasoning chain of these interpretations, and synthesizes them into a unified rule set. Leveraging the finalized rule set, we employ language-model verifiers to compute the fraction of rules satisfied by each output, using this metric as an auxiliary reward alongside the learned reward model during policy optimization. Training a Llama-3-8B model with AutoRule results in a 28.6% relative improvement in length-controlled win rate on AlpacaEval2.0, and a 6.1% relative gain in second-turn performance on a held-out MT-Bench subset, compared to a GRPO baseline trained with the same learned reward model but without the rule-based auxiliary reward. Our analysis confirms that the extracted rules exhibit good agreement with dataset preference. We find that AutoRule demonstrates reduced reward hacking compared to a learned reward model when run over two episodes. Finally, our case study suggests that the extracted rules capture unique qualities valued in different datasets. The extracted rules are provided in the appendix, and the code is open-sourced at https://github.com/cxcscmu/AutoRule.


Summary

Rule-Based Rewards in RLHF

The paper "Reasoning Chain-of-thought Extracted Rule-based Rewards Improve Preference Learning," authored by Tevin Wang and Chenyan Xiong from Carnegie Mellon University's School of Computer Science, explores an innovative approach to optimize Reinforcement Learning from Human Feedback (RLHF) in LLMs. The work addresses the prevalent challenges associated with utilizing manually engineered rules for RLHF by presenting an automated framework for extracting rules from preference data.

Methodology Overview

The authors introduce a pipeline that formulates rule-based rewards by leveraging the reasoning capabilities of advanced LLMs. Extraction proceeds in three stages: a reasoning model first produces chains of thought that interpret why one response in a preference pair is preferred over the other; candidate rules are then identified from these reasoning chains; and the candidates are finally synthesized into a consolidated, unified rule set. The finalized rules serve as objective criteria for a language-model verifier, which computes the fraction of rules satisfied by each output. This score is used as an auxiliary reward during policy optimization, alongside the learned reward model.
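As a rough sketch of how such a pipeline could be wired together, the snippet below mirrors the three extraction stages using a generic chat-completion callable. The prompts, helper names (generate_reasoning_chain, extract_candidate_rules, synthesize_rule_set), and line-based parsing are illustrative assumptions rather than the authors' implementation; the open-sourced AutoRule code is the authoritative reference.

```python
# Minimal sketch of AutoRule-style three-stage rule extraction (illustrative only).
# `llm` stands for any chat-completion callable (prompt string in, text out);
# the prompts and parsing below are assumptions, not the paper's exact prompts.
from typing import Callable, List

def generate_reasoning_chain(llm: Callable[[str], str],
                             prompt: str, chosen: str, rejected: str) -> str:
    """Stage 1: ask a reasoning model to explain why `chosen` was preferred."""
    return llm(
        "Explain step by step why Response A is preferred over Response B.\n"
        f"Prompt: {prompt}\nResponse A: {chosen}\nResponse B: {rejected}"
    )

def extract_candidate_rules(llm: Callable[[str], str], chain: str) -> List[str]:
    """Stage 2: turn one reasoning chain into explicit candidate rules."""
    raw = llm(
        "From the reasoning below, list general rules (one per line) that a "
        f"good response should satisfy:\n{chain}"
    )
    return [line.strip("- ").strip() for line in raw.splitlines() if line.strip()]

def synthesize_rule_set(llm: Callable[[str], str],
                        candidates: List[str], max_rules: int = 30) -> List[str]:
    """Stage 3: deduplicate and merge candidates into a unified rule set."""
    raw = llm(
        f"Merge the following candidate rules into at most {max_rules} "
        "non-redundant rules, one per line:\n" + "\n".join(candidates)
    )
    return [line.strip("- ").strip() for line in raw.splitlines() if line.strip()]
```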

To keep verification robust, the rule-based system relies on binary rule-satisfaction judgments, which simplifies the reward signal and reduces the likelihood of reward hacking. The authors implement their framework with a Llama-3-8B model and report empirical results showing substantial improvements over baselines trained without the rule-based auxiliary reward.
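To make the reward computation concrete, here is a minimal sketch, under assumed conventions, of how binary verifier judgments could be aggregated into a satisfaction fraction and folded into the scalar reward used during policy optimization. The verifier prompt, the YES/NO parsing, and the mixing weight lambda_rule are placeholders for illustration, not the paper's exact formulation.

```python
# Illustrative sketch: rule-based auxiliary reward combined with a learned RM score.
# `verifier` is any LLM callable returning text; `rm_score` is the learned reward
# model's scalar output. The default mixing weight is an arbitrary placeholder.
from typing import Callable, List

def rule_satisfaction_fraction(verifier: Callable[[str], str],
                               rules: List[str], prompt: str, response: str) -> float:
    """Ask the verifier a binary YES/NO question per rule; return the fraction satisfied."""
    satisfied = 0
    for rule in rules:
        verdict = verifier(
            f"Rule: {rule}\nPrompt: {prompt}\nResponse: {response}\n"
            "Does the response satisfy the rule? Answer YES or NO."
        )
        satisfied += verdict.strip().upper().startswith("YES")
    return satisfied / len(rules) if rules else 0.0

def combined_reward(rm_score: float, rule_fraction: float,
                    lambda_rule: float = 0.5) -> float:
    """Add the rule-satisfaction fraction as an auxiliary term to the learned RM score."""
    return rm_score + lambda_rule * rule_fraction
```

In a GRPO-style setup, this combined scalar would simply take the place of the plain reward-model score when computing group-relative advantages; the paper's actual weighting and normalization may differ.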

Experimental Validation

The authors validate their approach through extensive experiments, extracting rules from human preference data via LLM-generated reasoning chains. Training a Llama-3-8B model with AutoRule yields a 28.6% relative improvement in length-controlled win rate on AlpacaEval2.0 and a 6.1% relative gain in second-turn performance on a held-out MT-Bench subset, compared with a GRPO baseline trained with the same learned reward model but without the rule-based auxiliary reward. Their analysis further shows that the extracted rules agree well with the dataset preferences and that AutoRule exhibits less reward hacking than the learned reward model when trained over two episodes.

Comparative Analysis and Implications

The paper critically compares reasoning chain-based rule extraction with rule derivation from model-generated justifications. The experiments show that rules extracted from reasoning chains contribute more effectively to model performance, as reflected in higher win rates and reduced over-optimization.

The implications of this research are twofold. Practically, automating rule extraction reduces the dependency on costly manual engineering, offering a scalable solution for alignment tasks. Theoretically, the paper argues that rule-based objectives, especially those distilled from reasoning chains, provide more interpretable alignment signals and could guide future developments in AI alignment and RLHF.

Future Developments

Further research may extend the framework to more diverse domains to assess how well it generalizes across tasks. Moreover, a formal theoretical account of how rule-based rewards reduce reward hacking would strengthen the framework's utility and support its adoption in industrial applications.

Overall, this paper offers notable insights into rule-based rewards for RLHF, presenting a robust methodology that enhances preference learning through automated rule extraction from reasoning chains and providing a useful reference point for future advances in AI alignment.
