Papers

Topics

Authors

Recent

View all

Assistant

AI Research Assistant

Well-researched responses based on relevant abstracts and paper content.

Custom Instructions Pro

Preferences or requirements that you'd like Emergent Mind to consider when generating responses.

Gemini 2.5 Flash

Gemini 2.5 Flash 61 tok/s

Gemini 2.5 Pro 51 tok/s Pro

GPT-5 Medium 37 tok/s Pro

GPT-5 High 33 tok/s Pro

GPT-4o 95 tok/s Pro

Kimi K2 171 tok/s Pro

GPT OSS 120B 440 tok/s Pro

Claude Sonnet 4.5 37 tok/s Pro

2000 character limit reached

Circuit-Aware Reward Training: A Mechanistic Framework for Longtail Robustness in RLHF (2509.24713v1)

Published 29 Sep 2025 in cs.LG and cs.AI

Abstract: Reinforcement Learning from Human Feedback (RLHF) reward models exhibit systematic failures on longtail distributions, leading to reward hacking and misalignment. We propose a mechanistic interpretability framework that identifies specialized neural circuits responsible for rare-event processing in reward models. Drawing from recent advances showing distributed specialization for rare tokens in LLMs\citep{liu2025no, liu2025emergent}, we hypothesize that reward models also develop functionally distinct circuits for longtail scenarios. Our theoretical framework establishes formal connections between circuit specialization, reward generalization bounds, and longtail performance. We introduce \textbf{Circuit-Aware Reward Training (CART)}, which uses circuit analysis to guide data augmentation, regularization, and ensemble strategies. This approach provides both theoretical insights into reward model failures and practical interventions for improving longtail robustness.