Determine whether large language models can self-correct rule violations without fine-tuning
Determine whether large language models can autonomously self-correct violations of formal rules without specific fine-tuning, as assessed across reasoning tasks.
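To make the question concrete, below is a minimal sketch of one way such a probe could be run: the model answers, a formal rule check flags violations, and the model is asked (without any fine-tuning) to repair its own output. Everything here is an assumption for illustration: `query_model` is a hypothetical stand-in for whatever model client is used, and the single-integer rule is only an example of a formal constraint, not one taken from the paper.

```python
"""Minimal self-correction probe sketch (assumptions: a generic chat-model
client wrapped as `query_model`, and a toy formal rule). Not the paper's
protocol; it only illustrates the violation-detect-retry loop."""

import re


def query_model(prompt: str) -> str:
    # Hypothetical stub; replace with a real model call for any provider.
    raise NotImplementedError("plug in a model client here")


def violates_rule(answer: str) -> bool:
    # Example formal rule: the reply must be exactly one integer.
    return re.fullmatch(r"-?\d+", answer.strip()) is None


def self_correction_trial(question: str, max_turns: int = 2) -> dict:
    """Ask once, then feed the violated rule back and record whether the
    model corrects itself without any weight updates."""
    answer = query_model(question)
    turns = 0
    while violates_rule(answer) and turns < max_turns:
        answer = query_model(
            f"{question}\n\nYour previous reply was: {answer}\n"
            "It violates the rule 'respond with a single integer only'. "
            "Please correct your reply."
        )
        turns += 1
    return {
        "final_answer": answer,
        "correction_turns": turns,
        "still_violating": violates_rule(answer),
    }
```

Aggregating `still_violating` over a set of reasoning tasks would give one rough measure of autonomous self-correction ability.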
References
Models are unlikely to know when they are violating formal rules, and it is unclear whether they can self-correct; with specific fine-tuning they might self-correct against harmful text, and training on generated data might not be the best approach to preserve reasoning about outlier cases.
— Functional Benchmarks for Robust Evaluation of Reasoning Performance, and the Reasoning Gap
(Srivastava et al., arXiv:2402.19450, 29 Feb 2024) in Related Work: Understanding the bounds of reasoning, generalization, and memorization in large language models