Determine whether large language models can self-correct rule violations without fine-tuning

Determine whether large language models can autonomously self-correct violations of formal rules without specific fine-tuning, as assessed across reasoning tasks.

Background

The authors synthesize findings on LLM reasoning limits, noting uncertainty about self-correction capabilities in the absence of fine-tuning. This unresolved question bears on the reliability of LLMs in formal domains and their potential for robust reasoning.
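To make the question concrete, the sketch below shows the kind of prompting-only self-correction loop the question envisions: the model's output is checked against a formal rule by a programmatic verifier, and any violation is fed back to the model for revision, with no fine-tuning involved. This is an illustrative sketch, not the paper's protocol; query_model is a hypothetical stand-in for any chat-completion call, and the toy rule (valid JSON whose "answer" field is an even integer) stands in for the richer formal rules at issue.

import json
from typing import Optional, Tuple


def query_model(prompt: str) -> str:
    """Hypothetical LLM call; replace with a real chat-completion API."""
    raise NotImplementedError


def check_rule(output: str) -> Optional[str]:
    """Return None if the output satisfies the formal rule, else a violation message."""
    try:
        parsed = json.loads(output)
    except json.JSONDecodeError:
        return "Output is not valid JSON."
    answer = parsed.get("answer") if isinstance(parsed, dict) else None
    if not isinstance(answer, int) or isinstance(answer, bool):
        return 'Field "answer" must be an integer.'
    if answer % 2 != 0:
        return 'Field "answer" must be an even integer.'
    return None


def self_correct(task: str, max_rounds: int = 3) -> Tuple[str, bool]:
    """Solve `task`, feeding rule violations back to the model until the
    rule is satisfied or the round budget is exhausted (no fine-tuning)."""
    output = query_model(task)
    for _ in range(max_rounds):
        violation = check_rule(output)
        if violation is None:
            return output, True  # rule satisfied
        # Self-correction step: report the violation and ask for a revision.
        prompt = (
            f"{task}\n\nYour previous answer was:\n{output}\n"
            f"It violates this rule: {violation}\n"
            "Produce a corrected answer that satisfies the rule."
        )
        output = query_model(prompt)
    return output, check_rule(output) is None

Whether such a loop reliably converges on rule-satisfying answers, rather than merely rephrasing the same violation, is precisely the unresolved part of the question.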

References

Models are unlikely to know when they are violating formal rules, and it is unclear whether they can self-correct; with specific fine-tuning they might learn to self-correct against harmful text, and training on generated data might not be the best approach to preserving reasoning about outlier cases.

Functional Benchmarks for Robust Evaluation of Reasoning Performance, and the Reasoning Gap (2402.19450 - Srivastava et al., 29 Feb 2024) in Related Work, Understanding the bounds of reasoning, generalization, and memorization in large language models