Smaller Large Language Models Can Do Moral Self-Correction (2410.23496v1)

Published 30 Oct 2024 in cs.CL

Abstract: Self-correction is one of the most amazing emerging capabilities of LLMs, enabling LLMs to self-modify an inappropriate output given a natural language feedback which describes the problems of that output. Moral self-correction is a post-hoc approach correcting unethical generations without requiring a gradient update, making it both computationally lightweight and capable of preserving the LLMing ability. Previous works have shown that LLMs can self-debias, and it has been reported that small models, i.e., those with less than 22B parameters, are not capable of moral self-correction. However, there is no direct proof as to why such smaller models fall short of moral self-correction, though previous research hypothesizes that larger models are skilled in following instructions and understanding abstract social norms. In this paper, we empirically validate this hypothesis in the context of social stereotyping, through meticulous prompting. Our experimental results indicate that (i) surprisingly, 3.8B LLMs with proper safety alignment fine-tuning can achieve very good moral self-correction performance, highlighting the significant effects of safety alignment; and (ii) small LLMs are indeed weaker than larger-scale models in terms of comprehending social norms and self-explanation through CoT, but all scales of LLMs show bad self-correction performance given unethical instructions.

PDF HTML Abstract

A Comprehensive Overview of "Smaller LLMs Can Do Moral Self-Correction"

The paper "Smaller LLMs Can Do Moral Self-Correction" investigates the moral self-correction capability of LLMs with less than 22 billion parameters. While it has been noted in the literature that smaller LLMs appear insufficient for moral self-correction, this research provides an empirical validation of this capability in smaller models through fine-tuned safety alignment.

Key Contributions and Findings

The authors challenge the prevailing assumption that models smaller than 22 billion parameters are inept at moral self-correction. Through methodological prompting, they reveal significant findings:

Model Scale and Moral Self-Correction: Contrary to earlier beliefs, the paper shows that models with as few as 3.8 billion parameters can execute moral self-correction when appropriately fine-tuned with safety alignment techniques. This indicates the substantial role of safety alignment in enhancing moral self-correction without compromising the intrinsic LLMing abilities.
Instruction Following and Recognition of Norms: The research explores the ability of small LLMs to understand abstract social norms, follow instructions, and explain decisions in a Chain-of-Thought (CoT) manner. Tests conducted using prompts structured around specificity, negation, and CoT demonstrate that smaller models can indeed comprehend and act upon ethical instructions, albeit with lower effectiveness than larger models.
Effectiveness of Safety Alignment: The paper empirically validates that safety-aligned small LLMs, notably the phi-3 3.8B model, outperform some larger models when subjected to ethical decision-making tasks. The findings propose a model size threshold for the moral self-correction capability around 3.8 billion parameters, primarily facilitated by safety alignment.

Experimental Framework

The experimental design deploys a variety of LLM scales, including GPT-2, OLMo, Phi-3, and Llama-2, across a spectrum from 355 million to 70 billion parameters. Evaluation metrics are conducted using well-established benchmarks like Winogender for gender bias and BBQ for multiple forms of bias, each assessing different dimensions of bias and ethical reasoning.

The authors apply quantization techniques to enhance computational efficiency, especially with larger models, indicating that even optimized smaller models can perform ethically salient tasks with efficacy under aligned conditions.

Implications and Future Directions

The findings of this paper bear considerable implications for both theoretical exploration and practical applications:

Theoretical Insights: This research advances the understanding of model scalability in alignment with ethical instructions, offering a nuanced perspective that challenges the conception of a linear relationship between model size and moral self-correction capacity.
Practical Applications: For applications requiring ethical interaction, such as dialogue systems and decision support tools, smaller LLMs provide a more resource-efficient alternative to larger models, given proper safety alignment.
Future Research Directions: The manuscript suggests future inquiries might investigate the varying behavioral dynamics of LLMs across tasks when confronted with unethical instructions. Furthermore, this paper implies that the augmentation of ethical alignment techniques could significantly enhance smaller models' moral reasoning.

Conclusion

The paper "Smaller LLMs Can Do Moral Self-Correction" illuminates the overlooked potential of smaller LLMs in moral self-correction under the guidance of safety alignment methods. By challenging preconceived notions about scale and effectiveness, it opens prospects for more resource-efficient deployment of LLMs with ethical awareness, advocating continued research into optimizing alignment methodologies across different model sizes.

PDF Markdown Bookmark Chat (Pro)

Authors (4)

Guangliang Liu (10 papers)
Zhiyu Xue (10 papers)
Rongrong Wang (48 papers)
Kristen Marie Johnson (6 papers)

Smaller Large Language Models Can Do Moral Self-Correction (2410.23496v1)

A Comprehensive Overview of "Smaller LLMs Can Do Moral Self-Correction"

Key Contributions and Findings

Experimental Framework

Implications and Future Directions

Conclusion

Related Papers

YouTube

HackerNews