- The paper introduces a novel distillation approach that leverages Chain-of-Thought prompting to transfer rationales from large to smaller models.
- The distilled model achieves an F1 score of approximately 0.85, a 9% improvement over the teacher model and a 13% improvement over the baseline smaller model.
- Human evaluation rates the distilled model's rationales at 91% correctness and 100% completeness, supporting transparent and effective hate speech detection.
Towards Efficient and Explainable Hate Speech Detection via Model Distillation
The paper addresses the intersection of computational efficiency and interpretability in automatic hate speech detection and makes a case for model distillation in this setting. Online hate speech is a significant problem that technology must address effectively and transparently. While large language models (LLMs) such as Llama-3-70B-Instruct have been successful at detecting hate speech, their high computational cost limits their practicality in operational settings.
In this study, the authors explore how insights from larger LLMs can be transferred to smaller, more efficient models by combining Chain-of-Thought prompting with model distillation. The key contribution is showing that the distilled models can match the explanation quality of their larger counterparts while surpassing them in classification performance.
The proposed approach extracts rationales from the Llama-3-70B-Instruct model using Few-Shot Chain-of-Thought prompting. These rationales are then used to fine-tune a smaller model, Llama-3-8B-Instruct, within a multi-task learning framework. The smaller model thus learns to produce both the classification label and the associated rationale, which makes its hate speech decisions more transparent.
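The two steps described above can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: the prompt wording, the label strings, and the names `Example`, `build_few_shot_cot_prompt`, and `to_student_record` are all assumptions made for the example, and the calls to the teacher model and the fine-tuning itself are left out.

```python
from dataclasses import dataclass


@dataclass
class Example:
    text: str       # the post to classify
    rationale: str  # step-by-step explanation (written by hand or by the teacher)
    label: str      # label strings such as "hate" / "not hate" are assumed here


def build_few_shot_cot_prompt(demos, post):
    """Assemble a few-shot Chain-of-Thought prompt for the teacher model.

    Each demonstration shows a post, its reasoning, and its label; the new
    post is appended with an open "Reasoning:" slot for the model to fill.
    """
    parts = [
        "Decide whether each message contains hate speech. "
        "Explain your reasoning step by step, then give a label."
    ]
    for d in demos:
        parts.append(f"Post: {d.text}\nReasoning: {d.rationale}\nLabel: {d.label}")
    parts.append(f"Post: {post}\nReasoning:")
    return "\n\n".join(parts)


def to_student_record(example):
    """Turn one teacher-annotated example into a fine-tuning pair.

    The completion holds both the rationale and the label, so the student
    is trained on explanation and classification together.
    """
    return {
        "prompt": f"Post: {example.text}\nReasoning:",
        "completion": f" {example.rationale}\nLabel: {example.label}",
    }


# Placeholder demonstrations; real prompts would use curated examples.
demos = [
    Example("<post A>", "<why post A is hateful>", "hate"),
    Example("<post B>", "<why post B is not hateful>", "not hate"),
]
prompt = build_few_shot_cot_prompt(demos, "<new post>")
record = to_student_record(demos[0])
```

Putting the rationale before the label in the completion is one plausible design choice: the student then generates its explanation first and conditions the label on it. The summary does not say which ordering the authors use.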
Empirical results show that the distilled model, referred to as Llama-3-8B-Distil-MetaHate, not only retains the ability to generate high-quality explanations but also outperforms its larger teacher in classification performance, achieving an F1-score of approximately 0.85. This score represents a 9% improvement over the teacher model and a 13% improvement over the baseline smaller model.
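For readers less familiar with the metric, the F1-score is the harmonic mean of precision and recall. The sketch below computes it for a binary task from toy labels; the data is purely illustrative and unrelated to the paper's results, and the summary does not state which F1 variant (binary, macro, or micro) the authors report.

```python
def f1_score(y_true, y_pred, positive=1):
    """Binary F1 for the `positive` class: 2 * P * R / (P + R)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        # No true positives: precision and recall are zero or undefined.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy labels (1 = hate, 0 = not hate): tp=2, fp=1, fn=1.
y_true = [1, 1, 1, 0, 0]
y_pred = [1, 1, 0, 1, 0]
print(round(f1_score(y_true, y_pred), 3))  # 0.667
```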
The quantitative analysis is complemented by a human evaluation of the generated rationales, which checks whether they are complete, i.e., capture all instances of hate speech, and correct, i.e., provide accurate justifications. The distilled model performs on par with the larger model, achieving a correctness rate of 91% and a completeness rate of 100%.
The practical implications of this work are substantial. The smaller models produced by this distillation process are faster and less resource-intensive to deploy, yet they retain robust explanatory capabilities. This matters in particular under regulatory frameworks such as the EU's Digital Services Act, which require transparency in content moderation on digital platforms.
The study also notes limitations, including the narrow scope of the training data and the need to examine how well the approach generalizes across languages and cultural contexts. Future research could explore alternative prompting techniques and assess the method's effectiveness across other domains and datasets.
In conclusion, the paper advances scalable and interpretable AI-driven hate speech detection by combining the strengths of LLMs with model distillation. These contributions could improve both the efficiency of such systems and the trust users place in them, and they suggest a promising path for AI in content moderation and similar applications.