
Adapting Safe-for-Work Classifier for Malaysian Language Text: Enhancing Alignment in LLM-Ops Framework (2407.20729v1)

Published 30 Jul 2024 in cs.CL

Abstract: As LLMs become increasingly integrated into operational workflows (LLM-Ops), there is a pressing need for effective guardrails to ensure safe and aligned interactions, including the ability to detect potentially unsafe or inappropriate content across languages. However, existing safe-for-work classifiers are primarily focused on English text. To address this gap for the Malaysian language, we present a novel safe-for-work text classifier tailored specifically for Malaysian language content. By curating and annotating a first-of-its-kind dataset of Malaysian text spanning multiple content categories, we trained a classification model capable of identifying potentially unsafe material using state-of-the-art natural language processing techniques. This work represents an important step in enabling safer interactions and content filtering to mitigate potential risks and ensure responsible deployment of LLMs. To maximize accessibility and promote further research towards enhancing alignment in LLM-Ops for the Malaysian context, the model is publicly released at https://huggingface.co/malaysia-ai/malaysian-sfw-classifier.

References (10)
  1. A holistic approach to undesired content detection in the real world, 2023.
  2. Attention is all you need, 2023.
  3. Facilitating pornographic text detection for open-domain dialogue systems via knowledge distillation of large language models, 2024.
  4. Label Studio: Data labeling software, 2020–2022. Open-source software available from https://github.com/heartexlabs/label-studio.
  5. MaLLaM – Malaysia Large Language Model, 2024.
  6. LLM2Vec: Large language models are secretly powerful text encoders, 2024.
  7. UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction. ArXiv e-prints, February 2018.
  8. UMAP: Uniform manifold approximation and projection. The Journal of Open Source Software, 3(29):861, 2018.
  9. Zolkepli Husein. Malaya. https://github.com/huseinzol05/malaya, 2018.
  10. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.

Summary

  • The paper introduces a novel adaptation of a safe-for-work classifier tailored to Malaysian text, addressing critical gaps in LLM-Ops content moderation.
  • It integrates manual annotation with automated techniques like knowledge distillation, centroid-based filtering, and sentiment polarity filtering to refine data quality.
  • Empirical results reveal the Malaysian-Mistral model achieving 87.68% accuracy, underscoring its effectiveness in improving multilingual content safety.

Adapting Safe-for-Work Classifier for Malaysian Language Text: Enhancing Alignment in LLM-Ops Framework

The paper "Adapting Safe-for-Work Classifier for Malaysian Language Text: Enhancing Alignment in LLM-Ops Framework" by Aisyah Razak et al. addresses a significant gap in the field of NLP by developing a Safe-for-Work (SFW) text classifier specifically tailored for the Malaysian language. As LLMs become more prevalent in operational workflows (LLM-Ops), the need for effective moderation of potentially harmful or inappropriate content becomes critical. This paper introduces a novel approach to ensuring safe interactions within LLM-Ops frameworks by focusing on the Malaysian language context.

Introduction

The increasing integration of LLMs into chatbot applications and dialogue systems has made robust content filtering a necessity, since harmful material is an inherent risk of training data sourced from the internet. Previous work on AI moderation has predominantly centered on English text, with comparatively little effort directed at non-English languages such as Malay. To address this gap, the authors create and annotate a dataset of Malaysian text covering several categories of harmful content: pornography, harassment, sexist remarks, racism, religious insults, self-harm, psychiatric or mental illness remarks, and violence.

Taxonomy and Data Source

The paper introduces a meticulous taxonomy for categorizing harmful content, which serves as a guideline for implementing guardrails in LLM systems. The taxonomy includes labels such as pornography, harassment, sexist, racist, religious insult, self-harm, psychiatric or mental illness, violence, and safe-for-work content. The data sources span social media platforms, public forums, and existing datasets, ensuring a comprehensive representation of harmful text in the Malaysian context.

Methodology

The authors employ a multi-faceted methodology, combining manual labeling, knowledge distillation from LLMs, centroid-based filtering, sentiment polarity filtering, and active learning, to construct a reliably labeled dataset. Notably, knowledge distillation leverages pre-trained LLMs such as mistral-7b and MaLLaM-small to generate initial labels, which are then refined through manual annotation and active learning cycles.
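The distillation step can be pictured as a loop that asks a teacher LLM to assign one taxonomy label to each raw text. The sketch below is illustrative only: `ask_teacher` is a stub standing in for a real call to a model such as mistral-7b or MaLLaM-small, and the paper's actual prompt is not reproduced here.

```python
# Sketch of knowledge-distillation labeling against the paper's taxonomy.
# `ask_teacher` is a hypothetical stand-in for a real LLM call.
LABELS = [
    "pornography", "harassment", "sexist", "racist", "religious insult",
    "self-harm", "psychiatric or mental illness", "violence", "safe for work",
]

def ask_teacher(text: str) -> str:
    """Stub for the teacher LLM; a real teacher would be prompted with the
    taxonomy and asked to return exactly one label for the input text."""
    return "harassment" if "bodoh" in text.lower() else "safe for work"

def distill_labels(texts):
    labeled = []
    for text in texts:
        label = ask_teacher(text)
        if label in LABELS:  # discard malformed teacher outputs
            labeled.append((text, label))
    return labeled
```

The guard against malformed outputs matters in practice, since LLM teachers occasionally return labels outside the requested set.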

Manual Labeling

Approximately 200 data points were manually annotated using Label Studio, ensuring high-quality baseline data for initial training phases.

Knowledge Distillation and Centroid-Based Filtering

The annotated data is augmented via knowledge distillation from LLMs using a specific prompt to achieve consistent labeling. Further, centroid-based filtering is employed to eliminate outliers based on Euclidean distance computations, thereby enhancing the dataset's coherence.
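The centroid-based filter can be sketched in plain NumPy: compute each class centroid in embedding space and drop points whose Euclidean distance to their own centroid exceeds a threshold. The threshold here is an illustrative assumption, not a value from the paper.

```python
import numpy as np

def centroid_filter(embeddings, labels, max_dist=2.0):
    """Keep only points within `max_dist` (Euclidean) of their class centroid;
    points farther away are treated as outliers and dropped."""
    embeddings = np.asarray(embeddings, dtype=float)
    labels = np.asarray(labels)
    keep = np.zeros(len(labels), dtype=bool)
    for lab in np.unique(labels):
        mask = labels == lab
        centroid = embeddings[mask].mean(axis=0)
        dists = np.linalg.norm(embeddings[mask] - centroid, axis=1)
        keep[mask] = dists <= max_dist
    return keep
```

A usage example: five points of class "a" clustered near the origin plus one far outlier, and two tight points of class "b"; only the outlier is removed.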

Sentiment Polarity Filtering

To refine the data further, sentiment polarity filtering is applied, removing positively or neutrally connoted sentences from categories such as harassment or self-harm, ensuring the dataset contains only relevant negative instances.
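A minimal sketch of the polarity filter follows, using a toy word-list polarity score in place of the paper's sentiment model; the lexicon, the threshold, and the set of harm categories shown are illustrative assumptions.

```python
# Toy polarity filter: keep only negatively connoted sentences in
# harm-related categories. A real pipeline would use a trained sentiment
# model rather than this illustrative word list.
NEGATIVE_WORDS = {"hate", "hurt", "kill", "worthless", "stupid"}
POSITIVE_WORDS = {"love", "great", "happy", "recover", "support"}

def polarity(text: str) -> int:
    tokens = text.lower().split()
    neg = sum(t in NEGATIVE_WORDS for t in tokens)
    pos = sum(t in POSITIVE_WORDS for t in tokens)
    return neg - pos

def filter_negative_only(rows, harm_labels=frozenset({"harassment", "self-harm"})):
    """Drop positive/neutral sentences from harm categories; rows in other
    categories pass through unchanged."""
    return [
        (text, label) for text, label in rows
        if label not in harm_labels or polarity(text) > 0
    ]
```

Sentences outside the harm categories are untouched, matching the description above: the filter only removes positively or neutrally connoted instances from categories such as harassment or self-harm.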

Active Learning

Active learning is implemented iteratively, training the classifier on initially labeled data, using it to predict labels on unlabeled data, and refining the model until satisfactory performance is achieved.
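The iterate-predict-refine cycle can be sketched as a self-training loop. The nearest-centroid classifier and the distance-margin confidence below are stand-ins for the paper's actual model and selection criterion; they only illustrate the loop structure.

```python
import numpy as np

def nearest_centroid_predict(X_train, y_train, X):
    """Tiny stand-in classifier: label each point by its nearest class
    centroid, with the distance margin between the two nearest centroids
    used as a confidence score."""
    classes = sorted(set(y_train))
    cents = np.array([
        np.mean([x for x, y in zip(X_train, y_train) if y == c], axis=0)
        for c in classes
    ])
    preds, confs = [], []
    for x in np.asarray(X, dtype=float):
        d = np.linalg.norm(cents - x, axis=1)
        order = np.argsort(d)
        preds.append(classes[order[0]])
        confs.append(float(d[order[1]] - d[order[0]]))
    return preds, confs

def active_learning(labeled, unlabeled, rounds=3, min_conf=1.0):
    """Each round: train on the labeled set, predict on the pool, and absorb
    confident predictions into the labeled set; low-confidence items remain
    in the pool (in practice these would go to human annotators)."""
    labeled, pool = list(labeled), list(unlabeled)
    for _ in range(rounds):
        if not pool:
            break
        X = [x for x, _ in labeled]
        y = [lab for _, lab in labeled]
        preds, confs = nearest_centroid_predict(X, y, pool)
        confident = {i for i, c in enumerate(confs) if c >= min_conf}
        labeled += [(pool[i], preds[i]) for i in sorted(confident)]
        pool = [p for i, p in enumerate(pool) if i not in confident]
    return labeled, pool
```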

Results and Analysis

The performance of various models is evaluated on the Malaysian NSFW dataset using accuracy, precision, recall, and F1 score. The mesolitica/malaysian-mistral-191M-MLM model emerged as the best performer, achieving an accuracy of 87.68% along with strong precision, recall, and F1 scores. The authors attribute the model's robustness to the LLM2Vec approach, which enables it to act as a powerful text encoder in an unsupervised fashion.
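For reference, the reported metrics can be computed as follows. This pure-Python sketch shows accuracy and macro-averaged precision, recall, and F1, equivalent to what scikit-learn's `classification_report` provides; it is not the paper's evaluation code.

```python
from collections import Counter

def macro_metrics(y_true, y_pred):
    """Return accuracy plus macro-averaged precision, recall, and F1."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted p, but true class was t
            fn[t] += 1
    precs, recs, f1s = [], [], []
    for lab in labels:
        prec = tp[lab] / (tp[lab] + fp[lab]) if tp[lab] + fp[lab] else 0.0
        rec = tp[lab] / (tp[lab] + fn[lab]) if tp[lab] + fn[lab] else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        precs.append(prec); recs.append(rec); f1s.append(f1)
    acc = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    n = len(labels)
    return acc, sum(precs) / n, sum(recs) / n, sum(f1s) / n
```

Macro averaging weights every class equally, which matters for a taxonomy like this one where the safe-for-work class can dominate the label distribution.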

The authors also provide a comprehensive analysis through 2D embedding visualizations using UMAP and topic modeling via TFIDF-LDA, capturing the nuanced vocabulary associated with each label. Word clouds further illustrate the predominant terms within each category, highlighting the model's efficacy in identifying harmful content types.
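The TFIDF-LDA topic analysis starts from per-document TF-IDF weights. A minimal TF-IDF computation, using the smoothed IDF formula that scikit-learn also applies by default, can be sketched as below; tokenization by whitespace is a simplifying assumption.

```python
import math
from collections import Counter

def tfidf(docs):
    """TF-IDF with smoothed IDF: idf = ln((1 + N) / (1 + df)) + 1.
    Returns one {term: weight} dict per input document."""
    n = len(docs)
    tokenized = [doc.lower().split() for doc in docs]
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))  # document frequency counts each doc once
    out = []
    for tokens in tokenized:
        tf = Counter(tokens)
        total = len(tokens)
        out.append({
            term: (count / total) * (math.log((1 + n) / (1 + df[term])) + 1)
            for term, count in tf.items()
        })
    return out
```

Terms concentrated in one label's documents receive higher weights than terms shared across labels, which is what makes the per-category vocabulary in the word clouds stand out.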

Conclusion and Implications

This paper represents a pioneering effort in developing an SFW classifier for the Malaysian language, contributing significantly to the field of AI safety and content moderation. The methodology demonstrates a sophisticated integration of manual annotation, advanced NLP techniques, and iterative learning processes, setting a precedent for future work in non-English language NLP applications.

The implications of this research are multifaceted. Practically, it provides a framework for deploying LLMs in diverse linguistic contexts with enhanced safety measures, crucial for applications in multilingual societies. Theoretically, it opens avenues for further exploration into LLM adaptations across varying languages and cultural contexts, potentially advancing the state of cross-lingual NLP alignment in AI safety.

As LLMs continue to evolve, the focus on inclusive, multilingual, and culturally aware safety mechanisms will be paramount. Future developments may include refining classifiers to better handle nuanced variations in harmful content across languages and integrating these systems seamlessly into broader LLM-Ops frameworks.

This paper sets a critical foundational step towards safer AI deployment in the Malaysian context, paving the way for more inclusive and effective content moderation mechanisms in the global landscape of LLMs.
