One-step and Two-step Classification for Abusive Language Detection on Twitter (1706.01206v1)

Published 5 Jun 2017 in cs.CL

Abstract: Automatic abusive language detection is a difficult but important task for online social media. Our research explores a two-step approach of performing classification on abusive language and then classifying into specific types and compares it with one-step approach of doing one multi-class classification for detecting sexist and racist languages. With a public English Twitter corpus of 20 thousand tweets in the type of sexism and racism, our approach shows a promising performance of 0.827 F-measure by using HybridCNN in one-step and 0.824 F-measure by using logistic regression in two-steps.

Citations (347)

Summary

The paper's main finding is that a one-step HybridCNN model achieves an F-measure of 0.827, nearly matching the 0.824 obtained by the two-step approach using logistic regression.
The paper employs several CNN architectures: CharCNN, WordCNN, and a proposed HybridCNN that combines character-level and word-level features.
The paper implies that while both methods perform comparably, the two-step process offers modular scalability and flexibility for addressing diverse abusive language challenges.

A Methodological Analysis of One-step and Two-step Classification for Abusive Language Detection on Twitter

The paper "One-step and Two-step Classification for Abusive Language Detection on Twitter" by Ji Ho Park and Pascale Fung studies the automated classification of abusive language on Twitter. Detecting abusive language on social media is a difficult but important task for keeping a platform safe and respectful. The authors compare one-step and two-step classification approaches for detecting sexist and racist language.

Methodological Framework

The authors use a public English Twitter corpus of 20,000 tweets, annotated as sexist, racist, or neither. The paper compares a one-step multi-class classification strategy against a two-step approach. The one-step model assigns each tweet directly to "none," "sexism," or "racism," while the two-step model first decides whether a tweet is abusive and then, for abusive tweets only, distinguishes sexist from racist content. Splitting the task this way could improve precision, because the first classifier only has to solve a simpler binary problem.
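The difference between the two pipelines can be sketched in a few lines of Python. This is only an illustration of the control flow, not the authors' implementation: the keyword-based functions `toy_one_step`, `toy_is_abusive`, and `toy_abuse_type` and the placeholder tokens `slur_a` and `slur_b` are hypothetical stand-ins for trained classifiers.

```python
from typing import Callable, List

# Hypothetical stand-ins for trained models. Keyword rules are used only so
# the sketch runs without data; the paper trains real classifiers.
def toy_one_step(tweet: str) -> str:
    """One-step: a single three-way decision."""
    text = tweet.lower()
    if "slur_a" in text:
        return "sexism"
    if "slur_b" in text:
        return "racism"
    return "none"

def toy_is_abusive(tweet: str) -> bool:
    """Two-step, stage 1: binary abusive / not abusive."""
    text = tweet.lower()
    return "slur_a" in text or "slur_b" in text

def toy_abuse_type(tweet: str) -> str:
    """Two-step, stage 2: sexism vs. racism, seen only by abusive tweets."""
    return "sexism" if "slur_a" in tweet.lower() else "racism"

def two_step_predict(
    tweet: str,
    is_abusive: Callable[[str], bool],
    abuse_type: Callable[[str], str],
) -> str:
    """Chain the two stages; non-abusive tweets never reach stage 2."""
    if not is_abusive(tweet):
        return "none"
    return abuse_type(tweet)

tweets: List[str] = ["have a nice day", "you slur_a", "typical slur_b"]
one_step = [toy_one_step(t) for t in tweets]
two_step = [two_step_predict(t, toy_is_abusive, toy_abuse_type) for t in tweets]
print(one_step)
print(two_step)
```

Because `two_step_predict` takes its two stages as arguments, either stage can be swapped for a different model without touching the other.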

The core classifiers are convolutional neural network (CNN) architectures: CharCNN, WordCNN, and the newly proposed HybridCNN, which takes both character-level and word-level inputs so that it can capture features at both granularities. Word-level inputs are represented with pre-trained word2vec embeddings. Logistic Regression (LR) over character n-grams serves as a comparative baseline.
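As a rough illustration of the character n-gram features behind the LR baseline, the sketch below counts overlapping character substrings of a tweet. The function name `char_ngrams` and the default range of 1 to 3 characters are assumptions made for this example; the summary does not state the n-gram range or weighting the authors used.

```python
from collections import Counter

def char_ngrams(text: str, n_min: int = 1, n_max: int = 3) -> Counter:
    """Count every character n-gram with n_min <= n <= n_max.

    The range defaults are illustrative assumptions, not values from the paper.
    """
    counts: Counter = Counter()
    for n in range(n_min, n_max + 1):
        # Slide a window of width n across the text.
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts

# Small example: unigrams and bigrams of a four-character string.
features = char_ngrams("abab", n_min=1, n_max=2)
print(features)
```

Counts like these would typically be turned into a sparse vector per tweet and passed to a linear classifier. Character-level features tolerate misspellings and obfuscated spellings better than whole-word features.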

Experimental Findings

The paper reports that one-step classification with HybridCNN yields an F-measure of 0.827, nearly identical to the 0.824 obtained by the two-step approach with logistic regression. The difference between the two approaches is therefore marginal. By using character and word inputs together, HybridCNN outperforms the simpler WordCNN and CharCNN models. Overall, the hybrid architecture in the one-step setting and logistic regression in the two-step setting both achieve strong precision and recall in identifying and categorizing abusive language.
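For readers unfamiliar with the metric, the F-measure is the harmonic mean of precision and recall. The sketch below computes it per class and then averages across classes weighted by class frequency. The weighted averaging is an assumption made for illustration, since the summary does not say how the per-class scores were combined, and the labels in `y_true` and `y_pred` are made-up toy data.

```python
from typing import List

def f1_per_class(y_true: List[str], y_pred: List[str], label: str) -> float:
    """F1 for one class: harmonic mean of its precision and recall."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def weighted_f1(y_true: List[str], y_pred: List[str]) -> float:
    """Average per-class F1, weighting each class by its share of y_true.

    Weighted averaging is an assumption here, not a detail from the paper.
    """
    total = len(y_true)
    return sum(
        f1_per_class(y_true, y_pred, label) * y_true.count(label) / total
        for label in sorted(set(y_true))
    )

# Toy labels, invented for the example.
y_true = ["none", "none", "sexism", "racism"]
y_pred = ["none", "sexism", "sexism", "racism"]
score = weighted_f1(y_true, y_pred)
print(round(score, 3))
```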

Implications and Future Directions

The paper emphasizes that while different classification methods demonstrate comparable performances, the two-step approach holds potential advantages in scalability and flexibility, especially with datasets where abusive language spans multiple specific topics. Importantly, the two-step strategy offers a modular framework that can integrate various classifiers optimized for distinct parts of the classification process.

Further research could explore hybrid systems that dynamically adjust between one-step and two-step methodologies based on data characteristics or predicted risk levels of content, thereby enhancing computational efficiency. Expanding training datasets with more diverse and representative samples could further refine model accuracy and address potential biases inherent in the training data.

In conclusion, this paper offers useful insights into the design of models for abusive language detection on social media platforms. It shows that one-step and two-step classification perform comparably, and it sets the stage for future work on more adaptive and robust models suited to the varied nature of online discourse.