- The paper introduces a corpus of approximately 39,000 Hindi-English code-mixed social media posts annotated with a hierarchical scheme to identify different levels and types of aggression.
- Analysis of the corpus reveals over 72% inter-annotator agreement for top-level aggression, highlights distinct aggression patterns between Facebook and Twitter, and observes associations between code-mixing and aggressive language.
- This annotated dataset serves as a valuable resource for training and evaluating machine learning models for automatic aggression detection in code-mixed languages and understanding the linguistic features of online aggression.
Overview of Aggression-Annotated Corpus of Hindi-English Code-Mixed Data
The paper "Aggression-annotated Corpus of Hindi-English Code-mixed Data" by Kumar et al. contributes a resource to natural language processing, specifically to the analysis of social media discourse. As trolling, cyberbullying, and hate speech become more prevalent on online platforms, resources for automatically detecting and managing such behavior become essential. The authors address this need by presenting a corpus annotated with aggression tags, intended to support research on automatic aggression-detection systems.
Methodology
The corpus is built from Twitter and Facebook, drawing on content widely discussed among Indian users, often in Hindi or Hindi-English code-mixed language. It contains approximately 18,000 tweets and 21,000 Facebook comments covering three aggression levels: overt, covert, and non-aggressive. The annotation scheme is hierarchical, with these three top-level categories further divided into ten subcategories. This larger tagset supports a more granular analysis of aggressive discourse, including facets such as physical threats, sexual aggression, and various identity-based aggression types.
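A two-level scheme of this kind can be represented as a small record type. The following is a minimal sketch, not the authors' data format: the class names, field names, and the `EXAMPLE_SUBTYPES` set are hypothetical, and only the three facets named in this summary are included because the full list of ten subcategories is not given here. The rule that non-aggressive posts carry no subcategory is likewise an assumption made for illustration.

```python
from dataclasses import dataclass, field
from enum import Enum


class AggressionLevel(Enum):
    """Top-level categories named in the summary."""
    OVERT = "overt"
    COVERT = "covert"
    NON_AGGRESSIVE = "non-aggressive"


# Illustrative subset only: the summary names these facets but does not
# list all ten second-level subcategories.
EXAMPLE_SUBTYPES = {"physical_threat", "sexual_aggression", "identity_based"}


@dataclass
class AnnotatedPost:
    text: str
    source: str  # e.g. "twitter" or "facebook"
    level: AggressionLevel
    subtypes: set = field(default_factory=set)

    def __post_init__(self):
        unknown = self.subtypes - EXAMPLE_SUBTYPES
        if unknown:
            raise ValueError(f"unknown subtypes: {sorted(unknown)}")
        # Assumption: second-level tags apply only to aggressive posts.
        if self.level is AggressionLevel.NON_AGGRESSIVE and self.subtypes:
            raise ValueError("non-aggressive posts cannot carry subtypes")


# Usage: a hypothetical overtly aggressive tweet with one subtype.
post = AnnotatedPost(
    text="(placeholder text)",
    source="twitter",
    level=AggressionLevel.OVERT,
    subtypes={"physical_threat"},
)
```

Keeping the top level as an enum and the second level as a set lets a single post carry several subcategories while still being validated against a fixed tagset.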
Results and Analysis
Annotation was carried out by both internal annotators and crowdworkers recruited through the CrowdFlower platform. After the annotation guidelines were refined, inter-annotator agreement for top-level aggression classification rose to over 72%. The corpus reveals distinct communication patterns on the two platforms: overt aggression is more common on Facebook, while covert aggression is more prevalent on Twitter. Political aggression dominates the dataset, and the authors observe associations between code-mixing and aggressive language.
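To make the agreement figure concrete, the sketch below computes two common agreement measures for a pair of annotators: raw percent agreement and Cohen's kappa, which corrects for agreement expected by chance. This summary does not state which coefficient the authors used, so the choice of measures is an assumption, and the label lists are invented toy data rather than corpus annotations.

```python
from collections import Counter


def percent_agreement(labels_a, labels_b):
    """Fraction of items on which two annotators chose the same label."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)


def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two annotators."""
    n = len(labels_a)
    observed = percent_agreement(labels_a, labels_b)
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # Expected agreement if each annotator labeled independently
    # according to their own label distribution.
    expected = sum(
        counts_a[label] * counts_b[label]
        for label in set(labels_a) | set(labels_b)
    ) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1.0 - expected)


# Toy data (hypothetical), using the three top-level categories.
annotator_1 = ["overt", "covert", "non", "non", "overt"]
annotator_2 = ["overt", "non", "non", "non", "covert"]

agreement = percent_agreement(annotator_1, annotator_2)
kappa = cohens_kappa(annotator_1, annotator_2)
print(f"percent agreement: {agreement:.3f}, kappa: {kappa:.3f}")
```

Chance-corrected measures are generally lower than raw agreement when one label dominates, which matters for a corpus where the label distribution differs by platform.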
Implications
The annotated corpus has both practical and theoretical implications. Practically, it provides a foundational dataset for developing algorithms that automatically identify different aggression types in code-mixed language, a task whose difficulty is reflected in initial F1 scores of around 0.70. Theoretically, the work emphasizes that aggression analysis requires attention to pragmatic linguistic features and cannot be reduced to simple sentiment detection. The corpus can also be used to explore the interplay between linguistic features and social dynamics on online platforms.
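For readers unfamiliar with the metric, the following sketch shows how a per-class and macro-averaged F1 score can be computed for a three-way aggression classifier. The summary does not say which averaging the reported scores use, so macro averaging is an assumption here, and the gold and predicted labels are invented toy data.

```python
def f1_per_class(gold, pred):
    """Return a dict mapping each label to its F1 score."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have equal length")
    scores = {}
    for label in sorted(set(gold) | set(pred)):
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the denominator is 0.
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


def macro_f1(gold, pred):
    """Unweighted mean of per-class F1 scores."""
    scores = f1_per_class(gold, pred)
    return sum(scores.values()) / len(scores)


# Toy data (hypothetical): the classifier misses the one covert example.
gold = ["overt", "covert", "non", "non"]
pred = ["overt", "non", "non", "non"]

per_class = f1_per_class(gold, pred)
overall = macro_f1(gold, pred)
print(per_class, round(overall, 3))
```

Macro averaging weights each category equally, so a classifier that ignores a rarer category such as covert aggression is penalized even if overall accuracy looks acceptable.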
Future Directions
Future work prompted by this research may include improving machine learning models trained on this dataset to raise classification accuracy. Expanding the corpus to other Indian languages, or incorporating multimodal data such as the images and videos common on social media, could further enrich insights into aggression detection. Cross-cultural studies might also use similarly annotated resources to explore how aggression manifests across linguistic and cultural boundaries.
In conclusion, while this corpus marks a significant stride towards automatic aggression detection in Hindi-English code-mixed data, it also exposes challenges that necessitate continuous exploration, both in linguistic nuance and computational strategies. Researchers can build upon this work to uncover further insights into the manifestations and mitigations of online aggression.