
Aggression-annotated Corpus of Hindi-English Code-mixed Data (1803.09402v1)

Published 26 Mar 2018 in cs.CL

Abstract: As the interaction over the web has increased, incidents of aggression and related events like trolling, cyberbullying, flaming, hate speech, etc. too have increased manifold across the globe. While most of these behaviour like bullying or hate speech have predated the Internet, the reach and extent of the Internet has given these an unprecedented power and influence to affect the lives of billions of people. So it is of utmost significance and importance that some preventive measures be taken to provide safeguard to the people using the web such that the web remains a viable medium of communication and connection, in general. In this paper, we discuss the development of an aggression tagset and an annotated corpus of Hindi-English code-mixed data from two of the most popular social networking and social media platforms in India, Twitter and Facebook. The corpus is annotated using a hierarchical tagset of 3 top-level tags and 10 level 2 tags. The final dataset contains approximately 18k tweets and 21k facebook comments and is being released for further research in the field.

Citations (167)

Summary

  • The paper introduces a corpus of approximately 39,000 Hindi-English code-mixed social media posts annotated with a hierarchical scheme to identify different levels and types of aggression.
  • Analysis of the corpus reveals over 72% inter-annotator agreement for top-level aggression, highlights distinct aggression patterns between Facebook and Twitter, and observes associations between code-mixing and aggressive language.
  • This annotated dataset serves as a valuable resource for training and evaluating machine learning models for automatic aggression detection in code-mixed languages and understanding the linguistic features of online aggression.

Overview of Aggression-Annotated Corpus of Hindi-English Code-Mixed Data

The paper "Aggression-annotated Corpus of Hindi-English Code-mixed Data" by Kumar et al. contributes a resource to natural language processing, specifically to the analysis of social media discourse. As trolling, cyberbullying, and hate speech become more prevalent on online platforms, resources for automatically detecting and managing such behavior become essential. The authors respond by presenting a corpus annotated with aggression tags, intended to support research on automatic aggression-detection systems.

Methodology

The corpus is built from Twitter and Facebook, two of the most popular social media platforms in India, where users often write in Hindi or in Hindi-English code-mixed language. The dataset contains approximately 18,000 tweets and 21,000 Facebook comments, covering overtly aggressive, covertly aggressive, and non-aggressive posts. Annotation follows a hierarchical scheme with three top-level tags and ten second-level tags. This tagset supports a more granular analysis of aggressive discourse, including physical threats, sexual aggression, and various identity-based aggression types.
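A two-level annotation of this kind could be represented along the following lines. This is a minimal sketch, not the authors' data format: the class and field names are hypothetical, and the second-level tags listed are only the three facets mentioned above, not the paper's full set of ten.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class AggressionLevel(Enum):
    """Top-level tags named in the summary above."""
    OVERT = "overtly aggressive"
    COVERT = "covertly aggressive"
    NON_AGGRESSIVE = "non-aggressive"


# Illustrative second-level tags only: these are the three facets the
# summary mentions, not the paper's full list of ten.
ILLUSTRATIVE_SUBTYPES = {
    "physical threat",
    "sexual aggression",
    "identity-based aggression",
}


@dataclass(frozen=True)
class AnnotatedPost:
    text: str
    platform: str                  # "twitter" or "facebook"
    level: AggressionLevel
    subtype: Optional[str] = None  # hypothetical field for a second-level tag

    def __post_init__(self):
        # Reject values outside the (assumed) closed vocabularies.
        if self.platform not in {"twitter", "facebook"}:
            raise ValueError(f"unknown platform: {self.platform!r}")
        if self.subtype is not None and self.subtype not in ILLUSTRATIVE_SUBTYPES:
            raise ValueError(f"unknown second-level tag: {self.subtype!r}")


def level_counts_by_platform(posts):
    """Count top-level tags separately for each platform."""
    counts = {}
    for post in posts:
        counts.setdefault(post.platform, Counter())[post.level] += 1
    return counts
```

Keeping the top-level tag as a closed enumeration and the second-level tag as an optional field mirrors the hierarchy: every post gets a top-level tag, and finer distinctions are layered on top.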

Results and Analysis

The annotation was carried out by both internal annotators and crowdsourced workers on the CrowdFlower platform. After the annotation guidelines were refined, inter-annotator agreement rose to over 72% for top-level aggression classification. The corpus shows distinct patterns on the two platforms: overt aggression is more common on Facebook, while covert aggression prevails on Twitter. Political aggression predominates in the dataset, and the authors observe associations between code-mixing and aggressive language.
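To make the agreement figure concrete, the sketch below computes agreement between two annotators on top-level tags. The summary does not say which agreement measure underlies the 72% figure, so both raw percent agreement and Cohen's kappa (a chance-corrected alternative) are shown as assumptions; the labels are toy data, not drawn from the corpus.

```python
from collections import Counter


def percent_agreement(labels_a, labels_b):
    """Fraction of items on which two annotators chose the same tag."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)


def cohen_kappa(labels_a, labels_b):
    """Agreement between two annotators, corrected for chance agreement."""
    observed = percent_agreement(labels_a, labels_b)
    n = len(labels_a)
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: probability that both pick the same tag independently.
    expected = sum(freq_a[tag] * freq_b[tag] for tag in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical tag throughout
    return (observed - expected) / (1.0 - expected)


# Toy top-level tags: O = overt, C = covert, N = non-aggressive.
annotator_1 = ["O", "O", "C", "N", "N", "N", "C", "O"]
annotator_2 = ["O", "C", "C", "N", "N", "O", "C", "O"]

raw = percent_agreement(annotator_1, annotator_2)
kappa = cohen_kappa(annotator_1, annotator_2)
```

On this toy data the annotators match on 6 of 8 items, so raw agreement is 0.75, while kappa comes out lower because some of those matches would be expected by chance.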

Implications

The annotated corpus has both practical and theoretical implications. Practically, it provides a foundational dataset for developing algorithms that automatically identify different aggression types in code-mixed language, a difficult task, as initial F1 scores of around 0.70 indicate. Theoretically, the work emphasizes that aggression analysis requires an understanding of pragmatic linguistic features and cannot be reduced to simple sentiment detection. The corpus can also be used to explore the interplay between linguistic features and social dynamics on online platforms.
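For reference, an F1 score over the three top-level tags can be computed as sketched below. The summary does not state how the reported scores were averaged, so macro-averaging is an assumption here, and the gold and predicted labels are toy data rather than results from the paper.

```python
def per_class_f1(gold, predicted, label):
    """F1 for one label, computed as 2*TP / (2*TP + FP + FN)."""
    tp = sum(g == label and p == label for g, p in zip(gold, predicted))
    fp = sum(g != label and p == label for g, p in zip(gold, predicted))
    fn = sum(g == label and p != label for g, p in zip(gold, predicted))
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


def macro_f1(gold, predicted):
    """Unweighted mean of per-class F1 over every label in the gold set."""
    if len(gold) != len(predicted) or not gold:
        raise ValueError("need two non-empty label lists of equal length")
    labels = sorted(set(gold))
    return sum(per_class_f1(gold, predicted, lab) for lab in labels) / len(labels)


# Toy top-level tags: O = overt, C = covert, N = non-aggressive.
gold_tags = ["O", "O", "C", "C", "N", "N"]
predicted_tags = ["O", "C", "C", "C", "N", "O"]

score = macro_f1(gold_tags, predicted_tags)
```

Macro-averaging weights each class equally, which matters when one class (for example, non-aggressive posts) is much more frequent than the others.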

Future Directions

Future work prompted by this research may include improving the classification accuracy of machine learning models trained on this dataset. Expanding the corpus to other Indian languages, or incorporating multimodal data such as the images and videos prevalent on social media, could broaden what aggression detection can cover. Cross-cultural studies might also use similarly annotated resources to explore how aggression manifests across linguistic and cultural boundaries.

In conclusion, this corpus marks a significant step towards automatic aggression detection in Hindi-English code-mixed data, while also exposing challenges in both linguistic nuance and computational strategy that call for continued work. Researchers can build on it to better understand how online aggression manifests and how it can be mitigated.