A Large Self-Annotated Corpus for Sarcasm (1704.05579v4)

Published 19 Apr 2017 in cs.CL, cs.AI, and cs.LG

Abstract: We introduce the Self-Annotated Reddit Corpus (SARC), a large corpus for sarcasm research and for training and evaluating systems for sarcasm detection. The corpus has 1.3 million sarcastic statements -- 10 times more than any previous dataset -- and many times more instances of non-sarcastic statements, allowing for learning in both balanced and unbalanced label regimes. Each statement is furthermore self-annotated -- sarcasm is labeled by the author, not an independent annotator -- and provided with user, topic, and conversation context. We evaluate the corpus for accuracy, construct benchmarks for sarcasm detection, and evaluate baseline methods.

Citations (219)


Summary

The paper introduces SARC, a large-scale Reddit dataset containing 1.3 million sarcastic comments along with numerous non-sarcastic ones.
The methodology leverages community self-annotation via the '/s' marker and filtering techniques to achieve low false positive (1.0%) and false negative (2.0%) rates.
Benchmark analyses reveal the potential of both context-independent and context-aware models, providing clear avenues for enhancing NLP sarcasm detection.

An Analysis of "A Large Self-Annotated Corpus for Sarcasm"

The paper "A Large Self-Annotated Corpus for Sarcasm" by Khodak, Saunshi, and Vodrahalli introduces a novel dataset known as the Self-Annotated Reddit Corpus (SARC). SARC stands as a substantial contribution to the domain of NLP with specific emphasis on sarcasm detection. The corpus provides an extensive collection of sarcastic and non-sarcastic statements derived from Reddit, a popular social media platform. The authors claim that this corpus allows for more effective training and evaluation of sarcasm detection systems.

Overview and Methodology

The authors present SARC as notable chiefly for its size: 1.3 million sarcastic comments, ten times more than any previous dataset. The corpus also contains many times more non-sarcastic comments, which supports both balanced and unbalanced learning regimes. A distinguishing feature of SARC is that it is self-annotated: sarcasm is labeled by each comment's own author through a standardized marker rather than by an independent annotator, which avoids the bias that third-party annotators can introduce.

SARC's construction leverages Reddit’s community-driven conversations, where users often tag sarcastic comments with a "/s" marker. The dataset spans several years, between January 2009 and April 2017, capturing the nuances and the diverse stylistic features characteristic of Reddit discussions. The authors implement a series of filtering techniques to ensure data quality, such as excluding comments with non-ASCII characters and those directly replying to other sarcastic comments, thus minimizing annotation errors.
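The labeling and filtering steps described above can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: the `Comment` record layout, the function names, and the assumption that "/s" appears as the final whitespace-separated token are all hypothetical choices made for the example.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

SARCASM_MARKER = "/s"  # the self-annotation marker described in the text


@dataclass
class Comment:
    # Hypothetical record layout; not the corpus's actual schema.
    comment_id: str
    parent_id: Optional[str]
    body: str


def is_self_annotated(body: str) -> bool:
    """True if the comment ends with '/s' as its own token (an assumption)."""
    tokens = body.strip().split()
    return bool(tokens) and tokens[-1] == SARCASM_MARKER


def strip_marker(body: str) -> str:
    """Remove a trailing '/s' so a model cannot simply read the label."""
    tokens = body.strip().split()
    if tokens and tokens[-1] == SARCASM_MARKER:
        tokens = tokens[:-1]
    return " ".join(tokens)


def build_examples(comments: List[Comment]) -> List[Tuple[str, bool]]:
    """Return (text, is_sarcastic) pairs after applying the two filters."""
    sarcastic_ids = {c.comment_id for c in comments if is_self_annotated(c.body)}
    examples = []
    for c in comments:
        if not c.body.isascii():
            continue  # filter 1: drop comments with non-ASCII characters
        if c.parent_id in sarcastic_ids:
            continue  # filter 2: drop direct replies to sarcastic comments
        text = strip_marker(c.body)
        if not text:
            continue  # nothing left once the marker is removed
        examples.append((text, c.comment_id in sarcastic_ids))
    return examples
```

For instance, a thread in which a "/s" comment receives a reply would yield the sarcastic comment (with the marker stripped) but not the reply, since the reply's own label is considered unreliable.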

Corpus Evaluation and Benchmarks

To validate the utility of SARC, the authors undertake an evaluation of its contents. They assess the dataset's noise level by manually inspecting a sample of comments and estimating the rates of false positives and false negatives. The reported false positive rate of 1.0% and false negative rate of 2.0% indicate that the labels are largely reliable, while also showing that self-annotation does not eliminate labeling errors entirely.
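As a rough illustration of how such noise rates can be estimated from a manually inspected sample, consider the sketch below. The summary does not state which denominators the authors used, so this example assumes the standard confusion-matrix definitions with the manual judgment treated as the reference label; the function name and input format are hypothetical.

```python
def noise_rates(sample):
    """Estimate label noise from (corpus_label, manual_label) boolean pairs.

    Assumes the manual label is the reference, and uses the standard
    definitions FPR = FP / (FP + TN) and FNR = FN / (FN + TP).
    """
    tp = fp = tn = fn = 0
    for corpus_label, manual_label in sample:
        if corpus_label and manual_label:
            tp += 1
        elif corpus_label and not manual_label:
            fp += 1  # marked sarcastic in the corpus, judged not sarcastic
        elif manual_label:
            fn += 1  # unmarked in the corpus, judged sarcastic
        else:
            tn += 1
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    fnr = fn / (fn + tp) if (fn + tp) else 0.0
    return fpr, fnr
```

Because sarcastic comments are a small minority of the corpus, even a low false negative rate can correspond to many unmarked sarcastic comments in absolute terms, which is worth keeping in mind when sampling negatives.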

To support progress in sarcasm detection, the paper proposes several benchmarks built on SARC, covering both context-independent and context-aware settings. The authors evaluate baseline techniques such as bag-of-$n$-grams and sentence embeddings. The results show that these methods offer a starting point but leave clear room for improvement, particularly through better use of conversational context.
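To make the bag-of-$n$-grams baseline concrete, here is a minimal feature-extraction sketch using only the standard library. The whitespace tokenization, lowercasing, the `resp::` and `ctx::` prefixes, and the idea of adding parent-comment features for the context-aware setting are illustrative assumptions, not the paper's exact setup.

```python
from collections import Counter


def ngrams(tokens, n):
    """All contiguous n-grams of a token list, joined by spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def bag_of_ngrams(text, max_n=2):
    """Count every 1..max_n-gram in a lowercased, whitespace-split text."""
    tokens = text.lower().split()
    counts = Counter()
    for n in range(1, max_n + 1):
        counts.update(ngrams(tokens, n))
    return counts


def featurize(response, parent=None, max_n=2):
    """Context-independent features, plus parent features when provided.

    Prefixes keep response and context n-grams in separate feature spaces
    (one hypothetical way to expose conversational context to a classifier).
    """
    feats = Counter(
        {f"resp::{k}": v for k, v in bag_of_ngrams(response, max_n).items()}
    )
    if parent is not None:
        feats.update(
            {f"ctx::{k}": v for k, v in bag_of_ngrams(parent, max_n).items()}
        )
    return feats
```

The resulting sparse counts could then be fed to any linear classifier; calling `featurize(response)` without a parent gives the context-independent variant.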

Implications and Future Directions

The availability of SARC has practical implications for several NLP applications, including dialogue systems and sentiment analysis, where understanding sarcasm can drastically improve system accuracy. From a theoretical standpoint, the dataset offers fertile ground for advancing research on conversational context and user-specific linguistic patterns in sarcasm.

Future developments could focus on refined feature extraction methodologies that better capture the implicit and situational intricacies of sarcasm. Neural architectures leveraging context from both prior and ensuing conversational turns can be explored to outperform current benchmarks.

Conclusion

"A Large Self-Annotated Corpus for Sarcasm" adds a valuable linguistic resource for NLP research on sarcasm. The size and scope of SARC allow researchers to tackle the nuanced challenge of sarcasm detection more effectively, and the corpus provides a solid platform for methodological work on sarcasm comprehension in AI systems.
