Inducing Domain-Specific Sentiment Lexicons from Unlabeled Corpora (1606.02820v2)

Published 9 Jun 2016 in cs.CL

Abstract: A word's sentiment depends on the domain in which it is used. Computational social science research thus requires sentiment lexicons that are specific to the domains being studied. We combine domain-specific word embeddings with a label propagation framework to induce accurate domain-specific sentiment lexicons using small sets of seed words, achieving state-of-the-art performance competitive with approaches that rely on hand-curated resources. Using our framework we perform two large-scale empirical studies to quantify the extent to which sentiment varies across time and between communities. We induce and release historical sentiment lexicons for 150 years of English and community-specific sentiment lexicons for 250 online communities from the social media forum Reddit. The historical lexicons show that more than 5% of sentiment-bearing (non-neutral) English words completely switched polarity during the last 150 years, and the community-specific lexicons highlight how sentiment varies drastically between different communities.

Authors (4)

William L. Hamilton (46 papers)
Kevin Clark (16 papers)
Jure Leskovec (233 papers)
Dan Jurafsky (118 papers)

Citations (325)


Summary

Inducing Domain-Specific Sentiment Lexicons from Unlabeled Corpora

The paper "Inducing Domain-Specific Sentiment Lexicons from Unlabeled Corpora" presents a method for building sentiment lexicons tuned to specific domains, using only unlabeled corpora and small sets of seed words. This matters for computational social science (CSS), where a word's sentiment depends on the domain in which it is used.

Methodology

The authors introduce a framework that combines domain-specific word embeddings with label propagation. It induces domain-specific sentiment lexicons from small sets of seed words and performs competitively with methods that rely on hand-curated resources.

Key steps in the framework include:

Construction of Lexical Graphs: Distributional word embeddings, learned from co-occurrence statistics in the domain corpus, place each word in a vector space. The words then become nodes in a weighted lexical graph whose edge weights are derived from the cosine similarity between word vectors.
Label Propagation: Random walks over the lexical graph propagate sentiment labels outward from the seed words. Each word's polarity is derived from the probability that a random walk starting from a seed set reaches it.
Confidence Estimation: Bootstrap sampling over the seed words yields confidence scores for the propagated sentiment values, indicating how robust each score is to the choice of seeds.
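The three steps above can be sketched on a toy example. This is a minimal illustration under stated assumptions, not the authors' implementation: the 2-D vectors, the neighbour count k, the restart weight beta, the use of clipped cosine similarity as the edge weight, the ratio used to combine the two walks, and all function names are hypothetical choices made for the example.

```python
import math
import random
import statistics

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def build_graph(vectors, k=3):
    """Link each word to its k most similar words (assumed edge weight: clipped cosine)."""
    graph = {w: {} for w in vectors}
    for w in vectors:
        sims = sorted(((cosine(vectors[w], vectors[o]), o)
                       for o in vectors if o != w), reverse=True)
        for sim, o in sims[:k]:
            graph[w][o] = graph[o][w] = max(sim, 0.0)  # keep edges symmetric
    return graph

def random_walk(graph, seeds, beta=0.85, iters=50):
    """Random walk with restart at the seeds: p <- beta * T p + (1 - beta) * s."""
    restart = {w: (1.0 / len(seeds) if w in seeds else 0.0) for w in graph}
    p = dict(restart)
    for _ in range(iters):
        nxt = {w: (1 - beta) * restart[w] for w in graph}
        for w, edges in graph.items():
            total = sum(edges.values())
            if total == 0:
                continue  # isolated node: its mass is dropped in this toy version
            for o, weight in edges.items():
                nxt[o] += beta * p[w] * weight / total
        p = nxt
    return p

def polarity(graph, pos_seeds, neg_seeds):
    """Score in [0, 1]: share of walk mass that comes from the positive seeds."""
    p_pos = random_walk(graph, pos_seeds)
    p_neg = random_walk(graph, neg_seeds)
    return {w: (p_pos[w] / (p_pos[w] + p_neg[w]) if p_pos[w] + p_neg[w] > 0 else 0.5)
            for w in graph}

def bootstrap(graph, pos_seeds, neg_seeds, subset_size=1, runs=20, seed=0):
    """Re-run propagation on random seed subsets; return (mean, std) per word."""
    rng = random.Random(seed)
    samples = {w: [] for w in graph}
    for _ in range(runs):
        pos = rng.sample(pos_seeds, subset_size)
        neg = rng.sample(neg_seeds, subset_size)
        for w, s in polarity(graph, pos, neg).items():
            samples[w].append(s)
    return {w: (statistics.mean(v), statistics.pstdev(v)) for w, v in samples.items()}

# Made-up 2-D "embeddings"; real embeddings would be learned from the domain corpus.
vectors = {
    "good": (1.0, 0.1), "great": (0.9, 0.2), "nice": (0.8, 0.3),
    "bad": (0.1, 1.0), "awful": (0.2, 0.9), "poor": (0.3, 0.8),
}
graph = build_graph(vectors)
scores = polarity(graph, ["good", "great"], ["bad", "awful"])
conf = bootstrap(graph, ["good", "great"], ["bad", "awful"])
for word in vectors:
    print(word, round(scores[word], 3), conf[word])
```

In this sketch a word with a high mean score and a low standard deviation across bootstrap runs is one whose polarity does not hinge on any single seed word, which is the robustness property the confidence step is meant to capture.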

Empirical Studies and Findings

The authors conducted two large-scale empirical studies. First, they induced community-specific sentiment lexicons for 250 Reddit communities and found that sentiment varies drastically between communities. Second, they induced historical sentiment lexicons covering 150 years of English and found that more than 5% of sentiment-bearing (non-neutral) words completely switched polarity over that period.

Key findings from the studies include:

Detection of more than 5% of non-neutral words switching polarity over 150 years.
Identification of substantial sentiment variation among Reddit communities, illustrated through words like "soft" showing positive sentiment in some contexts but negative in others.
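A statistic like the share of words that switch polarity can be computed by comparing two induced lexicons. The sketch below is only an illustration: the signed score scale, the neutral_band threshold, the function name, and the placeholder words and values are all assumptions made for the example and are not taken from the paper.

```python
def switched_fraction(early, late, neutral_band=0.1):
    """Share of words that are non-neutral in both lexicons and changed sign.

    Scores are assumed to be signed (negative = negative sentiment); a word
    counts as neutral when its absolute score is at most neutral_band.
    """
    shared = [w for w in early
              if w in late
              and abs(early[w]) > neutral_band
              and abs(late[w]) > neutral_band]
    if not shared:
        return 0.0
    switched = [w for w in shared if early[w] * late[w] < 0]
    return len(switched) / len(shared)

# Made-up scores for placeholder words in two time periods.
early = {"w1": 0.8, "w2": -0.6, "w3": 0.05, "w4": 0.4}
late = {"w1": 0.7, "w2": 0.5, "w3": -0.3, "w4": 0.3}
fraction = switched_fraction(early, late)
print(fraction)
```

Here w3 is excluded because it is neutral in the earlier lexicon, so only w2 counts as a switch among the three remaining words.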

Implications and Future Directions

The research has both practical and theoretical implications. Practically, automatically inducing domain-specific sentiment lexicons gives CSS and related fields a way to run sentiment analysis that reflects how words are actually used in a given corpus. Theoretically, the findings show that word sentiment shifts over time and differs between communities.

Future developments could include further integration with supervised domain-adaptation methods to refine domain-specific accuracy. Additionally, expanding this framework to other languages and scripts would enhance its applicability in diverse linguistic contexts.

In conclusion, the paper contributes a method for inducing domain-specific sentiment lexicons from unlabeled text, along with released historical and community-specific lexicons that support further study of sentiment variation in linguistics and the social sciences.
