Exploring Sentiment Analysis with LLMs: A Detailed Study
Understanding Sentiment as a Complex Measure
Sentiment analysis is often presented as a straightforward text-analysis task. This paper argues, however, that sentiment is multifaceted and frequently conflated with related constructs such as emotional valence and stance (i.e., opinion). The confusion stems from differing definitions of 'sentiment' across domains and tools: when we talk about sentiment analysis, we may actually be measuring a mix of emotions, opinions, and other textual dimensions.
The research conducts a thorough investigation by testing how well three different LLMs – GPT-4 Turbo, Claude-3 Opus, and Llama-3 8B – understand and perform sentiment analysis compared with the more precisely defined constructs of emotional valence and stance.
Experiment Setup
- Data: The paper uses two datasets:
  - A set of 2,390 hand-labeled tweets about politicians, annotated for stance.
  - A set of 2,000 tweets from the SemEval-2017 sentiment challenge, labeled for emotional valence.
- Methodology: Each document was run through each LLM under three different prompts (a sketch of this setup follows the list):
  - Sentiment classification
  - Emotional valence classification
  - Stance classification
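The paper's exact prompt wording is not reproduced here; the templates and the `classify` helper below are illustrative assumptions, shown against the OpenAI Python client as one possible backend for the three models.

```python
# Minimal sketch of the three-prompt design. The template wording and the
# classify() helper are assumptions for illustration, not the paper's prompts.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPTS = {
    "sentiment": "Classify the sentiment of this tweet as positive, neutral, or negative.\n\nTweet: {text}\nLabel:",
    "valence": "Classify the emotional valence of this tweet as positive, neutral, or negative.\n\nTweet: {text}\nLabel:",
    "stance": ("Classify the author's stance toward the politician mentioned in this tweet "
               "as favorable, neutral, or unfavorable.\n\nTweet: {text}\nLabel:"),
}

def classify(text: str, construct: str, model: str = "gpt-4-turbo") -> str:
    """Run one document through one of the three classification prompts."""
    response = client.chat.completions.create(
        model=model,
        temperature=0,  # keep the labeling as deterministic as possible
        messages=[{"role": "user", "content": PROMPTS[construct].format(text=text)}],
    )
    return response.choices[0].message.content.strip().lower()

# Every document receives all three labels, which are then scored
# against the hand-coded gold standard.
labels = {construct: classify("Great speech by the senator today!", construct)
          for construct in PROMPTS}
```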
The main performance metric was the Matthews correlation coefficient (MCC), a robust summary statistic that accounts for all four cells of the confusion matrix (true and false positives and negatives) and is therefore resilient to class imbalance.
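For the binary case, MCC = (TP·TN − FP·FN) / √((TP+FP)(TP+FN)(TN+FP)(TN+FN)); scikit-learn provides the multiclass generalization directly. A minimal scoring sketch, with invented labels:

```python
# Scoring a prompt's predictions against gold labels with MCC.
# The label vectors here are invented; only the metric call matters.
from sklearn.metrics import matthews_corrcoef

gold = ["positive", "neutral", "negative", "negative", "positive", "neutral"]
pred = ["positive", "negative", "negative", "neutral", "positive", "neutral"]

# +1 = perfect agreement, 0 = no better than chance, -1 = total disagreement
print(matthews_corrcoef(gold, pred))
```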
Key Findings and Insights
Role of Specific Prompts in Model Performance
- For Stance: When models were prompted directly for stance classification, they substantially outperformed the same models given sentiment or valence prompts. This shows that prompt specificity can materially improve the reliability of LLM classification, a crucial point for researchers aiming to extract precise constructs from text data.
- For Emotional Valence: Prompting for 'sentiment' already recovered emotional valence reasonably well, since the models largely equate the two, but asking directly for valence classification still improved results. This supports the idea that removing ambiguity from prompts enhances model performance.
Understanding Sentiment in Models
- Sentiment as Emotional Valence: Across all three models, sentiment was predominantly interpreted as emotional valence. The alignment was imperfect, however, suggesting that the models also associate 'sentiment' with other constructs absorbed from their training data.
- Discrepancy in Stance vs. Sentiment: Sentiment and stance classifications disagreed notably, especially for documents labeled as neutral in sentiment, suggesting that sentiment analysis is an unreliable proxy for stance or opinion mining.
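One way to surface that disagreement is a simple cross-tabulation of the two prompt outputs. The sketch below uses invented labels to show how the 'neutral' sentiment row can conceal firmly held stances:

```python
# Hypothetical sketch: cross-tabulate sentiment vs. stance labels to locate
# the disagreement. The data frame contents are invented for illustration.
import pandas as pd

df = pd.DataFrame({
    "sentiment": ["neutral", "negative", "neutral", "positive", "neutral"],
    "stance":    ["against", "against",  "favor",   "favor",    "against"],
})

# Rows = sentiment labels, columns = stance labels. Mass in the 'neutral'
# sentiment row under a non-neutral stance is exactly the failure mode the
# paper flags: a neutral-sounding tweet that still takes a side.
print(pd.crosstab(df["sentiment"], df["stance"]))
```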
Practical Implications
These findings should push researchers and practitioners to reconsider using generic 'sentiment analysis' as a catch-all for diverse text-analysis tasks. The paper advocates specificity: prompting models with clear, precisely defined constructs yields more accurate and theoretically sound analyses.
Theoretical Implications and Future Directions
Theoretically, this paper challenges the prevailing broad application of sentiment analysis and suggests that the field could benefit from distinguishing more sharply between emotional valence, stance, and other dimensions. Future research might further explore how different configurations of LLM prompts can enhance the precision of text classification, and experiment with a variety of models to verify these findings across broader contexts and datasets.
Conclusion
Overall, this detailed investigation into how LLMs interpret and perform sentiment analysis offers valuable insights for advancing the precision and effectiveness of text analysis techniques. As LLMs continue to evolve, harnessing their capabilities with more nuanced prompts could significantly impact both academic research and practical applications in fields reliant on text data.