Sentiment Analysis with LLMs
- Sentiment Analysis with LLMs is the process of using transformer-based models to classify and quantify affective text content across diverse languages and domains.
- Key contributions include integrating domain adaptation, parameter-efficient fine-tuning, and ensemble techniques that boost accuracy and robustness in varied settings.
- Empirical benchmarks demonstrate state-of-the-art performance in multilingual and multimodal sentiment tasks while addressing challenges like sarcasm and ambiguous expressions.
Sentiment analysis with LLMs investigates the capacity of transformer-based models, typically with parameter counts on the order of 100 million or more, to classify and quantify the affective content of text across domains, languages, and label granularities. LLMs have established new state-of-the-art benchmarks for both monolingual and multilingual sentiment classification, but they also impose distinct requirements on data preparation, training methodology, evaluation, architectural adaptation, and robust deployment. This article synthesizes key technical approaches, empirical findings, and open challenges, drawing on recent arXiv research to present a comprehensive examination for technically advanced readers.
1. Model Architectures, Taxonomies, and Adaptation Strategies
LLMs suitable for sentiment analysis span encoder-only (BERT-style), encoder–decoder (T5-style), and decoder-only (GPT-style) transformer architectures. Bidirectional models (e.g., BERT, RoBERTa, FinBERT, XLM-R) and autoregressive architectures (e.g., GPT-2/3/4, LLaMA, OPT) are both employed, each with characteristic adaptation patterns:
- Encoder-based LLMs: Sentiment analysis is cast as sequence classification, using a [CLS] token representation followed by a task-specific classification head; cross-entropy loss over discrete sentiment labels is typical (Bilehsavar et al., 28 Sep 2025, Kirtac et al., 5 Mar 2025).
- Decoder-only/GPT-style LLMs: Sentiment classification is formulated either as next-token generation with sentiment tokens (prompt-based) or as instruction-response generation, converting the classification to a conditional language modeling problem (Zhang et al., 2023, Rusnachenko et al., 2024).
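As a schematic illustration of the encoder-based pattern, the plain-Python sketch below stands in for the [CLS]-pooling classification head and its cross-entropy training loss; the vectors, weights, and label set are toy values, not taken from any cited model:

```python
import math

def classify_cls(cls_vec, weights, bias, labels=("negative", "neutral", "positive")):
    """Linear classification head over a pooled [CLS] representation.

    cls_vec: hidden vector for the [CLS] token (list of floats)
    weights: one weight row per sentiment label; bias: one scalar per label
    """
    logits = [sum(w * h for w, h in zip(row, cls_vec)) + b
              for row, b in zip(weights, bias)]
    # softmax over the discrete sentiment labels
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    probs = [e / sum(exps) for e in exps]
    return dict(zip(labels, probs))

def cross_entropy(probs, gold_label):
    """Training loss: negative log-probability of the gold label."""
    return -math.log(probs[gold_label])
```

In a real encoder model the [CLS] vector comes from the final transformer layer and the head is trained jointly with (or on top of) the backbone; here both are reduced to explicit lists so the computation is visible.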
Parameter-efficient fine-tuning methods such as LoRA (low-rank adaptation) and Adapter modules are increasingly used to adapt large models under compute constraints, with rank-search and dynamic allocation frameworks (e.g., DARSE) proven effective for optimizing utility relative to resource budgets (Ding et al., 2024).
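The low-rank update at the heart of LoRA can be sketched in a few lines. This toy implementation (plain Python, illustrative names; not the DARSE allocation scheme itself) shows the frozen base projection, the scaled B·A adapter path, and the source of the parameter savings:

```python
def matvec(M, v):
    """Dense matrix-vector product over nested lists."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def lora_forward(x, W, A, B, alpha, r):
    """LoRA projection: y = W x + (alpha / r) * B (A x).

    W (d_out x d_in) stays frozen; only the low-rank factors
    A (r x d_in) and B (d_out x r) are trained. B is initialized
    to zeros so training starts exactly at the pretrained model.
    """
    base = matvec(W, x)
    delta = matvec(B, matvec(A, x))
    scale = alpha / r
    return [b + scale * d for b, d in zip(base, delta)]

def lora_params(d_in, d_out, r):
    """Trainable parameters: r * (d_in + d_out) instead of d_in * d_out."""
    return r * (d_in + d_out)
```

For a 768×768 attention projection, rank r=8 trains 12,288 parameters instead of 589,824; rank-search frameworks such as DARSE tune r per layer under a global budget.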
A key structural trend is ensembling: probability averaging across multiple LLMs, explicit late fusion via probabilistic graphical models (e.g., Bayesian Network LLM Fusion), and majority-voting fusion consistently yield superior accuracy and robustness across varying input domains and languages (Bilehsavar et al., 28 Sep 2025, Amirzadeh et al., 30 Oct 2025, Mabokela et al., 21 Nov 2025).
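A minimal sketch of the two simplest fusion rules, soft (probability-averaging) and hard (majority) voting, assuming a shared three-way label set across models:

```python
from collections import Counter

LABELS = ["negative", "neutral", "positive"]

def soft_vote(model_probs):
    """Probability averaging (soft voting): mean class distribution
    over all models, then argmax."""
    avg = [sum(col) / len(model_probs) for col in zip(*model_probs)]
    return LABELS[avg.index(max(avg))], avg

def hard_vote(model_probs):
    """Majority voting over each model's argmax prediction."""
    preds = [LABELS[p.index(max(p))] for p in model_probs]
    return Counter(preds).most_common(1)[0][0]
```

Bayesian-network late fusion generalizes this by learning dependencies between model outputs rather than treating them as exchangeable votes.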
2. Data Preparation, Labeling Schemes, and Domain Transfer
Performance of LLM-based sentiment analysis is closely tied to data domain, label schema, and multilingual coverage:
- Datasets span social media (e.g., Multilingual Sentiment Analysis Kaggle, Sentiment140, SAGovTopicTweets), customer/product reviews (Amazon, Yelp, Lithuanian reviews), and specialized domains (financial news, RuSentNE-2023 for Russian TSA, Chinese FinChina SA) (Bilehsavar et al., 28 Sep 2025, Vileikytė et al., 2024, Lan et al., 2023, Mabokela et al., 21 Nov 2025).
- Class granularity: Coarse three-way classification (positive/neutral/negative) remains dominant, but fine-grained approaches (entity- and aspect-level, five-star regression, warning-type detection) are increasingly important (Zhou et al., 2024, Vileikytė et al., 2024, Lan et al., 2023).
- Preprocessing: Typical pipelines include language filtration, lowercasing, removal of URLs/user tokens, de-duplication, sequence-length truncation, and task-specific retention of hashtags and emojis (Bilehsavar et al., 28 Sep 2025, Vileikytė et al., 2024).
- For label imbalance, stratified sampling or down-sampling is essential to maintain class balance for robust evaluation (Vileikytė et al., 2024).
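The preprocessing and balancing steps above can be sketched as follows; the regex patterns and the down-sample-to-minority policy are illustrative simplifications of the cited pipelines, not their exact implementations:

```python
import random
import re
from collections import defaultdict

URL = re.compile(r"https?://\S+")
USER = re.compile(r"@\w+")

def preprocess(texts):
    """Lowercase, strip URLs/user tokens, collapse whitespace, de-duplicate.
    Hashtags and emojis are deliberately left intact."""
    seen, out = set(), []
    for t in texts:
        t = USER.sub("", URL.sub("", t)).lower().strip()
        t = re.sub(r"\s+", " ", t)
        if t and t not in seen:
            seen.add(t)
            out.append(t)
    return out

def downsample(examples, seed=0):
    """Down-sample every class to the minority-class count."""
    by_label = defaultdict(list)
    for text, label in examples:
        by_label[label].append((text, label))
    n = min(len(v) for v in by_label.values())
    rng = random.Random(seed)
    balanced = []
    for v in by_label.values():
        balanced.extend(rng.sample(v, n))
    return balanced
```

Down-sampling discards data; in practice stratified sampling at evaluation time, or class-weighted losses at training time, are common alternatives when the majority class is informative.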
Multilingual sentiment remains a major challenge. Multilingual LLMs (mBERT, XLM-R, GPT-4) have established robust test performance exceeding 86% accuracy across 6–10 languages, and ensembling further elevates language-wise accuracy and F1-scores, mitigating resource disparity effects (Bilehsavar et al., 28 Sep 2025, Mabokela et al., 21 Nov 2025).
3. Training Methodologies: Supervised, Semi-Supervised, and Prompt-based Protocols
Sentiment analysis with LLMs supports multiple learning paradigms:
- Supervised fine-tuning is standard, using cross-entropy losses over labeled data (discrete or continuous) (Bilehsavar et al., 28 Sep 2025, Inserte et al., 2024).
- Prompt-based zero- and few-shot learning leverages explicit natural language templates for in-context classification; major LLMs (GPT-4, LLaMA, Flan-T5) show strong zero-shot accuracy, especially when prompts and output formats are standardized (Mabokela et al., 21 Nov 2025, Zhang et al., 2023, Zhou et al., 2024).
- Parameter-efficient tuning (e.g., LoRA, Adapter, QLoRA) is critical for scaling LLMs. Dynamic rank allocation (DARSE) yields 4.3% relative accuracy gains and 15% lower regression MSE by optimizing the injection of low-rank adapters on a per-layer basis under parameter constraints (Ding et al., 2024).
- Semi-supervised protocols: Approaches such as Semantic Consistency Regularization with LLMs generate augmented paraphrases or concept-based rewrites of unlabeled data, enforcing consistency losses between original and LLM-augmented predictions and using class re-assembly (class-space shrinking) to harvest supervisory signal from ambiguous examples (Li et al., 29 Jan 2025).
- Domain adaptation: Further pre-training on target-domain text, instruction-based multi-task tuning, and synthetic instruction augmentation (e.g., generating new financial task examples using stronger LLMs) are proven to be effective for domain transfer—small models (<1.5B params) can match models 10–50× larger with such strategies (Inserte et al., 2024, Zhang et al., 2023).
- Weak supervision: Leveraging discourse markers as weak labels during inter-training substantially boosts few/zero-shot transfer, with gains up to 12.6 percentage points in financial settings, and automatic discovery of domain-specific DMs further enhances results (Ein-Dor et al., 2022).
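For the prompt-based protocols above, a typical pattern is a fixed natural-language template plus defensive parsing of the free-form model output back into the label space; the template and fallback policy below are illustrative examples, not taken from any cited paper:

```python
PROMPT = (
    "Classify the sentiment of the following text as positive, neutral, "
    "or negative. Answer with a single word.\n\nText: {text}\nSentiment:"
)
LABELS = ("positive", "neutral", "negative")

def build_prompt(text):
    """Render the fixed zero-shot template for one input."""
    return PROMPT.format(text=text)

def parse_label(raw, default="neutral"):
    """Map free-form LLM output back onto the fixed label set;
    fall back to a default when no label is recognizable."""
    raw = raw.strip().lower()
    for label in LABELS:
        if label in raw:
            return label
    return default
```

Standardizing both the template and the parser is what makes zero-shot accuracy comparable across models, since raw generations vary in casing, punctuation, and verbosity.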
4. Evaluation Protocols and Empirical Benchmarks
Standard classification metrics (Accuracy, Precision, Recall, Macro- and Micro-F1, ROC-AUC) are complemented by task-specific measures (Exact-Match for ABSA, MSE/MAE/R² for regression) (Bilehsavar et al., 28 Sep 2025, Zhou et al., 2024, Ding et al., 2024). Comprehensive multilingual and domain-specific evaluation mandates per-class, per-language, and per-task breakdowns. Key empirical observations include:
- Ensemble and fusion techniques consistently provide 1–6 percentage point gains in F1 or accuracy over strongest individual LLMs, and more robust error profiles (especially on low-resource and code-switched inputs) (Bilehsavar et al., 28 Sep 2025, Amirzadeh et al., 30 Oct 2025, Mabokela et al., 21 Nov 2025).
- In aspect-based sentiment analysis (ABSA), parameter-efficient fine-tuned LLMs (e.g., LoRA-tuned LLaMA3-8B) achieve average F1=85.9, surpassing fully fine-tuned SLMs by over 3 points. Zero/few-shot in-context learning using prompt demonstrations yields competitive results in low-resource settings, especially with retrieval-based selection (BM25/SimCSE) for relevant exemplars (Zhou et al., 2024).
- Fine-tuning on task-specific instructions consistently outperforms document-level pre-training or non-instructional fine-tuning, especially on numerical and context-rich sentiment judgments (Zhang et al., 2023, Inserte et al., 2024).
- Zero-shot LLMs reach parity with strong supervised baselines on classical (IMDB, Yelp, SST-2) and certain financial sentiment tasks, but struggle on structured, multi-element or aspect-dependent (ABSA, ASTE, ASQP) evaluations without further adaptation or prompt tuning (Zhang et al., 2023, Zhou et al., 2024).
- For sentiment tasks in low-resource languages, multilingual LLMs fine-tuned on labeled data achieve sizeable gains: e.g., DistilBERT outperforms GPT-4 zero-shot by 8%–11% accuracy on Lithuanian five-star sentiment (Vileikytė et al., 2024).
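The macro-F1 score reported throughout these benchmarks averages per-class F1 uniformly, so minority classes carry equal weight; a minimal reference implementation:

```python
def macro_f1(gold, pred, labels):
    """Unweighted mean of per-class F1 scores.

    gold, pred: parallel lists of label strings
    labels: the full label set to average over
    """
    f1s = []
    for c in labels:
        tp = sum(1 for g, p in zip(gold, pred) if g == c and p == c)
        fp = sum(1 for g, p in zip(gold, pred) if g != c and p == c)
        fn = sum(1 for g, p in zip(gold, pred) if g == c and p != c)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)
```

Micro-F1, by contrast, pools counts across classes and therefore tracks majority-class performance; reporting both is why per-class breakdowns matter in imbalanced sentiment corpora.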
5. Performance Limitations, Error Analysis, and Robustness Interventions
Despite strong aggregate performance, LLM-based sentiment analysis exhibits recurrent error modes and sensitivity to certain linguistic phenomena:
- Adjacent-class confusion persists in multiclass settings (e.g., Negative ↔ Neutral), and code-switched or morphologically complex inputs suffer increased misclassification rates (Bilehsavar et al., 28 Sep 2025, Vileikytė et al., 2024).
- Sarcasm, irony, and rhetorical questions are systematically misclassified, especially with narrow-domain fine-tuning or limited exposure to such phenomena (Bhargava et al., 8 Apr 2025, Liu et al., 2024).
- Ambiguity and idiomaticity defeat otherwise strong LLMs, with model outputs drifting under prompt bias or insufficiently representative demonstrations (Liu et al., 2024).
- Robustness interventions include:
  - Sarcasm removal via LLM paraphrasing (GPT-3.5) can raise sarcastic-tweet classification accuracy by 17–21 percentage points (Bhargava et al., 8 Apr 2025).
  - Adversarial text augmentation increases resilience to minor surface perturbations, elevating sarcastic-tweet accuracy from ≈36% to ≈84% (Bhargava et al., 8 Apr 2025).
  - Data paraphrasing recovers low-confidence labels: paraphrasing moved 40% of ambiguous samples to a high-confidence majority and delivered 3–6% absolute accuracy uplifts on nuclear-policy tweets (Bhargava et al., 8 Apr 2025).
  - Prompt normalization and format alignment reduce output variance and format errors (especially for ABSA and entity-level/targeted TSA tasks) (Zhou et al., 2024, Zhang et al., 2023).
  - Fusion/ensemble aggregation mitigates topic- and language-specific error patterns, notably performance collapse in low-resource languages or domain-skewed datasets (Mabokela et al., 21 Nov 2025, Amirzadeh et al., 30 Oct 2025).
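The paraphrase-recovery intervention can be sketched as a confidence-gated majority vote over LLM paraphrases; here `classify` and `paraphrase` are hypothetical stand-ins for a sentiment classifier returning (label, confidence) and a paraphrasing-LLM call, not APIs from the cited work:

```python
from collections import Counter

def recover_label(text, classify, paraphrase, threshold=0.7, n=3):
    """If the classifier is unsure about the original text, classify
    n LLM-generated paraphrases as well and take the majority label
    over the original prediction plus all paraphrase predictions."""
    label, conf = classify(text)
    if conf >= threshold:
        return label  # confident prediction: no extra model calls
    votes = [label] + [classify(paraphrase(text, i))[0] for i in range(n)]
    return Counter(votes).most_common(1)[0][0]
```

The confidence gate keeps the extra paraphrase and classification calls (and their API cost) restricted to the ambiguous minority of inputs.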
6. Multimodal and Multilingual Sentiment Analysis
Multimodal sentiment analysis incorporates LLMs for text-centric fusion of linguistic, visual, audio, and paralinguistic cues:
- Fusion strategies: Modal transformation (prepending captions or emotion tags to text prompts), cross-attention and co-attention modules for integrating vision and audio, and tensor fusion for explicit high-order interaction (Yang et al., 2024).
- Fine-tuning vs. prompting: Full multimodal fine-tuning achieves highest accuracy but is compute-intensive, while parameter-efficient (adapter, LoRA) and zero/few-shot prompt-based approaches offer flexible, rapid deployment with minimal engineering (Yang et al., 2024).
- Performance: Multimodal LLMs achieve up to 69.8% accuracy on image-text sentiment classification (0-shot), with strong results on fine-grained aspect sentiment via both fine-tuning and prompting (Yang et al., 2024).
- Lingual diversity: Recent studies have demonstrated near-human and cross-lingual performance with LLM majority-vote fusion for sentiment on South African languages and multi-domain multilingual tweets (Mabokela et al., 21 Nov 2025, Bilehsavar et al., 28 Sep 2025).
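Modal transformation, the simplest text-centric fusion strategy described above, amounts to serializing non-text cues into the prompt before querying a text-only LLM; a minimal sketch, with illustrative tag formats:

```python
def text_centric_prompt(text, caption=None, audio_tags=None):
    """Fold non-text modalities into the text prompt as an image
    caption and paralinguistic tags, so a text-only LLM can use them."""
    parts = []
    if caption:
        parts.append("[image: %s]" % caption)
    if audio_tags:
        parts.append("[voice: %s]" % ", ".join(audio_tags))
    parts.append(text)
    return " ".join(parts)
```

Cross-attention and tensor-fusion approaches instead integrate modality embeddings inside the model, at higher engineering and compute cost.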
7. Practical Applications and Deployment Considerations
LLM-driven sentiment analysis powers a diverse array of applications:
- Financial markets: Sentiment-linked trading systems combining LLM classifiers (FinBERT, GPT-2, RoBERTa) with technical or time-series indicators (MACD, ARIMA/ETS) have been shown to outperform buy-and-hold strategies on the S&P 500 by 4–6 percentage points during volatile periods, with substantially higher Sharpe ratios (Liu et al., 13 Jul 2025, Kirtac et al., 5 Mar 2025).
- Social monitoring: Fused LLM architectures enable real-time detection of social challenges in low-resource languages, demonstrating less than 1% classification error using ensemble fusion (Mabokela et al., 21 Nov 2025).
- Explainable AI: Probabilistic late-fusion using Bayesian Network LLM Fusion yields transparent, counterfactually explorable models, supporting regulatory and mission-critical sentiment deployments (Amirzadeh et al., 30 Oct 2025).
- Econometric dashboards and quantitative trading: Modular hybrid architectures deploy BERT-style LLMs for rapid binary classification and GPT-4-like models for narrative and contextual summary, offering a human-readable sentiment trace (Kirtac et al., 5 Mar 2025).
Best practices for deployment include continual in-domain adaptation, hybrid model chaining, robust evaluation against prompt and data variation, and API cost/latency management in fusion strategies. Fusing LLM outputs across multiple architectures is essential for reliability in multilingual and adversarial settings.
These findings collectively establish that, through a combination of domain adaptation, prompt and architecture engineering, data-centric robustness interventions, and principled ensemble/fusion strategies, LLMs serve as the technical backbone for state-of-the-art sentiment analysis across an unprecedented range of languages, domains, and modalities. However, they also affirm the persistent research need for explicit treatment of sarcasm, cultural nuance, and structured aspect/task formulations in order to close the remaining gaps between generic language modeling and domain-specific sentiment inference (Bilehsavar et al., 28 Sep 2025, Mabokela et al., 21 Nov 2025, Inserte et al., 2024, Ding et al., 2024, Bhargava et al., 8 Apr 2025, Zhou et al., 2024, Vileikytė et al., 2024, Li et al., 29 Jan 2025, Amirzadeh et al., 30 Oct 2025, Kirtac et al., 5 Mar 2025).