SentiTurca: Turkish Sentiment Analysis Resources
- SentiTurca is a Turkish sentiment analysis resource comprising a benchmark dataset and a polarity lexicon tailored for the language's agglutinative structure and rich morphology.
- It integrates native texts from social media, e-commerce, and forums across multiple domains to ensure linguistic naturalness, diversity, and reproducibility.
- Its lexicon and toolkit combine unsupervised, semi-supervised, and morphological analysis techniques, reaching about 91% accuracy on Turkish movie reviews and about 81% on Turkish Twitter data.
SentiTurca is the name of two distinct, influential resources in Turkish sentiment analysis and natural language understanding: (1) a Turkish-native sentiment benchmark dataset introduced alongside TrGLUE for evaluating model performance across multiple sentiment tasks, and (2) a polarity lexicon and toolkit combining unsupervised, semi-supervised, and morphological components for lexicon-based sentiment analysis. Both resources address the specific challenges of Turkish as an agglutinative and morphologically rich language.
1. Benchmark Motivation and Design Principles
SentiTurca, as a dataset, was created to fill the notable gap in large-scale, linguistically native Turkish sentiment benchmarks—essential for standardized evaluation of models such as transformers, LLMs, and classical NLP systems. Key design motivations include:
- Linguistic Naturalness: All corpora are composed of original Turkish texts (not translations) sourced from social media, e-commerce reviews, and crowdsourced platforms (e.g., Ekşi Sözlük).
- Domain Diversity and Task Complexity: Covers three distinct domains—movie reviews, customer e-commerce reviews, and free-form hate/offensive language—presenting a wide range of linguistic structures and sentiment phenomena.
- Scalability and Reproducibility: Employs systematic semi-automated labeling and filtering to balance annotation quality and dataset size.
- Annotation Quality Assurance: Implements rigorous guidelines, multiple annotator protocols, and formal agreement metrics for high-reliability ground truth (Altinok, 26 Dec 2025).
Separately, the SentiTurca lexicon/toolkit emerged to provide a detailed, corpus- and domain-adaptable polarity resource for Turkish, supporting methods that exploit the language's agglutination and subtle morphemic sentiment cues (Aydin, 29 Nov 2025).
2. Dataset Construction Methodology
SentiTurca as a benchmark consists of three constituent corpora, each with a domain-specific construction protocol:
- Movie Reviews: Crawled from Sinefil.com (≈ 40,000) and Beyazperde.com (≈ 38,000) with 0–10 star ratings. Star-to-sentiment mapping: negative (1–4★), positive (7–10★); 6★ instances omitted as ambiguous.
- Customer Reviews: Approximately 103,000 entries from leading e-commerce platforms (Hepsiburada, Trendyol), labeled by direct mapping from 1–5 star ratings and filtered for rating-text contradictions via clustering and pattern matching.
- Hate/Offensive Language: 52,000+ Ekşi Sözlük entries across 13 categories (e.g., misogyny, refugees, ethnic groups), using topic headlines as weak labels, with main sentiment annotation performed by native linguists.
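The star-to-sentiment mapping for movie reviews can be sketched as follows. This is a minimal illustration, not the authors' code: the function names and the toy reviews are hypothetical, and dropping every rating outside the two stated ranges (not only 6★) is an assumption of this sketch.

```python
from typing import List, Optional, Tuple


def star_to_sentiment(stars: int) -> Optional[str]:
    """Map a movie-review star rating to a binary sentiment label.

    Ranges follow the text: 1-4 -> negative, 7-10 -> positive.
    Any other rating (including 6, described as ambiguous) returns None;
    treating all remaining ratings as unlabeled is an assumption.
    """
    if 1 <= stars <= 4:
        return "negative"
    if 7 <= stars <= 10:
        return "positive"
    return None


def label_reviews(reviews: List[Tuple[str, int]]) -> List[Tuple[str, str]]:
    """Keep only the reviews whose rating maps to a sentiment label."""
    labeled = []
    for text, stars in reviews:
        label = star_to_sentiment(stars)
        if label is not None:
            labeled.append((text, label))
    return labeled


# Toy (text, stars) pairs invented for illustration.
sample = [("Harika bir film.", 9), ("Berbat.", 2), ("Fena değil.", 6)]
labeled_sample = label_reviews(sample)
```

Returning `None` rather than a third label keeps the filtering step explicit, so omitted reviews never reach the training split by accident.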
For movie and customer reviews, spot checks ensured rating-label alignment and flagged mismatches. The hate/offensive language portion underwent two annotation rounds: the first (ICC ≈ 0.612) was discarded due to low agreement; after guidelines revision, triple-annotation yielded ICC = 0.912, indicating excellent consistency (Altinok, 26 Dec 2025).
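To make the agreement figures concrete, the sketch below computes one common intraclass correlation variant, ICC(2,1) (two-way random effects, absolute agreement, single rater). The source does not state which ICC variant was used, so this choice is an assumption, as are the numeric encoding of labels and the toy ratings.

```python
def icc_2_1(ratings):
    """ICC(2,1) for a table of numeric ratings (rows = items, columns = raters).

    Assumes every item was rated by every rater and that labels have been
    encoded as numbers; both are assumptions of this sketch.
    """
    n = len(ratings)        # number of items
    k = len(ratings[0])     # number of raters
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    # Two-way ANOVA decomposition of the total sum of squares.
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_err = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_err = ss_err / ((n - 1) * (k - 1))

    denom = ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n
    return (ms_rows - ms_err) / denom


# Toy ratings: 4 items scored by 3 annotators (values invented).
toy_ratings = [[0, 0, 1], [2, 2, 2], [3, 2, 3], [1, 1, 1]]
toy_icc = icc_2_1(toy_ratings)
```

Values near 1 indicate that most of the variance comes from genuine differences between items rather than from annotator disagreement.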
3. Data Structure and Label Schema
The following table summarizes the SentiTurca splits and metrics:
| Subset | Train | Dev | Test | Classes | Recommended Metric |
|---|---|---|---|---|---|
| Movie reviews | 60,000 | 8,900 | 8,900 | 2-class (± sentiment) | Accuracy / F₁ |
| Customer reviews | 73,000 | 15,000 | 15,000 | 5-star (ordinal) | Accuracy / Macro-F₁ |
| Turkish Hate Map | 42,000 | 5,000 | 5,000 | 4-class (hate, offend…) | Balanced Accuracy / Macro-F₁ |
Class distributions are approximately balanced for movie reviews; customer reviews are skewed toward positive (4–5★ ≈ 60%); hate/offensive entries are dominated by neutral/offensive, with highly hateful content especially in the “refugees” category. The hate/offensive class schema comprises “hate,” “offensive,” “neutral,” and “civilized,” learned from crowdsourced annotation (Altinok, 26 Dec 2025).
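Because the customer-review and hate/offensive subsets are imbalanced, the recommended metrics weight every class equally. The following dependency-free sketch shows how balanced accuracy (mean per-class recall) and macro-F₁ are computed; the function names and toy labels are illustrative and not part of any SentiTurca release.

```python
from collections import Counter


def _counts(y_true, y_pred):
    """Per-class true-positive, false-positive, and false-negative counts."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1
            fn[t] += 1
    return tp, fp, fn


def balanced_accuracy(y_true, y_pred):
    """Mean recall over the classes that occur in y_true."""
    tp, _, fn = _counts(y_true, y_pred)
    classes = sorted(set(y_true))
    return sum(tp[c] / (tp[c] + fn[c]) for c in classes) / len(classes)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all observed classes."""
    tp, fp, fn = _counts(y_true, y_pred)
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in classes:
        denom = 2 * tp[c] + fp[c] + fn[c]
        scores.append(2 * tp[c] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy labels invented for illustration.
y_true = ["hate", "hate", "neutral", "neutral", "neutral", "offensive"]
y_pred = ["hate", "neutral", "neutral", "neutral", "neutral", "offensive"]
```

On skewed data, plain accuracy can look strong even when a minority class such as "hate" is mostly missed; both metrics above penalize that case.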
4. SentiTurca Lexicon and Toolkit
The SentiTurca polarity lexicon was constructed through three complementary approaches:
- Unsupervised Polarity Scoring: Computes a polarity score for each word using pointwise mutual information (PMI) over co-occurrence hits with antonym seed sets. The sign of the score determines whether the word is treated as positive or negative.
- Semi-supervised Domain-Specific Label Propagation: For in-domain vocabulary, random-walk propagation over a PMI-induced graph, initialized with seed polarities, produces domain-specific polarity scores; factorial weighting favors early propagation paths.
- Fine-grained Morphological Analysis: Employs Turkish morphological analyzers (e.g., Zemberek) to score suffixes (e.g., “–cı,” “–sA,” “–cıUz”) and constructs partial surface forms by retaining discriminative morpheme sequences. Suffix polarities are ranked by absolute value, with empirically optimized thresholds per domain (e.g., 90% for movie, 50% for Twitter) (Aydin, 29 Nov 2025).
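The unsupervised scoring step can be illustrated with a Turney-style semantic-orientation formula, in which a word's score is the difference between its PMI with a positive seed and its PMI with a negative seed. Whether the SentiTurca lexicon uses exactly this formulation is not confirmed by the text; the function name, the smoothing constant, and the hit counts are assumptions.

```python
import math


def so_pmi(hits_with_pos, hits_with_neg, hits_pos, hits_neg, smoothing=0.01):
    """Semantic orientation from co-occurrence hit counts.

    Equivalent to PMI(word, pos_seed) - PMI(word, neg_seed), which reduces to
    log2( hits(word, pos) * hits(neg) / (hits(word, neg) * hits(pos)) ).
    The small smoothing term (an assumption) avoids division by zero.
    """
    numerator = (hits_with_pos + smoothing) * hits_neg
    denominator = (hits_with_neg + smoothing) * hits_pos
    return math.log2(numerator / denominator)


# Hypothetical hit counts for a word against an antonym seed pair.
score = so_pmi(hits_with_pos=80, hits_with_neg=20, hits_pos=1000, hits_neg=1000)
orientation = "positive" if score > 0 else "negative"
```

A word that co-occurs equally often with both seeds scores zero, so the sign alone carries the orientation and the magnitude reflects how lopsided the co-occurrence is.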
Lexicon entries are provided as UTF-8 TSV with the following columns: token/morpheme, POS, source strategy (“unsup,” “semi,” or “morph”), polarity score, applicable domain, and confidence (defined separately for suffixes).
Illustrative entries:
| Token/Morpheme | POS | Strategy | Polarity Score | Domain | Confidence |
|---|---|---|---|---|---|
| harika | Adj | unsup | +1.84 | general | 1.00 |
| berbat | Adj | semi | –2.05 | movie | 0.97 |
| –sA | Suff | morph | +0.62 | general | 0.62 |
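A lexicon in this layout can be read with Python's standard `csv` module. The sketch below assumes a header-less file with the six columns in the order listed above; the column keys, the helper names, and the handling of typographic minus signs are assumptions rather than a documented SentiTurca API.

```python
import csv
import io

# Hypothetical column keys, in the order described in the text.
COLUMNS = ["token", "pos", "strategy", "polarity", "domain", "confidence"]

# The illustrative entries from the table, written as tab-separated rows.
SAMPLE_TSV = (
    "harika\tAdj\tunsup\t+1.84\tgeneral\t1.00\n"
    "berbat\tAdj\tsemi\t–2.05\tmovie\t0.97\n"
    "–sA\tSuff\tmorph\t+0.62\tgeneral\t0.62\n"
)


def parse_number(text):
    """Parse a float, normalizing en-dash and Unicode minus signs."""
    return float(text.replace("–", "-").replace("−", "-"))


def load_lexicon(handle):
    """Read header-less TSV rows into a list of dicts with numeric scores."""
    reader = csv.DictReader(
        handle, fieldnames=COLUMNS, delimiter="\t", quoting=csv.QUOTE_NONE
    )
    entries = []
    for row in reader:
        row["polarity"] = parse_number(row["polarity"])
        row["confidence"] = parse_number(row["confidence"])
        entries.append(row)
    return entries


lexicon = load_lexicon(io.StringIO(SAMPLE_TSV))
by_token = {entry["token"]: entry for entry in lexicon}
```

For a file on disk, the same function accepts a handle opened with `open(path, encoding="utf-8", newline="")`.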
5. Feature Engineering and Classification Protocols
SentiTurca’s toolkit exposes multiple feature families for document-level sentiment classification:
- Unsupervised, Semi-supervised, and Supervised Features: Word-level scores are aggregated at the document level via their minimum, maximum, and mean, and are combined using weighting coefficients chosen according to whether the signs of the unsupervised and supervised signals agree.
- Composite Document Features: Final document feature vectors concatenate the combined scores, the min/mean/max of these scores, and sparse tf-idf representations, and are passed to SVM, Naive Bayes (NB), kNN, and J48 decision-tree classifiers.
- Aspect-Based Sentiment Analysis (ABSA): Partitioning by subclause using dependency parsing, with aspect-level embeddings from RecNNs and global context from RNNs for nuanced polarity sensing (Aydin, 29 Nov 2025).
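The document-level aggregation described above can be sketched as follows. This is an illustration under stated assumptions: the two weighting coefficients are hypothetical placeholders (the source's actual coefficients are not recoverable from the text), the lexicons are toy dictionaries, and the tf-idf part of the feature vector is omitted.

```python
from statistics import mean


def aggregate(scores):
    """Return [min, mean, max] of word scores, or zeros if none matched."""
    if not scores:
        return [0.0, 0.0, 0.0]
    return [min(scores), mean(scores), max(scores)]


def combine(unsup, sup, agree_weight=0.5, disagree_weight=0.9):
    """Blend an unsupervised and a supervised word score.

    Hypothetical rule: when the signs agree, weight both equally; otherwise
    (including when either score is zero) lean on the supervised score.
    """
    weight = agree_weight if unsup * sup > 0 else disagree_weight
    return (1 - weight) * unsup + weight * sup


def document_features(tokens, unsup_lex, sup_lex):
    """Concatenate min/mean/max of unsupervised, supervised, and combined scores."""
    unsup = [unsup_lex[t] for t in tokens if t in unsup_lex]
    sup = [sup_lex[t] for t in tokens if t in sup_lex]
    combined = [
        combine(unsup_lex[t], sup_lex[t])
        for t in tokens
        if t in unsup_lex and t in sup_lex
    ]
    return aggregate(unsup) + aggregate(sup) + aggregate(combined)


# Toy lexicons and tokens invented for illustration.
unsup_lex = {"harika": 1.8, "berbat": -2.0}
sup_lex = {"harika": 1.0, "berbat": -1.0}
features = document_features(["harika", "film", "berbat"], unsup_lex, sup_lex)
```

The resulting nine-dimensional dense vector could then be concatenated with a sparse tf-idf vector before being passed to a classifier.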
6. Empirical Performance and Evaluation
SentiTurca lexicon-based approaches achieve high performance by leveraging multiple feature types and domain adaptation:
- On Turkish movie reviews (14,000 reviews, balanced), an SVM with three features yields 90.98% accuracy; ensemble methods combining all features reach 91.17%.
- Turkish Twitter (1,716 balanced): 80.59% accuracy, outperforming prior reported baselines by ~3%.
- English movie and Twitter datasets were also tested to assess cross-linguistic generalization.
- ABSA (SemEval-2014 Task 4): Ensemble RNN+RecNN models achieve 80.9% (restaurant) and 76.15% (laptop) accuracy, surpassing baseline RNN approaches (Aydin, 29 Nov 2025).
7. Access, Usage, and Community Impact
SentiTurca datasets and lexicon/toolkit are accessible via open repositories, with direct Python API examples for loading, preprocessing, feature extraction, and downstream training/evaluation pipelines. Integration with standard Turkish NLP libraries (e.g., Zemberek) and established ML suites (e.g., scikit-learn) enables immediate deployment in research and development pipelines.
In both its benchmark and lexicon forms, SentiTurca gives the Turkish NLP community standardized, high-reliability sentiment datasets and polarity resources tailored to Turkish morphology and vocabulary, supporting robust model evaluation and system development (Altinok, 26 Dec 2025, Aydin, 29 Nov 2025).