AHaSIS Shared Task: Arabic Sentiment Analysis

Updated 26 November 2025
  • AHaSIS Shared Task is an evaluation campaign for three-way sentiment classification of Arabic hotel reviews, emphasizing dialectal variation and hospitality-domain challenges.
  • The task uses a dataset manually translated from Modern Standard Arabic into Saudi Arabic and Moroccan Darija, with dialectal fidelity and sentiment preservation verified by native speakers.
  • Top-performing systems, employing transformer ensembles and data-efficient methods such as SetFit, achieved micro-F1 scores up to 0.81, establishing reference results for dialectal Arabic sentiment analysis.

The AHaSIS Shared Task is an international evaluation campaign addressing three-way sentiment classification (positive, neutral, negative) of hotel reviews written in Arabic dialects, specifically Saudi Arabic and Moroccan Darija. Designed to test state-of-the-art sentiment analysis methods under low-resource and dialect-rich scenarios, AHaSIS centers on hospitality-domain language and emphasizes dialectal robustness, scalability to real-world feedback, and reproducible benchmarking. The challenge features a manually curated dataset derived from Modern Standard Arabic (MSA) reviews, translated and validated into two spoken dialects, and draws participation from over forty research teams. Evaluation is conducted using standard metrics, with micro-F1 as the primary criterion, and a leaderboard capturing diverse NLP strategies.

1. Task Definition and Evaluation Protocol

The central goal is three-way sentiment detection on Arabic hotel reviews in Saudi and Darija dialects, requiring models to classify review sentences as positive, neutral, or negative. The key scientific objectives are to quantify the effectiveness of traditional versus neural approaches (including fine-tuned transformers and few-shot LLM prompting), assess model resilience to dialectal variation, and support systematic progress in dialect-aware Arabic sentiment analysis (Alharbi et al., 17 Nov 2025).

The metrics employed are standard and defined as:

  • Precision: $\frac{TP}{TP + FP}$
  • Recall: $\frac{TP}{TP + FN}$
  • F1-score: $2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$

where $TP$, $FP$, and $FN$ denote true positives, false positives, and false negatives, respectively. Task ranking is based on the micro-averaged F1-score across all sentiment classes.
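
As a concrete reference, the per-class metrics and the micro-averaged ranking criterion can be computed with scikit-learn; the label arrays below are invented toy examples, not task data.

```python
# Minimal sketch of the official metrics using scikit-learn; the y_true and
# y_pred arrays are illustrative placeholders, not AHaSIS data.
from sklearn.metrics import precision_recall_fscore_support, f1_score

y_true = ["positive", "neutral", "negative", "positive", "neutral"]
y_pred = ["positive", "negative", "negative", "positive", "neutral"]

# Per-class precision, recall, and F1, matching the definitions above.
prec, rec, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, labels=["positive", "neutral", "negative"], zero_division=0
)

# Ranking criterion: micro-averaged F1 across all three sentiment classes.
micro_f1 = f1_score(y_true, y_pred, average="micro")
print(f"micro-F1 = {micro_f1:.3f}")
```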

2. Dataset Construction and Dialectal Validation

The dataset foundation is the SemEval-2016 ABSA-Hotels corpus in MSA, which underwent rigorous manual translation and validation into both Saudi and Moroccan Darija dialects. Each translation was performed by bilingual annotators, with dialectal accuracy and sentiment preservation verified by two independent native speakers per entry; disagreements were adjudicated jointly. Annotation guidelines instructed translators to retain original sentiment, idiomatic usage, and hospitality-domain vocabulary. Neutral labels were reserved for sentences without unambiguous polarity. A final consistency pass enforced label and dialectal fidelity.

Dataset statistics:

Split   Entries   Dialects        Sentiment Labels
Train   860       Saudi, Darija   Positive, Neutral, Negative
Test    216       Saudi, Darija   N/A (to be predicted)

The training set contains 860 balanced examples (≈286 per class per dialect), with 216 held out for evaluation. The resource is validated for cross-dialectal and pragmatic authenticity (Alharbi et al., 17 Nov 2025, Zarnoufi, 19 Nov 2025).

3. Baselines and Official Leaderboard

No official baseline was mandated; the organizers provided a TF-IDF + linear SVM reference system (micro-F1 ≈ 0.68) employing orthographic normalization, tokenization, English-token removal, unigram/bigram TF-IDF features, and a class-weighted SVM. Participants were encouraged to surpass this reference with more advanced modelling.
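
The reference pipeline can be approximated in a few lines of scikit-learn. This is a hedged sketch, not the organizers' implementation: the normalization rules below are common Arabic-preprocessing assumptions, and `train_texts`/`train_labels` are placeholders for the task data.

```python
# Hedged sketch of a TF-IDF + class-weighted linear SVM baseline in the spirit
# of the organizers' reference system; preprocessing details are assumptions.
import re
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

def normalize_arabic(text: str) -> str:
    """Illustrative orthographic normalization and English-token removal."""
    text = re.sub(r"[A-Za-z]+", " ", text)                  # drop English tokens
    text = re.sub(r"[\u0622\u0623\u0625]", "\u0627", text)  # unify alef variants
    text = re.sub(r"\u0649", "\u064A", text)                # alef maqsura -> ya
    return re.sub(r"\s+", " ", text).strip()

baseline = Pipeline([
    ("tfidf", TfidfVectorizer(preprocessor=normalize_arabic,
                              ngram_range=(1, 2))),         # unigrams + bigrams
    ("svm", LinearSVC(class_weight="balanced")),            # class-weighted SVM
])

# Usage (placeholder variables):
# baseline.fit(train_texts, train_labels)
# predictions = baseline.predict(test_texts)
```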

Twelve teams ultimately submitted final systems. The leaderboard of top results is summarized below:

Rank   Team Name          Micro-F1
1      Hend (iWAN-NLP)    0.810
2      ISHFMG_TUN         0.7916
3      LBY                0.790
12     MAPROC             0.730

The best result was 0.81 micro-F1, and the 12th-ranked system achieved 0.73, outperforming the provided baseline. Over forty teams registered for the evaluation phase (Alharbi et al., 17 Nov 2025, Zarnoufi, 19 Nov 2025).

4. Characteristics of Submitted Systems

The top-performing systems illustrate several architectural and methodological trends:

  • Hend (iWAN-NLP), F1 = 0.810: Ensemble of MARBERTv2, SaudiBERT, and DarijaBERT, with stratified five-fold cross-validation per base model and logit averaging across models and folds (a hedged sketch of the averaging step follows this list). Training employed label smoothing ($\epsilon = 0.1$), mixed precision, early stopping, and gradual learning-rate warmup; inference used ensemble averaging.
  • ISHFMG_TUN, F1 = 0.7916: Fine-tuned AraBERTv02 with frozen lower transformer layers, class-weighted cross-entropy, optimized dropout (0.3), and a cyclic learning rate; a single model with no ensembling or external augmentation.
  • LBY, F1 = 0.790: Six Arabic transformer variants were tested, and MARBERTv2 was selected by hyperparameter sweep. Per-dialect and combined-dialect fine-tuning with tuned batch size (16–32) and learning rate ($2 \times 10^{-5}$ to $5 \times 10^{-5}$); single model.
  • LahjaVision (Rank 4): QARiB backbone, dialect embeddings, focal loss, discriminative fine-tuning for dialect differentiation.
  • AraNLP (Rank 5): Fusion of AraELECTRA embeddings and TF-IDF features before classification.
  • MucAI (Rank 6): Adaptive few-shot prompting with GPT-4o, utilizing kNN retrieval from AraBERT embeddings with chain-of-thought rationales.
  • MAPROC (Rank 12): SetFit (Sentence Transformer Fine-tuning) framework with Arabic-SBERT-100K, leveraging only 64 examples per class in contrastive fine-tuning, followed by a logistic regression head (Zarnoufi, 19 Nov 2025).
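
The logit-averaging scheme referenced above can be sketched as follows. The checkpoint paths are hypothetical placeholders for per-fold fine-tuned models, and all orchestration details are assumptions rather than the iWAN-NLP implementation.

```python
# Hedged sketch of cross-model, cross-fold logit averaging; checkpoint paths
# are hypothetical placeholders, not the actual fine-tuned models.
import numpy as np
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# One entry per (base model, CV fold); truncated for brevity.
FOLD_CHECKPOINTS = [
    "ckpts/marbertv2_fold0", "ckpts/marbertv2_fold1",
    "ckpts/saudibert_fold0", "ckpts/darijabert_fold0",
]

@torch.no_grad()
def fold_logits(ckpt: str, texts: list[str]) -> np.ndarray:
    """Class logits from one fine-tuned checkpoint."""
    tok = AutoTokenizer.from_pretrained(ckpt)
    model = AutoModelForSequenceClassification.from_pretrained(ckpt).eval()
    enc = tok(texts, padding=True, truncation=True, return_tensors="pt")
    return model(**enc).logits.cpu().numpy()

def ensemble_predict(texts: list[str]) -> np.ndarray:
    # Average logits over every model/fold checkpoint, then take the argmax.
    stacked = np.stack([fold_logits(c, texts) for c in FOLD_CHECKPOINTS])
    return stacked.mean(axis=0).argmax(axis=-1)
```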

Data augmentation (e.g., paraphrasing, probabilistic lexical perturbation, pattern generation) appeared in MARSAD/MARSAD AI systems. Others integrated dialect-aware embeddings, focal losses, or explicit kNN-retrieved rationale chains. Few-shot and data-efficient methods, as shown in MAPROC, achieved strong results (0.73 F1 with 192 labels) (Alharbi et al., 17 Nov 2025, Zarnoufi, 19 Nov 2025).
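
To make the augmentation idea concrete, the toy sketch below applies the kind of probabilistic lexical perturbation mentioned above: tokens are randomly dropped or swapped with synonyms from a small dialect lexicon. The rates and the synonym map are illustrative assumptions, not any team's actual recipe.

```python
# Toy probabilistic lexical perturbation for augmentation; rates and the
# synonym lexicon are illustrative assumptions only.
import random

SYNONYMS = {"ممتاز": ["رائع", "زوين"]}  # tiny illustrative dialect synonym map

def perturb(tokens: list[str], p_drop: float = 0.1, p_swap: float = 0.1,
            rng: random.Random | None = None) -> list[str]:
    rng = rng or random.Random()
    out = []
    for tok in tokens:
        r = rng.random()
        if r < p_drop:
            continue                                # randomly drop the token
        if r < p_drop + p_swap and tok in SYNONYMS:
            tok = rng.choice(SYNONYMS[tok])         # swap in a dialect synonym
        out.append(tok)
    return out
```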

5. Empirical Results and Error Patterns

Cross-dialect performance: Systems performed comparably on Saudi and Darija reviews, with minor F1 degradation on Darija (≈0.03), attributed to less pre-training coverage in foundation models. Ensembles incorporating both dialect-specific and generalist models (as in iWAN-NLP) reduced this gap to ≈0.01 F1, highlighting the utility of dialect-specialized pre-training.

Sentiment-class confusion: The majority of misclassification errors involved confusion between neutral and polar classes (≈15% of errors), a consequence of ambiguous "weak" polarity markers (e.g., "so so", "معقول"). Darija idioms with strong positive/negative connotations were commonly misclassified by models lacking dialect-specific resources.

Influential factors:

  • Data scarcity and augmentation: 860 training examples define a low-resource regime; data augmentation improved F1 by ≈0.05 in several cases.
  • Dialect variability: Systems exploiting dialect-aware embeddings or lexicons observed systematic improvements.
  • Model complexity versus efficiency: Compact, well-tuned single-model systems (ISHFMG_TUN) achieved near state-of-the-art with reduced computational requirements compared to ensembles or LLM prompting.

MAPROC's internal analysis noted particular difficulty with the neutral class in Darija, whose F1 (61.3%) was markedly lower than for the positive or negative classes (85%+), reflecting gaps in pre-training coverage of the Moroccan dialect and its idiomaticity (Zarnoufi, 19 Nov 2025).

6. Methodological Innovations: SetFit and Few-Shot Approaches

MAPROC's submission provides an explicit case study in sentence-transformer fine-tuning for low-resource sentiment analysis (Zarnoufi, 19 Nov 2025). The SetFit protocol operates in two stages:

  1. Contrastive fine-tuning: Arabic-SBERT-100K is fine-tuned using supervised contrastive loss on sentence pairs, where pairs from the same class serve as positives; other pairs are negatives. The loss function:

$$L_i = -\frac{1}{|P(i)|} \sum_{p \in P(i)} \log \frac{\exp(\mathrm{sim}(z_i, z_p)/\tau)}{\sum_{a \in A(i)} \exp(\mathrm{sim}(z_i, z_a)/\tau)}$$

where $z_i$ is the embedding of instance $i$, $P(i)$ and $A(i)$ are the sets of positive and of all other in-batch indices, respectively, and $\tau$ is the temperature.
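
This loss transcribes directly into PyTorch. In the sketch below, `z` holds sentence embeddings (L2-normalized so the dot product equals cosine similarity); the batching details are assumptions.

```python
# Direct transcription of the supervised contrastive loss L_i above.
import torch
import torch.nn.functional as F

def sup_con_loss(z: torch.Tensor, labels: torch.Tensor,
                 tau: float = 0.1) -> torch.Tensor:
    z = F.normalize(z, dim=1)              # cosine similarity via dot product
    sim = z @ z.T / tau                    # sim(z_i, z_a) / tau
    self_mask = torch.eye(len(z), dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(self_mask, float("-inf"))      # A(i) excludes i
    log_prob = sim - sim.logsumexp(dim=1, keepdim=True)  # log softmax over A(i)
    pos = (labels[:, None] == labels[None, :]) & ~self_mask  # P(i): same class
    # Mean log-probability over positives, negated and averaged over anchors.
    loss_i = -log_prob.masked_fill(~pos, 0.0).sum(1) / pos.sum(1).clamp(min=1)
    return loss_i.mean()

# Example: eight 768-dim embeddings with three sentiment labels.
loss = sup_con_loss(torch.randn(8, 768), torch.tensor([0, 1, 2, 0, 1, 2, 0, 1]))
```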

  2. Classification: Final representations are classified with a logistic regression head, trained using cross-entropy loss:

$$L_{\text{CE}} = -\sum_{c=1}^{C} y_c \log \hat{y}_c$$

With only 64 shots per class (192 total labels), this approach achieved 0.73 F1, surpassing the baseline and demonstrating data efficiency. Observed limitations included persistent neutral-class confusion and the lack of intensive hyperparameter tuning or domain-specific augmentation.
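
The two-stage recipe is packaged by the `setfit` library. The sketch below uses the older `SetFitTrainer` API (newer releases expose `Trainer`/`TrainingArguments` instead); the checkpoint identifier is a placeholder for Arabic-SBERT-100K, the toy three-example dataset stands in for the 64-shot-per-class sample, and the library's default pairwise cosine loss differs in detail from the supervised contrastive form above.

```python
# Hedged end-to-end SetFit sketch; checkpoint id and data are placeholders.
from datasets import Dataset
from sentence_transformers.losses import CosineSimilarityLoss
from setfit import SetFitModel, SetFitTrainer

# Toy stand-in for the 64-shot-per-class training sample (0=neg, 1=neu, 2=pos).
train_ds = Dataset.from_dict({
    "text": ["الفندق رائع", "الخدمة معقولة", "الغرفة ما عجبتنيش"],
    "label": [2, 1, 0],
})

model = SetFitModel.from_pretrained("path/to/Arabic-SBERT-100K")  # placeholder
trainer = SetFitTrainer(
    model=model,
    train_dataset=train_ds,
    loss_class=CosineSimilarityLoss,  # setfit's default pairwise loss
    batch_size=16,
    num_iterations=20,                # contrastive pairs generated per example
)
trainer.train()                        # stage 1 + logistic-regression head fit
preds = trainer.model.predict(["الموقع زوين بزاف"])  # Darija: "great location"
```

By default the SetFit head is exactly the logistic-regression classifier described above, so no extra head configuration is needed for this setup.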

7. Significance, Limitations, and Future Directions

AHaSIS demonstrated that strong three-way sentiment classification across Arabic dialects in the hospitality domain is possible under constrained training regimes (micro-F1 up to 0.81). Transformer ensembles with dialect-specific pre-training showed the highest effectiveness, while data-efficient methods, especially SetFit, provided competitive results with minimal labels.

Persistent challenges remain regarding the detection of subtle polarity and the neutral class, especially within underrepresented idiomatic expressions. For future evaluation cycles, recommendations include expanding to further dialects (Egyptian, Levantine, broader Maghrebi), incorporating aspect-based and emotion detection, developing larger annotated corpora with pragmatic markers, standardizing reproducible baselines, and exploring advanced techniques for cross-dialectal adaptation.

By establishing an openly available, validated resource and benchmark suite, and capturing a wide array of modern approaches, the AHaSIS shared task fosters continued research into dialect-sensitive, low-resource Arabic NLP (Alharbi et al., 17 Nov 2025, Zarnoufi, 19 Nov 2025).
