
Bangla ASTE: Aspect-Sentiment Extraction

Updated 3 December 2025
  • Bangla ASTE is a manually curated dataset featuring detailed annotations of aspect-sentiment-opinion triplets from Bangla e-commerce reviews.
  • It comprises 3,345 reviews annotated via a multi-stage expert consensus protocol, covering aspects such as battery life, camera quality, service, pricing, and packaging.
  • The dataset supports fine-grained sentiment analysis using a hybrid ensemble framework that achieved up to 89.9% accuracy, making it valuable for both research and real-world applications.

BanglaASTE is the first large-scale, manually annotated dataset for Aspect–Sentiment–Opinion (triplet) extraction from Bangla e-commerce reviews. It operationalizes Aspect-Based Sentiment Analysis (ABSA) for Bangla, addressing longstanding gaps in both resources and frameworks required for fine-grained sentiment mining in this low-resource language. The dataset and its associated extraction framework were introduced and evaluated in "BanglaASTE: A Novel Framework for Aspect-Sentiment-Opinion Extraction in Bangla E-commerce Reviews Using Ensemble Deep Learning" (Islam et al., 26 Nov 2025). BanglaASTE comprises 3,345 product reviews sampled across multiple Bangla e-commerce platforms, annotated using a multi-stage expert consensus protocol, and supports both research and production applications in sentiment analytics.

1. Data Acquisition and Preprocessing

BanglaASTE reviews were sourced from Daraz, Facebook, Rokomari, Shajgoj, and a minor “Other” category, targeting broad representation of consumer feedback across electronics, fashion, appliances, and cosmetics. Automated web-scraping—combining HTML parsing and official APIs—was complemented by manual sampling to enforce diversity and site compliance. Preprocessing included removal of non-Bangla content, stray punctuation, URLs, and excess whitespace. Spelling variations (e.g., synonyms and orthographic inconsistencies) were harmonized using a reference lexicon from the Bangla Academy. Tokenization relied on a native Bangla NLP module. Manual filtering eliminated spam, ultra-short, off-topic, and emoji-overloaded entries.
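The cleaning steps described above can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: it assumes "non-Bangla content" means characters outside the Bengali Unicode block (U+0980 to U+09FF), and the names `clean_review`, `keep_review`, and the `min_tokens` threshold are hypothetical. Lexicon-based spelling normalization and tokenization are not shown because the source does not name the tools used.

```python
import re

# Assumption: keep only the Bengali Unicode block, the danda (U+0964),
# zero-width joiners used in Bangla conjuncts, and whitespace.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
NON_BANGLA_RE = re.compile(r"[^\u0980-\u09FF\u0964\u200c\u200d\s]")
WHITESPACE_RE = re.compile(r"\s+")


def clean_review(text: str) -> str:
    """Strip URLs, non-Bangla characters, stray punctuation, and extra whitespace."""
    text = URL_RE.sub(" ", text)
    text = NON_BANGLA_RE.sub(" ", text)
    return WHITESPACE_RE.sub(" ", text).strip()


def keep_review(text: str, min_tokens: int = 3) -> bool:
    """Hypothetical length filter standing in for the manual removal of ultra-short entries."""
    return len(clean_review(text).split()) >= min_tokens
```

Spam and off-topic filtering were manual in the source, so no automated equivalent is suggested here.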

| Platform | Total Reviews |
|---|---|
| Daraz | 2,431 |
| Facebook | 467 |
| Rokomari | 273 |
| Shajgoj | 82 |
| Other | 92 |

The platform counts sum to the full 3,345 reviews. Daraz alone contributes about 73% of them (2,431), so the smaller sources add platform diversity rather than volume.
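As a quick arithmetic check on the platform counts, the short sketch below recomputes the corpus total and each platform's share. The counts are taken from the table; the variable names are illustrative.

```python
# Review counts per platform, copied from the table above.
platform_counts = {
    "Daraz": 2431,
    "Facebook": 467,
    "Rokomari": 273,
    "Shajgoj": 82,
    "Other": 92,
}

total_reviews = sum(platform_counts.values())

# Percentage share of each platform, rounded to one decimal place.
platform_shares = {
    name: round(100 * count / total_reviews, 1)
    for name, count in platform_counts.items()
}

for name, share in platform_shares.items():
    print(f"{name}: {share}%")
```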

2. Annotation Schema and Protocol

Reviews were annotated for one or more triplets, each comprising an aspect term, opinion expression, and sentiment polarity. Aspect terms denote explicit product features ("ব্যাটারি লাইফ", "ক্যামেরার কোয়ালিটি"); opinion expressions reflect subjective evaluations ("দীর্ঘস্থায়ী", "অস্পষ্ট"); sentiment polarity is categorized as Positive, Negative, or Neutral.

Annotation followed a two-stage majority-voting protocol: two postgraduate CS annotators independently labeled each review, with an NLP expert adjudicating conflicts. A third expert resolved residual disagreements, establishing consensus. Kappa agreement scores were not reported, so the consistency of the labels rests on this multi-tiered procedure rather than on a measured agreement statistic.
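Because no agreement statistic is reported, anyone reproducing the protocol may want to measure one. The sketch below computes Cohen's kappa for two annotators' sentiment labels before adjudication. It is a generic implementation of the standard formula, not something described in the source, and the function name is illustrative.

```python
from collections import Counter
from typing import Sequence


def cohen_kappa(labels_a: Sequence[str], labels_b: Sequence[str]) -> float:
    """Cohen's kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected chance agreement from each annotator's label distribution.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(
        count_a[label] * count_b[label] for label in count_a.keys() | count_b.keys()
    ) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)
```

Kappa here covers only the polarity label; agreement on aspect and opinion spans would need a span-matching criterion, which the source does not define.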

Triplets are stored in JSON (and CSV) as:

```json
{
  "id": 1023,
  "text": "এই ফোনের ব্যাটারি লাইফ খুব ভালো",
  "triplets": [
    {"aspect": "ব্যাটারি লাইফ", "opinion": "ভাল", "sentiment": "Positive"}
  ]
}
```
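A record in this shape could be loaded and checked along the following lines. This is a sketch assuming only the three top-level keys and three triplet keys shown above; the `Triplet` and `Review` classes and `parse_review` are hypothetical helpers, not part of any released tooling.

```python
import json
from dataclasses import dataclass

VALID_SENTIMENTS = {"Positive", "Negative", "Neutral"}


@dataclass(frozen=True)
class Triplet:
    aspect: str
    opinion: str
    sentiment: str


@dataclass
class Review:
    id: int
    text: str
    triplets: list[Triplet]


def parse_review(raw: str) -> Review:
    """Parse one JSON record and reject unknown sentiment labels."""
    obj = json.loads(raw)
    triplets = [Triplet(**t) for t in obj["triplets"]]
    for t in triplets:
        if t.sentiment not in VALID_SENTIMENTS:
            raise ValueError(f"unexpected sentiment label: {t.sentiment!r}")
    return Review(id=obj["id"], text=obj["text"], triplets=triplets)


# The example record from the text.
SAMPLE = (
    '{"id": 1023, "text": "এই ফোনের ব্যাটারি লাইফ খুব ভালো", '
    '"triplets": [{"aspect": "ব্যাটারি লাইফ", "opinion": "ভাল", "sentiment": "Positive"}]}'
)
review = parse_review(SAMPLE)
```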

A plausible implication is that these annotation guidelines establish a reproducible schema for subsequent Bangla ABSA datasets.

3. Dataset Structure, Statistics, and Splitting

BanglaASTE contains 2,966 annotated triplets covering five primary product aspects. Neutral triplets are underrepresented and thus omitted from breakdown statistics. The corpus does not employ fixed train/validation/test splits; instead, five-fold cross-validation maximizes robustness given corpus size constraints.
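A five-fold split over the corpus can be sketched in plain Python as below. The source does not say whether folds were stratified by sentiment or platform, nor which seed was used, so the shuffling and the `seed` parameter here are assumptions.

```python
import random
from typing import Iterator


def k_fold_indices(
    n_items: int, k: int = 5, seed: int = 0
) -> Iterator[tuple[list[int], list[int]]]:
    """Yield (train_indices, test_indices) for each of k folds."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    # Deal shuffled indices round-robin so fold sizes differ by at most one.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# With 3,345 reviews, each fold holds out 669 reviews and trains on 2,676.
splits = list(k_fold_indices(3345, k=5))
```

Splitting at the review level, as here, keeps all triplets from one review in the same fold.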

| Category | Total | Positive | Negative |
|---|---|---|---|
| Battery Life | 732 | 412 | 320 |
| Camera Quality | 598 | 301 | 297 |
| Service | 491 | 278 | 213 |
| Pricing | 824 | 502 | 322 |
| Packaging | 321 | 189 | 132 |
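The breakdown can be cross-checked with a few lines of arithmetic. The sketch below confirms that each row's total equals its positive plus negative counts, that the rows sum to 2,966, and reports the positive share per category. All numbers are copied from the table; the variable names are illustrative.

```python
# (total, positive, negative) per aspect category, copied from the table.
aspect_counts = {
    "Battery Life": (732, 412, 320),
    "Camera Quality": (598, 301, 297),
    "Service": (491, 278, 213),
    "Pricing": (824, 502, 322),
    "Packaging": (321, 189, 132),
}

# Rows where the total does not equal positive + negative (expected: none).
inconsistent_rows = [
    name for name, (total, pos, neg) in aspect_counts.items() if total != pos + neg
]

grand_total = sum(total for total, _, _ in aspect_counts.values())

positive_share = {
    name: round(100 * pos / total, 1)
    for name, (total, pos, _) in aspect_counts.items()
}
```

On these figures, Camera Quality is the most evenly split category (about 50% positive) and Pricing the most positive (about 61%).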

Data are available in both JSON and CSV formats; each record contains a review ID, review text, and a list of annotated triplets.
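To illustrate how the nested JSON records might map onto a flat CSV, the sketch below writes one row per triplet. The column layout (`id`, `text`, `aspect`, `opinion`, `sentiment`) is an assumption; the source does not describe the CSV schema.

```python
import csv
import io


def triplets_to_csv(records: list[dict]) -> str:
    """Flatten review records into CSV text with one row per triplet."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["id", "text", "aspect", "opinion", "sentiment"])
    for record in records:
        for triplet in record["triplets"]:
            writer.writerow(
                [
                    record["id"],
                    record["text"],
                    triplet["aspect"],
                    triplet["opinion"],
                    triplet["sentiment"],
                ]
            )
    return buffer.getvalue()
```

A review with several triplets repeats its ID and text on each row, so grouping by `id` recovers the original record.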

4. Frameworks, Metrics, and Baselines

Sentiment extraction and triplet identification were evaluated using standard classification metrics: accuracy, precision, recall, and F1-score, where

$$\text{precision} = \frac{TP}{TP + FP}$$

$$\text{recall} = \frac{TP}{TP + FN}$$

$$\text{F1} = 2 \times \frac{\text{precision} \times \text{recall}}{\text{precision} + \text{recall}}$$
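These definitions translate directly into code. The sketch below computes the three metrics from raw counts and, separately, F1 from a given precision and recall; the function names are illustrative.

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Return (precision, recall, F1) from true/false positive and false negative counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall, f1_from(precision, recall)


def f1_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


# Triplet-extraction figures reported later in this section: P = 86.1, R = 84.5.
triplet_f1 = round(f1_from(86.1, 84.5), 1)
print(triplet_f1)  # 85.3, matching the reported triplet-extraction F1
```

Note that the same formula applied to the ensemble's tabulated precision and recall (88.4%, 87.5%) gives about 87.9, not the listed 89.1%; the source does not state how that F1 was averaged.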

Baseline models included a BiLSTM and BanglaBERT. The principal extraction system is a hybrid ensemble that combines BanglaBERT contextual embeddings with XGBoost gradient boosting. Performance metrics for sentiment classification and triplet extraction are summarized below:

| Model | Accuracy | Precision | Recall | F1-Score |
|---|---|---|---|---|
| BiLSTM | 64.2% | 65.0% | 62.4% | 63.1% |
| BanglaBERT | 69.5% | 70.2% | 67.9% | 68.7% |
| Proposed (Ensemble) | 89.9% | 88.4% | 87.5% | 89.1% |

Triplet extraction via the ensemble achieved precision 86.1%, recall 84.5%, and F1 85.3%. In the table above, the ensemble outperforms both baselines by a wide margin: 20.4 accuracy points over BanglaBERT and 25.7 over the BiLSTM.
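For triplet extraction, a common convention in ASTE evaluation is exact match: a predicted triplet counts as correct only if its aspect, opinion, and sentiment all equal a gold triplet for the same review. The source does not state which matching criterion was used, so the sketch below is an assumption about how such scores are typically computed, with illustrative names throughout.

```python
Triplet = tuple[str, str, str]  # (aspect, opinion, sentiment)


def triplet_prf(
    gold: list[list[Triplet]], pred: list[list[Triplet]]
) -> tuple[float, float, float]:
    """Micro-averaged exact-match precision, recall, and F1 over aligned reviews."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must cover the same reviews")
    tp = fp = fn = 0
    for gold_triplets, pred_triplets in zip(gold, pred):
        gold_set, pred_set = set(gold_triplets), set(pred_triplets)
        tp += len(gold_set & pred_set)  # all three fields match
        fp += len(pred_set - gold_set)  # predicted but not in gold
        fn += len(gold_set - pred_set)  # in gold but missed
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1
```

Under exact match, a triplet with the right aspect and opinion but the wrong polarity counts as both a false positive and a false negative.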

5. Use Cases and Applications

BanglaASTE supports several applications:

  • Fine-grained e-commerce analytics, for example, tracking sentiment about “pricing” or “camera quality” over time.
  • Automated customer-feedback dashboards tailored for Bangla retailers.
  • Market-research tools that surface product strengths and weaknesses from consumer reviews.

By supporting aspect-targeted sentiment mining, BanglaASTE provides a scalable analytics backbone for Bangla e-commerce and market intelligence systems.
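The first use case, tracking sentiment about an aspect over time, might be prototyped as below. The dataset schema shown earlier has no timestamp, so the `month` field here is hypothetical, as are the function names; extracted triplets from a production feed would need to carry such a field.

```python
from collections import Counter, defaultdict


def sentiment_by_month_and_aspect(records: list[dict]) -> dict:
    """Count sentiment labels per (month, aspect) pair."""
    counts: dict = defaultdict(Counter)
    for record in records:
        for triplet in record["triplets"]:
            # "month" is a hypothetical field, not part of the published schema.
            counts[(record["month"], triplet["aspect"])][triplet["sentiment"]] += 1
    return dict(counts)


def positive_share(label_counts: Counter) -> float:
    """Fraction of triplets labeled Positive (0.0 if there are none)."""
    total = sum(label_counts.values())
    return label_counts["Positive"] / total if total else 0.0
```

Aspect terms are free text in the annotations, so a dashboard would likely also need a mapping from terms to categories such as pricing or camera quality; the source does not describe one.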

6. Challenges, Limitations, and Implications

The dataset and framework address principal obstacles in Bangla NLP: informal spellings, pervasive code-mixing, and resource sparsity. Notable limitations include the relatively modest corpus size compared to English ASTE datasets, difficulty in correctly annotating sarcasm and idiomatic expressions, and underrepresentation of neutral sentiment. This suggests further annotation rounds and expanded coverage would benefit generalized model training and evaluation.

By providing rigorously annotated triplets, BanglaASTE establishes foundational infrastructure for low-resource language ABSA and practical analysis pipelines in Bangla-speaking digital markets (Islam et al., 26 Nov 2025). Future research is likely to focus on scaling corpus size, enhancing idiomatic/sarcastic sentiment detection, and expanding aspect category representation.
