BanglishRev: Bengali-English Reviews
- BanglishRev is a comprehensive e-commerce review dataset targeting Bengali speakers, containing 1.75M reviews in Bengali, English, Banglish, and code-mixed texts.
- The dataset includes detailed metadata such as user IDs, star ratings, product categories, timestamps, and image URLs, enabling advanced analyses like sentiment and recommendation studies.
- Data was collected with automated scraping, and an optional normalization pipeline handles each language type; a BanglishBERT model fine-tuned on the dataset reaches 94% accuracy on an external sentiment benchmark.
BanglishRev is the largest publicly available e-commerce product review dataset targeting the Bengali-speaking market, comprising 1,747,043 written reviews sourced from 3,239,811 total rating events across 128,543 distinct products. It encompasses reviews written in Bengali script, English, mixed Bangla-English, and Banglish (Bengali words romanized in Latin script). Each review and product carries rich metadata, supporting tasks in sentiment analysis, natural language processing, recommendation systems, and multimodal analytics (Shamael et al., 17 Dec 2024).
1. Corpus Composition and Annotation Schema
BanglishRev aggregates review data from major online platforms and features sophisticated record-level metadata. Each review entry includes a randomized user ID, product ID, integer star rating (1–5), review text, post date, purchase date, counts of likes/dislikes, seller replies (if any), and a list of image URLs. At the product level, each JSON record contains categorical descriptors (root-category, parent-category, sub-category), average rating (float), a star distribution vector (five integers for 1–5 star counts), and total review counts.
Field encodings are standardized: date fields use ISO 8601 format, textual fields are UTF-8, numeric types are strictly defined, and personally identifiable information has been replaced with randomly generated integers. Three main files are available: category_links.csv, products.json, and reviews.json, each with a clearly specified schema.
| File | Key Fields (example) | Data Type/Format |
|---|---|---|
| category_links.csv | root_category, parent_category, url | string |
| products.json | product_id, average_rating, rating_counts | int, float, object |
| reviews.json | review_id, rating, review_text, image_urls | int, string, list[string] |
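As a rough illustration of how the product-level fields relate to one another, the sketch below recomputes an average rating from a five-element star distribution. The layout of `rating_counts` (a list ordered from 1-star to 5-star counts) and the sample record are assumptions made for this example; the actual structure in products.json may differ.

```python
def average_from_counts(rating_counts):
    """Mean star rating from counts ordered [1-star, ..., 5-star]."""
    total = sum(rating_counts)
    if total == 0:
        return None  # product has no ratings yet
    weighted = sum(stars * n for stars, n in enumerate(rating_counts, start=1))
    return weighted / total


# Hypothetical product record, not taken from the dataset.
product = {"product_id": 101, "average_rating": 4.5, "rating_counts": [1, 0, 1, 2, 12]}

recomputed = average_from_counts(product["rating_counts"])
# A simple consistency check between the stored and recomputed averages.
is_consistent = abs(recomputed - product["average_rating"]) < 0.01
```

A check like this can flag product records whose stored average and star distribution were scraped at different times.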
2. Data Collection and Preprocessing
Data extraction proceeded in three automated scraping stages using Selenium and BeautifulSoup. First, the hierarchical category tree of the Daraz Bangladesh site was collected and stored. Second, the paginated product listings of each sub-category were crawled to gather product URLs and ratings. Finally, the review pages of each product were iterated to collect individual ratings and review metadata.
The review texts in the public release are not normalized. For downstream tasks, an optional normalization pipeline is demonstrated: emojis and non-terminal punctuation are stripped; each review is assigned a language type using a regular expression for Bangla script and NLTK’s English lexicon; Banglish tokens are converted to phonetic Bangla with avro-py; and extra whitespace is collapsed. This pipeline separates the review types from one another and supports linguistically aware preprocessing.
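The normalization steps described above can be sketched roughly as follows. This is not the authors' implementation: the classification rule, the tiny `ENGLISH_WORDS` set (a stand-in for NLTK's lexicon), and the pluggable `transliterate` argument (where a romanized-Bangla converter such as avro-py would be supplied) are all assumptions chosen to keep the example self-contained.

```python
import re
import unicodedata

BANGLA_CHAR = re.compile(r"[\u0980-\u09FF]")  # Bengali Unicode block
TERMINAL = ".?!\u0964"                         # sentence enders, including the danda
# Tiny stand-in for an English lexicon (assumption; the text mentions NLTK's word list).
ENGLISH_WORDS = {"very", "good", "nice", "product", "quality", "bad", "not", "the", "is"}


def clean(text):
    """Replace symbols (incl. most emoji) and non-terminal punctuation with spaces."""
    out = []
    for ch in text:
        cat = unicodedata.category(ch)
        # Category S* also covers currency and math signs, which this sketch drops too.
        is_symbol = cat.startswith("S") or ch == "\uFE0F"  # FE0F: emoji variation selector
        is_dropped_punct = cat.startswith("P") and ch not in TERMINAL
        out.append(" " if is_symbol or is_dropped_punct else ch)
    return re.sub(r"\s+", " ", "".join(out)).strip()  # collapse extra whitespace


def classify(text):
    """Assumed rule: script decides first, then the English lexicon."""
    tokens = [t.strip(TERMINAL) for t in clean(text).lower().split()]
    bangla = [t for t in tokens if BANGLA_CHAR.search(t)]
    latin = [t for t in tokens if t.isascii() and t.isalpha()]
    if bangla and latin:
        return "code-mixed"
    if bangla:
        return "bangla"
    if not latin:
        return "unknown"
    if all(t in ENGLISH_WORDS for t in latin):
        return "english"
    return "banglish"


def normalize(text, transliterate=None):
    """Return (language type, cleaned text); transliterate Banglish if a converter is given."""
    kind, cleaned = classify(text), clean(text)
    if kind == "banglish" and transliterate is not None:
        cleaned = transliterate(cleaned)
    return kind, cleaned
```

Passing the converter in as a callable keeps the sketch runnable without extra dependencies; in practice the English check would use a full lexicon, since a nine-word set misclassifies most English reviews as Banglish.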
3. Language Composition and Statistical Profile
Applying the text-type classifier to the 1.75M reviews yields the following breakdown:
- 437,470 reviews (25.04%) in Bangla script
- 546,363 reviews (31.27%) in English
- 103,436 reviews (5.92%) code-mixed Bangla-English
- 659,774 reviews (37.77%) Banglish
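The percentages in this breakdown follow directly from the counts. The short check below recomputes them and confirms that the four categories add up to the 1,747,043 written reviews reported for the corpus; the dictionary keys are just labels chosen for this example.

```python
counts = {
    "Bangla script": 437_470,
    "English": 546_363,
    "Code-mixed Bangla-English": 103_436,
    "Banglish": 659_774,
}

total = sum(counts.values())
# Share of each language type, in percent, rounded to two decimals.
shares = {name: round(100 * n / total, 2) for name, n in counts.items()}

for name, pct in shares.items():
    print(f"{name}: {counts[name]:,} reviews ({pct}%)")
```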
The ratings skew strongly positive: over 78% of all reviews are five-star, four-star reviews are the next most frequent, and one- and two-star reviews are rare. Review volume also varies by sector, with "Health & Beauty" and "Groceries" receiving significantly more reviews than "Automotives" and "Home Appliances." Analysis of review length confirms that many reviews are terse, often only two words long (e.g., “Very good”, “Nice product”). Word clouds generated for the Bangla and English subsets show “খুব ভালো” and “good” as the predominant expressions.
4. Sentiment Analysis and Model Validation
Experimentation with BanglishRev for sentiment analysis tasks demonstrates its utility for model training and validation. The binary sentiment task defines ratings above three as positive and three or below as negative, with alternative thresholds also explored. The BanglishBERT model, a BERT-style encoder leveraging ELECTRA’s replaced-token-detection objective and a 256-token input limit, is fine-tuned on BanglishRev using Adam (learning rate ), batch size 128, for three epochs (NVIDIA A100, 40 GB VRAM).
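The rating-based labeling rule can be written as a one-line mapping. The sketch below is illustrative only; the function name and the `positive_above` parameter are hypothetical, with the parameter included to reflect that alternative thresholds were also explored.

```python
def sentiment_label(rating, positive_above=3):
    """Map a 1-5 star rating to a binary label: 1 = positive, 0 = negative."""
    if not 1 <= rating <= 5:
        raise ValueError(f"rating must be between 1 and 5, got {rating}")
    return 1 if rating > positive_above else 0


# Default rule from the text: ratings above three are positive.
labels = [sentiment_label(r) for r in range(1, 6)]

# An alternative, stricter threshold: only five-star reviews count as positive.
strict_labels = [sentiment_label(r, positive_above=4) for r in range(1, 6)]
```

Because most reviews are five-star, the choice of threshold changes the class balance considerably, which is one reason to report a weighted metric alongside accuracy.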
On an external, manually annotated 78k-review dataset (Rashid et al. 2024) with an 80/20 split, the fine-tuned BanglishBERT achieved:
- Accuracy: 0.94
- Weighted F1: 0.94
In comparison, a baseline BERT, trained only on the smaller manual set, yielded slightly lower metrics (Accuracy 0.93, Weighted F1 ≈ 0.93). This suggests that rating-based labeling on large noisy corpora can rival manually annotated datasets in sentiment modeling efficacy.
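For readers unfamiliar with the two reported metrics, the following sketch computes accuracy and support-weighted F1 in plain Python. It follows the standard definition, in which each class's F1 score is weighted by its number of true instances; the toy labels are invented for illustration and are unrelated to the reported results.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def weighted_f1(y_true, y_pred):
    """Per-class F1 averaged with weights equal to each class's support."""
    support = Counter(y_true)
    pairs = list(zip(y_true, y_pred))
    score = 0.0
    for label, n in support.items():
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / n
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        score += f1 * n
    return score / len(y_true)


# Toy example with an imbalanced label set (1 = positive, 0 = negative).
y_true = [1, 1, 1, 1, 0, 0]
y_pred = [1, 1, 1, 0, 0, 1]
acc = accuracy(y_true, y_pred)
wf1 = weighted_f1(y_true, y_pred)
```

On a heavily positive corpus, a weighted F1 close to the accuracy, as reported here, indicates that performance on the minority negative class is not dragging the average down.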
5. Access, Licensing, and Usage
BanglishRev is hosted on Hugging Face under a non-commercial, academic-use license. The dataset is readily accessible using:
```python
from datasets import load_dataset

dataset = load_dataset("BanglishRev/bangla-english-and-code-mixed-ecommerce-review-dataset")
```
Researchers are directed to cite Shamael, M. N., Nawshin, S., Shatabda, S., & Islam, S. (2024) when utilizing the corpus (Shamael et al., 17 Dec 2024).
6. Research Applications and Future Work
BanglishRev’s multimodal data structure opens diverse research opportunities. Potential avenues include spam-review detection (e.g., detecting suspicious bursts of identical five-star submissions), aspect-based opinion mining, cross-cultural or cross-regional analysis of image-based reviews, customer satisfaction modeling utilizing image similarity, and user-behavior analytics stratified by product category.
The heavy bias toward positive feedback provides a test environment for imbalance learning and anomaly detection methodologies. The dataset’s scale and annotation diversity also facilitate studies in fine-grained sentiment differentiation, emotion recognition, aspect labeling, and their integration in NLP, recommender systems, and market analysis pipelines. Curating high-quality annotations for these tasks would further augment BanglishRev’s research utility.