BanglaRiddleEval: Bangla Riddle Benchmark
- BanglaRiddleEval is a comprehensive benchmark evaluating LLM reasoning on traditional Bangla riddles via four structured tasks.
- It employs a robust dataset of 1,244 human-authored riddles curated to challenge metaphoric, idiomatic, and culturally embedded language understanding.
- The evaluation framework uses zero-shot, few-shot, and chain-of-thought prompting to reveal LLM limitations compared to human performance.
BanglaRiddleEval is a comprehensive benchmark designed to rigorously assess the reasoning capabilities of LLMs on traditional Bangla (Bengali) riddles. These riddles, characterized by figurative, culturally grounded, and often ambiguous language, serve as an incisive probe into the limitations of current LLMs in low-resource, non-Western NLP contexts. The benchmark comprises 1,244 human-authored riddles, systematically instantiated across four distinct evaluation tasks, and is supported by a fully documented, reproducible annotation and evaluation pipeline. BanglaRiddleEval represents the first large-scale, multi-faceted evaluation suite targeting Bangla riddle reasoning, with all resources available via open repository (Sayeedi et al., 23 Dec 2025).
1. Motivation and Benchmark Design
BanglaRiddleEval was motivated by observed deficiencies in LLMs’ ability to process figurative language, homophones, and culture-laden metaphors in low-resource settings. Existing multilingual benchmarks seldom account for the nuanced demands of regional oral and folk traditions, especially riddles, which demand more than pattern-matching—they require idiomatic, context-sensitive reasoning. To address this gap, BanglaRiddleEval introduces a battery of tasks around 1,244 folk-riddle items, offering novel, multi-perspective challenges: generative question answering, multiple-choice distractor robustness, semantic ambiguity resolution, and chain-of-thought explanation (Sayeedi et al., 23 Dec 2025).
2. Dataset Construction and Task Suite
Collection and Curation
Source riddles were transcribed from three canonical Bangla folk-riddle volumes, processed via high-resolution OCR, and cleaned to yield 1,244 unique (riddle, answer) pairs. Processing included deduplication, orthographic normalization, and correction of transcription artifacts. This approach ensures coverage of both classical and colloquial idioms, maximizing cultural and linguistic representativeness.
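The cleaning code itself is not reproduced in this summary; the following is a minimal Python sketch of the described steps (deduplication and orthographic normalization), with all function names and the sample data assumed for illustration rather than taken from the released pipeline.

```python
import unicodedata

def normalize_bangla(text: str) -> str:
    """Apply Unicode NFC normalization and collapse whitespace.

    A stand-in for the paper's orthographic normalization; the actual
    rules (e.g., OCR artifact correction) are not published here.
    """
    text = unicodedata.normalize("NFC", text)
    return " ".join(text.split())

def dedupe_pairs(pairs):
    """Drop exact duplicates after normalization, preserving order."""
    seen, unique = set(), []
    for riddle, answer in pairs:
        key = (normalize_bangla(riddle), normalize_bangla(answer))
        if key not in seen:
            seen.add(key)
            unique.append(key)
    return unique

# Hypothetical usage on raw OCR output (second entry differs only in spacing):
raw = [("মাথা নেই, পা নেই", "নদী"), ("মাথা  নেই, পা নেই", "নদী")]
print(len(dedupe_pairs(raw)))  # -> 1
```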
Task Instantiation
Each riddle is instantiated across four complementary tasks, yielding 4,976 total artifacts (1,244 riddles × 4 tasks); a minimal code sketch of this fan-out follows the list:
- Generative QA: Free-form answer generation for open-ended riddle solutions.
- MCQ: One correct answer plus three adversarial, semantically plausible distractors, designed to defeat surface-level pattern matching.
- Semantic Ambiguity: Identification of the intended sense of a “trigger” word, selected from four closely related candidates.
- Chain-of-Thought Explanation: LLMs produce a prescribed four-step reasoning trace for each riddle-answer pair: (1) Answer Identification → (2) Metaphor Explanation → (3) Connection to Answer → (4) Conclusion.
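A minimal sketch, assuming a simple record schema (all field names hypothetical), of how one (riddle, answer) pair fans out into the four artifacts described above:

```python
from dataclasses import dataclass

@dataclass
class Riddle:
    text: str
    answer: str

COT_STEPS = ["Answer Identification", "Metaphor Explanation",
             "Connection to Answer", "Conclusion"]

def instantiate_tasks(r: Riddle, distractors, trigger, senses):
    """Fan one riddle out into the four evaluation artifacts.

    `distractors`, `trigger`, and `senses` would come from the
    LLM-assisted annotation pipeline described in the next subsection.
    """
    return [
        {"task": "generative_qa", "riddle": r.text, "gold": r.answer},
        {"task": "mcq", "riddle": r.text,
         "options": [r.answer, *distractors], "gold": r.answer},
        {"task": "ambiguity", "riddle": r.text,
         "trigger": trigger, "candidates": senses},
        {"task": "cot_explanation", "riddle": r.text,
         "gold": r.answer, "required_steps": COT_STEPS},
    ]
```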
Annotation and Artifact Generation
Annotation leverages a high-capacity LLM (GPT-4o) orchestrated through an iterative, filter-and-refine loop. Distractors and ambiguity labels are LLM-generated but receive human spot-checks to eliminate inadvertent alternative answers. The pipeline maintains semantic faithfulness by enforcing structured output schemas, ensuring consistency across the dataset (Sayeedi et al., 23 Dec 2025).
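The exact orchestration code is not given here; below is a hedged sketch of a filter-and-refine loop of the kind described, where `generate_distractors` and `is_alternative_answer` are hypothetical stand-ins for an LLM call with a structured output schema and a validity check, respectively.

```python
def refine_distractors(riddle, answer, generate_distractors,
                       is_alternative_answer, max_rounds=5):
    """Iteratively regenerate distractors until none could pass as a
    correct answer. Both callables are hypothetical stand-ins for the
    paper's LLM-based generation and filtering steps."""
    for _ in range(max_rounds):
        distractors = generate_distractors(riddle, answer, n=3)
        # Filter: reject any distractor that is itself a valid answer.
        if not any(is_alternative_answer(riddle, d) for d in distractors):
            return distractors
    raise RuntimeError("No clean distractor set found; flag for human review")
```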
3. Evaluation Pipeline and Experimental Methodology
Model Suite
Evaluations cover both open- and closed-source LLMs:
- Open-source: GPT-OSS-20B, DeepSeek-R1-7B, DeepSeek-R1-14B, Qwen3-4B, Qwen3-8B, Qwen3-14B, Gemma3-4B, Gemma3-12B.
- Closed-source: Gemini-2.5-Flash.
Prompting Regimes
Three prompting strategies are investigated (illustrative templates follow the list):
- Zero-Shot: Minimal context, task description plus riddle.
- Few-Shot: Three in-context solved examples per task.
- Chain-of-Thought (CoT): “Think step by step...” prompts augmented with intermediate reasoning exemplars.
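Illustrative, not verbatim, prompt templates for the three regimes, sketched in Python; the actual Bangla prompt wording lives in the released repository.

```python
ZERO_SHOT = "Task: solve the Bangla riddle.\nRiddle: {riddle}\nAnswer:"

FEW_SHOT = (
    "Task: solve the Bangla riddle.\n"
    "{examples}\n"          # three solved (riddle, answer) pairs
    "Riddle: {riddle}\nAnswer:"
)

COT = (
    "Task: solve the Bangla riddle. Think step by step:\n"
    "1) identify the answer, 2) explain the metaphor,\n"
    "3) connect it to the answer, 4) conclude.\n"
    "{examples}\n"          # exemplars with intermediate reasoning
    "Riddle: {riddle}\n"
)

def build_prompt(template: str, riddle: str, examples: str = "") -> str:
    # str.format ignores unused keyword arguments, so one helper
    # serves all three templates.
    return template.format(riddle=riddle, examples=examples)
```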
Evaluation Metrics
All tasks employ mathematically explicit metrics:
- MCQ and Ambiguity: Accuracy, defined as $\text{Accuracy} = \frac{\text{number of correct predictions}}{\text{total number of items}}$.
- Generative QA: BERTScore F1 for semantic overlap; a separate LLM-as-Judge for correctness (binary).
- Explanation Quality: LLM-as-Judge scoring in $[0, 10]$, with reliability calibrated against Bangla-fluent human annotators (92–95% inter-annotator agreement).
This multi-metric design captures both surface-level and deep semantic performance, and the use of LLM-as-Judge aligns the evaluation scale with the complexity of Bangla riddle reasoning (Sayeedi et al., 23 Dec 2025).
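A minimal sketch of how the headline metrics could be computed; the `bert-score` package is a real library, while `llm_judge` is a hypothetical wrapper around the judge model, whose reliability would be calibrated against human annotators as described above.

```python
from bert_score import score  # pip install bert-score

def accuracy(preds, golds):
    """Accuracy = correct / total, used for MCQ and Ambiguity."""
    return sum(p == g for p, g in zip(preds, golds)) / len(golds)

def generative_qa_metrics(preds, golds, llm_judge):
    """BERTScore F1 for semantic overlap plus a binary LLM judge.

    `llm_judge(pred, gold) -> bool` is an assumed interface, not the
    paper's actual judge implementation.
    """
    _, _, f1 = score(preds, golds, lang="bn")  # multilingual backend for Bangla
    judged = [llm_judge(p, g) for p, g in zip(preds, golds)]
    return float(f1.mean()), sum(judged) / len(judged)
```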
4. Empirical Findings and Error Patterns
Aggregate Results
- Generative QA: BERTScore F1 in the 0.74–0.81 range; LLM-as-Judge accuracy is low (2–29%), with Gemini-2.5-Flash peaking at ∼29%.
- MCQ: Accuracy spans 24–56% (random: 25%). The best model (Gemini-2.5-Flash with CoT) achieves 56%, far below the human baseline (∼83%).
- Semantic Ambiguity: Accuracy 26–68%, best by Gemini-2.5-Flash in zero-shot (68%).
- Explanation: LLM-as-Judge scores from 0.9 (bottom-tier) to 8.7 (Gemini-2.5-Flash).
A consistent shortfall from human performance is observed, especially under adversarially constructed distractors and in resolving cultural or idiomatic ambiguity (Sayeedi et al., 23 Dec 2025).
Recurring Pitfalls
- Homophone traps: e.g., “lying-fox” interpreted as a fractional quantity rather than a count, with the misparse yielding answers like 2.5 instead of 2.
- Arithmetic metaphors: Riddles involving fractions or numerical manipulation are often misparsed.
- Cultural idioms: Models are frequently derailed by references to local tools or foods outside mainstream pretraining corpora.
- Script-specific challenges: In related BengaliFig experiments, grapheme-count constraints in riddles induce steep accuracy drop-offs, underscoring weaknesses in Bengali script sensitivity (Sefat, 25 Nov 2025).
Cross-Benchmark Context
Multiple independent benchmarks, including "The Riddle of Reflection" and BengaliFig, confirm that even frontier models plateau in accuracy (typically 40–55% on Bangla riddles), with metaphorical, wordplay, and culturally specific riddles proving hardest (M et al., 2 Nov 2025, Sefat, 25 Nov 2025).
5. Illustrative Examples
| Task | Example Riddle (Translation) | Ground Truth & Model Output |
|---|---|---|
| Homophone | “One fox lies on one bank, another on the other; How many foxes?” | Correct: 2; Model: 2.5 (error) |
| MCQ | “Enter the room, go to the veranda, a sudden sound makes you jump.” [Stirring-stick, etc.] | Correct: Stirring-stick; Model: Frying-pan |
| Ambiguity | Trigger: “হাত” (“hand”); Candidates: “hand,” “sleeve,” “claw,” “branch” | Correct: “sleeve” |
| CoT Expl. | “In a dark room, a monkey dances; saying ‘no no’ makes it dance more.” (Tongue) | Four-step reasoning as specified |
These examples underscore the frequent mismatch between LLMs’ surface-level pattern extraction and the deep, context-dependent reasoning required for riddle solutions (Sayeedi et al., 23 Dec 2025).
6. Insights, Limitations, and Future Directions
The findings highlight both progress and marked gaps in LLM handling of figurative and culturally grounded reasoning in Bangla. Key observations include:
- State-of-the-art models capture surface patterns and some metaphoric cues but do not reach human-level inference, especially with adversarial distractors and culturally opaque references.
- Few-shot and semantic-similarity prompting yield only marginal, inconsistent improvements, echoing analogous results from related Indian-language riddle evaluations (M et al., 2 Nov 2025).
- High-performing models tend toward overconfidence in their answers, while accuracy is inversely correlated with self-awareness metrics (true negative rate) (M et al., 2 Nov 2025).
Proposed future work includes expanding coverage to dialectal and oral-tradition variants, exploring multimodal clues, incorporating Bangla-centric pretraining or adapter modules for better cultural embedding, and developing reflection-aware training procedures to jointly optimize generation and error detection (Sayeedi et al., 23 Dec 2025, M et al., 2 Nov 2025, Sefat, 25 Nov 2025).
7. Resource Availability and Impact on Low-Resource NLP
All data artifacts, annotation scripts, and model outputs for BanglaRiddleEval are available at https://github.com/Labib1610/BanglaRiddleEval. The repository includes usage, licensing, and inference instructions to support further research and reproducibility. BanglaRiddleEval, together with complementary datasets such as BengaliFig and Indian Riddle subsets, establishes a paradigm for culturally informed, multi-perspective evaluation of LLMs in low-resource languages. Emphasis on script awareness, figurative reasoning, and human-in-the-loop annotation is recommended for future benchmarks aiming to advance robustness and inclusivity in NLP evaluation (Sayeedi et al., 23 Dec 2025, M et al., 2 Nov 2025, Sefat, 25 Nov 2025).