TALES-QA: Cultural and Narrative QA Benchmark
- TALES-QA is a benchmark designed to assess large language models’ ability to answer narrative and culture-specific questions using annotated misrepresentation spans and varied question formats.
- It employs a rigorous two-stage verification protocol in which expert annotators check each question for validity, uniqueness, cultural grounding, and answer correctness across English and 13 Indic languages.
- Experimental insights reveal that while models often possess correct cultural facts, they struggle to apply this knowledge contextually in generative tasks.
TALES-QA refers to a set of recent methodologies, datasets, and evaluation protocols designed for assessing and improving LLMs’ ability to answer questions about narratives, games, and stories, especially in contexts requiring robust reasoning or domain-specific knowledge such as cultural competence. Across multiple lines of research, TALES-QA marks a shift from generic QA to context-grounded evaluation targeting nuanced reading, reasoning, or cultural fidelity, often by leveraging synthetic or authentic story corpora and focusing on failure diagnosis in LLM outputs.
1. TALES-QA in Cultural Representation: Design and Construction
TALES-QA, as introduced in the context of cultural misrepresentation in LLM-generated stories, is a question-answer benchmark derived from spans annotated under the TALES-Tax taxonomy of seven misrepresentation categories: Cultural Inaccuracy, Unlikely Scenario, Clichés, Oversimplification, Factual Error, Linguistic Inaccuracy, and Logical Error (Bhagat et al., 26 Nov 2025). The central objective is to decouple a model’s factual cultural knowledge from its application (or misapplication) in creative narrative tasks. To construct TALES-QA, researchers began with 2,925 human-annotated misrepresentation spans from model-generated stories. GPT-4.1 was prompted to generate one question per span, formatted in one of five styles: one-word answer, fill-in-the-blank, multiple-choice, true/false, or one-phrase completion.
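A minimal sketch of this generation step, assuming the OpenAI chat-completions client and an illustrative prompt (the prompt wording and span schema below are not the authors' published pipeline), could look as follows:

```python
# Minimal sketch of the question-generation step, assuming the OpenAI
# chat-completions client; the prompt wording and the span schema are
# illustrative, not the authors' published pipeline.
from openai import OpenAI

client = OpenAI()

def generate_question(span_text: str, category: str, fmt: str) -> str:
    """Ask GPT-4.1 for one question grounded in a single annotated span.

    `fmt` is one of: one-word answer, fill-in-the-blank, multiple-choice,
    true/false, one-phrase completion.
    """
    prompt = (
        f"The following story excerpt was annotated as '{category}':\n"
        f"{span_text}\n\n"
        f"Write ONE question in the format '{fmt}' whose answer is the "
        f"cultural fact the excerpt misrepresents, followed by the gold answer."
    )
    response = client.chat.completions.create(
        model="gpt-4.1",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```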
A two-stage verification protocol involved 61 expert annotators (44 Indic native speakers, 17 English experts) who assessed each question on validity, uniqueness, cultural grounding, and answer correctness. Where a question failed any dimension, annotators revised or discarded it. Inter-annotator agreement exceeded 68% on all dimensions, with a maximum of 85.2% for validity, ensuring high reliability.
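The agreement figures reported above reduce to per-dimension percent agreement. The sketch below is a simplified illustration, assuming each verified question stores two annotators' binary judgments per dimension (the field names are hypothetical):

```python
# Per-dimension percent agreement; a simplified illustration of the
# verification statistics, not the authors' exact protocol.
from collections import defaultdict

DIMENSIONS = ["validity", "uniqueness", "cultural_grounding", "answer_correctness"]

def percent_agreement(judgments):
    """judgments: list of dicts mapping each dimension to a (bool, bool)
    tuple holding two annotators' labels for one question."""
    if not judgments:
        return {}
    agree = defaultdict(int)
    for item in judgments:
        for dim in DIMENSIONS:
            a, b = item[dim]
            agree[dim] += int(a == b)
    n = len(judgments)
    return {dim: agree[dim] / n for dim in DIMENSIONS}
```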
2. Dataset Statistics and Coverage
TALES-QA comprises 1,683 questions: 568 in English and 1,115 across 13 Indic languages. The distribution by question format and language is described in the following table.
| Format | English | Indic | Total |
|---|---|---|---|
| Multiple-Choice | 165 | 153 | 318 |
| Fill-in-the-Blank | 17 | 82 | 99 |
| True/False | 80 | 158 | 238 |
| One-word Answer | 289 | 676 | 965 |
| One-phrase Answer | 17 | 46 | 63 |
Questions cover over 71 city/town/village contexts and are linked to specific misrepresentation types. For example, a question asking where a particular statue is located (Cultural Inaccuracy), or which traditional food is not served at a wedding feast (Unlikely Scenario), directly tests the factual knowledge that, applied correctly, would prevent such generation errors (Bhagat et al., 26 Nov 2025).
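The per-format counts in the table above can be reproduced with a simple tally over the question records; the sketch below assumes hypothetical `language` and `format` fields and groups all non-English questions as Indic:

```python
# Tally question formats by language group; the `language` and `format`
# record fields are hypothetical placeholders for the released schema.
from collections import Counter

def format_distribution(records):
    """records: iterable of dicts with 'language' and 'format' keys.
    Any language other than English is grouped under 'Indic'."""
    counts = Counter()
    for rec in records:
        group = "English" if rec["language"] == "English" else "Indic"
        counts[(rec["format"], group)] += 1
    return counts
```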
3. Evaluation Protocols and Metrics
Each model is prompted according to question style (for instance, “Answer in one word” for completion items or “Choose A/B/C” for multiple-choice), and GPT-4o serves as an automatic rater, allowing for script differences and minor surface variation from the gold answer. Models generate five independent samples per question, and a question counts as correct if the majority of samples match the gold reference. Accuracy, the fraction of questions answered correctly, is the principal evaluation metric.
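The scoring rule can be sketched as follows, with `rate_with_gpt4o` standing in for the GPT-4o rater described above (a hypothetical helper, not a published interface):

```python
# Majority-vote scoring over five samples per question; `rate_with_gpt4o`
# is a hypothetical wrapper around the GPT-4o rater described above.
from typing import Callable, List

def score_question(samples: List[str], gold: str,
                   rate_with_gpt4o: Callable[[str, str], bool]) -> bool:
    """A question counts as correct if a majority of the sampled answers
    are judged to match the gold reference (allowing script differences
    and minor surface variation)."""
    matches = sum(rate_with_gpt4o(sample, gold) for sample in samples)
    return matches > len(samples) / 2

def accuracy(per_question_correct: List[bool]) -> float:
    """Accuracy is the fraction of questions answered correctly."""
    return sum(per_question_correct) / len(per_question_correct)
```

Because five is an odd number of samples, the majority rule always yields a definite correctness label for each question.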
A key protocol advantage is scoring that is largely automatic and scalable, enabling robust cross-model comparison and granular tracking of multilingual or category-wise knowledge.
4. Experimental Results and Insights
TALES-QA’s experimental campaign assessed six leading LLMs across all 1,683 items (Bhagat et al., 26 Nov 2025). In English, accuracy averaged 76.9% (model-wise range: 69.4%–86.3%), whereas Indic languages averaged 59.8% (range: 41.0%–74.1%), demonstrating a consistent ~17-point deficit for low/mid-resource languages. Best performers were Gemini 2.5 Pro (86.3% English, 74.1% Indic), LLaMA 3.3 (82.2% English, 66.1% Indic), and GPT-4.1 (79.4% English, 62.1% Indic). Restricting evaluation to questions derived from a model’s own misrepresentations produced similar results, indicating no selection skew.
Crucially, although 88% of generated stories contain one or more cultural misrepresentations, models usually “know” the underlying cultural facts (e.g., correct answers to TALES-QA items) but fail to reliably apply this knowledge during generative tasks. This suggests a central challenge is not parametric knowledge, but consistent, contextually appropriate deployment during open-ended generation.
5. Narrative and Reasoning QA: Pedagogical Foundations
The TALES-QA methodology is informed by narrative comprehension research, as exemplified in FairytaleQA (Xu et al., 2022). Successful QA benchmarks for narrative comprehension emphasize questions targeting explicit (surface) and implicit (inferential) information, spanning a standardized set of elements: Character, Setting, Action, Feeling, Causal Relationship, Prediction, and Outcome Resolution. Rigorous annotation protocols involve section-grounded questions, expert validation, and subskill labeling, enabling fine-grained assessment of model reading and reasoning skills. For comparison, FairytaleQA's 10,580 QA pairs are partitioned into 74.5% explicit and 25.5% implicit questions, mapped across the seven narrative categories.
Adopting such frameworks in TALES-QA design supports more differentiated evaluation—not only by outcome, but by subskill and knowledge type.
6. Extension to Reasoning in Interactive Environments
TALES-QA methodologies are closely coupled with the evaluation of reasoning in text-adventure tasks, as represented in the TALES suite of 122 games (Cui et al., 19 Apr 2025). These tasks target entity tracking, sequential decision-making, temporal dependencies, and compositional reasoning—spanning synthetic environments (Simon Says, TextWorld, ScienceWorld) and human-authored games (Jericho). Performance plateaus on synthetic games (~95–100% success for top models) but remains below 15% on human-crafted stories, underscoring a persistent gap in grounded, context-driven reasoning over long horizons.
A plausible implication is that integrating TALES-QA—measuring discrete cultural or narrative knowledge—can serve as a diagnostic adjunct to agent-based evaluation, helping identify whether reasoning failures are due to knowledge gaps or online reasoning limitations.
7. Implications and Future Directions
TALES-QA’s principal finding is that while LLMs possess much of the parametric knowledge necessary for high-fidelity narrative and culturally competent generation, their failures are often rooted in application rather than recall. Addressing these deficits may require new strategies in alignment, prompt engineering, or training objectives designed to reinforce faithful and contextually grounded deployment of known facts. Future expansions of TALES-QA are likely to incorporate adversarial, memory-augmented, or human-in-the-loop settings, as well as benchmarks for cumulative, multi-turn application of discrete knowledge in narrative, interactive, or generative tasks (Cui et al., 19 Apr 2025, Bhagat et al., 26 Nov 2025, Xu et al., 2022).