
Book-Plot Retrieval Dataset

Updated 5 August 2025
  • The book-plot retrieval dataset is a curated benchmark that evaluates retrieval models on linking abstract queries with contextually situated narrative passages.
  • It emphasizes low lexical overlap and contextual dependency by employing techniques like situated context windows and margin-based contrastive loss.
  • The dataset drives advancements in narrative understanding, book discovery, and retrieval-augmented generation while highlighting the human–machine gap in abstract semantic association.

A Book-Plot Retrieval Dataset is a specialized benchmark designed for evaluating information retrieval (IR) systems on the task of retrieving relevant segments or “plots” from long narrative texts, such as books, in response to natural language queries that tend to be abstract, context-dependent, or semantically rich. Unlike traditional IR datasets reliant on direct lexical overlap, book-plot retrieval benchmarks emphasize the capacity of retrieval models to bridge abstract semantic associations, handle long-range dependencies, and situate local narrative evidence within broader context windows. Such datasets, and associated evaluation protocols, are foundational for advancing narrative understanding tasks, improving retrieval-augmented generation (RAG) in literary domains, and assessing system performance on real-world queries requiring deep story comprehension.

1. Dataset Objectives and Design Principles

The primary objective of a book-plot retrieval dataset is to simulate realistic reading scenarios in which users seek to locate semantically relevant passages based on incomplete or highly abstracted queries (Xu et al., 2023, Wu et al., 3 Aug 2025). Unlike standard IR tasks that reward surface-level matches, this paradigm highlights:

  • Low lexical overlap: Queries usually paraphrase, summarize, or speculate about plots rather than quoting verbatim book text.
  • Abstract semantic association: Ground-truth query-chunk pairs are selected based on their indirect, high-level narrative connection.
  • Contextual dependency: The utility and interpretation of each candidate chunk depend on its surrounding narrative, necessitating a situated approach.

Dataset design typically involves the curation of query–passage pairs, careful selection of query sources (e.g., reader comments, user notes), and chunking strategies that reflect narrative continuity. For robust evaluation, datasets may include both manual annotation (to ensure the correctness and abstractness of associations) and measures to filter out pairs with excessive word overlap (Xu et al., 2023, Wu et al., 3 Aug 2025).

2. Data Sources, Construction, and Annotation

Book-plot retrieval datasets draw on a variety of sources:

  • User-generated content: Publicly available reader comments or book notes are used as queries, with the associated anchor in the text serving as the target chunk (Wu et al., 3 Aug 2025).
  • Direct manual annotation: Annotators select or validate plot-chunk pairs to ensure high semantic relevance and abstract association (Xu et al., 2023).
  • Filtering and preprocessing: Text pairs with over 50% token overlap between query and chunk are excluded to accentuate semantic—rather than lexical—matching.
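
As a concrete illustration of the overlap filter, the sketch below keeps only pairs below the 50% threshold; simple whitespace tokenization and measuring overlap against the query's token set are assumptions, since the source does not specify the exact procedure:

```python
def token_overlap(query: str, chunk: str) -> float:
    """Fraction of query tokens that also occur in the chunk (whitespace tokenization)."""
    q_tokens = set(query.lower().split())
    c_tokens = set(chunk.lower().split())
    return len(q_tokens & c_tokens) / len(q_tokens) if q_tokens else 0.0

def filter_pairs(pairs, max_overlap: float = 0.5):
    """Drop query-chunk pairs whose token overlap reaches the threshold,
    so that the remaining pairs require semantic rather than lexical matching."""
    return [(q, c) for q, c in pairs if token_overlap(q, c) < max_overlap]
```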

Books are pre-processed by splitting the text into consecutive segments (e.g., 200-token windows); “situated context” windows are then constructed by grouping, for example, 16 consecutive segments around each evaluated chunk, capturing both the local and the broader storyline (Wu et al., 3 Aug 2025). Query–chunk pairs number in the hundreds of thousands, ensuring sufficient coverage for statistical analysis.
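
A minimal sketch of this preprocessing step follows; the 200-token segments and 16-segment windows use the figures quoted above, while whitespace tokenization and centering the window on the evaluated chunk are simplifying assumptions:

```python
def make_chunks(book_text: str, chunk_tokens: int = 200):
    """Split a book into consecutive ~200-token segments (whitespace tokens)."""
    tokens = book_text.split()
    return [" ".join(tokens[i:i + chunk_tokens])
            for i in range(0, len(tokens), chunk_tokens)]

def situated_context(chunks, idx: int, window: int = 16):
    """Group `window` consecutive segments around chunk `idx` as its situated context.
    How the window is positioned relative to the chunk is an assumption here."""
    start = max(0, min(idx - window // 2, len(chunks) - window))
    return " ".join(chunks[start:start + window])
```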

Annotation criteria focus on whether the pair exhibits indirect, semantically meaningful association (summarization, analysis, expression, vision, reminiscence) rather than direct restatement. Multi-rater validation and additional expert annotation may be used for reliability (Xu et al., 2023).

3. Retrieval Modeling Paradigms and Methodology

Book-plot retrieval datasets encourage the use of models that go beyond surface-level retrieval and integrate broader story context.

A. Context-aware Dense Embedding Models:

  • Models such as SitEmb-v1 and SitEmb-v1.5 represent chunks conditioned on their broader context. The approach uses two embedding models—a chunk-only baseline and a situated model—in a residual learning framework:
    • For each chunk, embeddings are computed for both the chunk alone and the chunk with its context, then combined additively as $\tilde{c} = c^b + c^s$; queries are handled analogously, $\tilde{q} = q^b + q^s$ (Wu et al., 3 Aug 2025).
  • Training employs margin-based contrastive loss, sampling 10 negatives per query:

\mathcal{L}(\Theta^{b}, \Theta^{s}) = \frac{1}{N} \sum_{i=1}^{N} \max\Big(0,\ \gamma + \text{sim}(\tilde{q}, \tilde{c}_i^-) - \text{sim}(\tilde{q}, \tilde{c}^+)\Big)

where $\gamma$ is the margin and $N$ is the number of negatives.
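
A hedged PyTorch-style sketch of the residual combination and the margin loss is given below; the encoder calls are placeholders, and the cosine similarity and margin value are assumptions not fixed by the text, so this illustrates the formulation rather than the authors' implementation:

```python
import torch
import torch.nn.functional as F

def situated_embedding(base_encoder, situated_encoder, chunk: str, context: str):
    """Residual combination: chunk-only embedding plus a context-conditioned residual,
    i.e. c~ = c^b + c^s (query embeddings are combined the same way)."""
    c_b = base_encoder(chunk)               # chunk-only embedding c^b
    c_s = situated_encoder(chunk, context)  # situated embedding   c^s
    return c_b + c_s

def margin_contrastive_loss(q, pos, negs, margin: float = 0.2):
    """Margin-based contrastive loss over one positive and N sampled negatives.
    `q` and `pos` are 1-D embeddings; `negs` is an (N, dim) tensor.
    Cosine similarity and margin=0.2 are illustrative assumptions."""
    sim_pos = F.cosine_similarity(q, pos, dim=-1)                 # sim(q~, c~+)
    sim_negs = F.cosine_similarity(q.unsqueeze(0), negs, dim=-1)  # sim(q~, c~_i^-)
    return torch.clamp(margin + sim_negs - sim_pos, min=0.0).mean()
```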

B. Baseline Comparisons:

Models evaluated include:

  • Lexical approaches (BM25)
  • Sparse neural methods (DeepCT, SPARTA, docT5query)
  • Dense retrieval (DPR, ANCE, BERM, coCondenser, Contriever)
  • Late-interaction (ColBERT)
  • Cross-encoder re-rankers
  • LLM-based expansions (e.g., using ChatGPT to generate expanded queries), which did not prove effective for this task due to excessive generalization (Xu et al., 2023).

Results consistently indicate that cross-encoder models perform best among prior techniques, yet they still lag substantially behind humans in capturing abstract semantic associations.

4. Evaluation Metrics and Performance Characteristics

Given the unique properties of narrative text, novel evaluation metrics are employed:

  • Normalized Relative Offset Discounted Cumulative Gain (N-RODCG):

\text{RODCG}@k = \sum_{i=1}^{k} \frac{f(d_i)}{\log(i+1)}

where $d_i$ is the minimal distance between the $i$-th retrieved chunk and any ground-truth plot, and $f(d) = 1/(d+1)$ if $d < \alpha$ (with $\alpha = 5$), else $0$ (Xu et al., 2023).

The N-RODCG score is obtained by normalizing RODCG@k by the ideal value for each query; a computational sketch is given after the list below.

  • Recall@k, MRR: Standard IR metrics are also reported.
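
For reference, a small sketch of RODCG@k and its normalized form follows; the natural logarithm and the ideal score (all top-$k$ chunks at distance 0) are assumptions where the definition above leaves the choice open:

```python
import math

def rodcg_at_k(distances, k: int, alpha: float = 5):
    """RODCG@k = sum_i f(d_i) / log(i + 1), where d_i is the distance of the
    i-th retrieved chunk to the nearest ground-truth plot and
    f(d) = 1 / (d + 1) if d < alpha, else 0."""
    score = 0.0
    for i, d in enumerate(distances[:k], start=1):
        gain = 1.0 / (d + 1) if d < alpha else 0.0
        score += gain / math.log(i + 1)
    return score

def n_rodcg_at_k(distances, k: int, alpha: float = 5):
    """Normalize by an assumed ideal ranking with every top-k chunk at distance 0."""
    ideal = sum(1.0 / math.log(i + 1) for i in range(1, k + 1))
    return rodcg_at_k(distances, k, alpha) / ideal
```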

Experimental results show that context-aware models leveraging situated embeddings (e.g., SitEmb-v1, SitEmb-v1.5) substantially outperform both chunk-only and long-context-only baselines. For instance, SitEmb-v1.5, based on an 8B parameter Qwen3-Embedding model, yields over 10% higher recall than leading commercial and academic baselines, confirming the critical advantage of situated chunk modeling (Wu et al., 3 Aug 2025).

However, even the best dense or cross-encoder models fall short of human performance, illustrating the persistent challenge of modeling abstract narrative associations (Xu et al., 2023).

5. Applications and Impact

Book-plot retrieval datasets have direct applications in:

  • Book discovery: Matching complex user queries with relevant plot events or segments, supporting advanced search in digital libraries, e-books, and reading apps.
  • RAG systems: Powering retrieval-augmented generation by linking queries to contextually accurate, semantically aligned narrative passages.
  • Semantic evaluation benchmarks: Functioning as diagnostic tools for semantic association modeling, highlighting the limits of current embedding architectures and training regimes.
  • Downstream story comprehension tasks: Enabling improved performance in claim verification, QA, recap snippet identification, and claim-evidence linkage in literary analysis (Wu et al., 3 Aug 2025, Li et al., 11 Feb 2024).

The datasets’ multilingual applicability and context-aware methodology suit global, scalable deployments. Their structure also enables the assessment of both local and global plot understanding.

6. Limitations, Open Questions, and Future Directions

Key limitations and future research avenues include:

  • Domain adaptation and generalization: Current models and datasets are optimized for narrative texts; transfer to informational or technical domains remains an open challenge (Wu et al., 3 Aug 2025).
  • Semantic spectrum control: There is a recognized need for controllable training objectives that can tune the model’s association spectrum, from direct (explicit relevance) to abstract (indirect association), potentially via enhanced instruction-based fine-tuning.
  • Efficiency and scalability: While residual learning alleviates the computational cost of long-context modeling, more efficient architectures may be required for global deployment at greater scale.
  • Annotation and benchmark expansion: Broader and more diverse datasets will afford a richer evaluation of model capabilities, particularly for multi-cultural, multi-genre, or cross-lingual scenarios.
  • Context window optimization: Determining context window size and structure for maximal retrieval quality without overwhelming model capacity is an ongoing research area.
  • Human–machine gap: Persistent, significant performance differences between top models and human raters suggest that deeper narrative and commonsense reasoning capabilities are required for true human-like retrieval.

This suggests that, while current models demonstrate tangible improvements in dealing with narrative context and abstract association, foundational advances in architecture, training signal design, and data diversity remain necessary to reach the level of human comprehension and relevance ranking in book-plot retrieval settings.