- The paper introduces a novel hybrid retrieval and LLM-driven RAG system that integrates full-text and semantic search to navigate complex humanities databases.
- It demonstrates improved retrieval precision and responsible generation with detailed evaluations, highlighting the benefits of dynamic query rewriting and text-to-SQL translation.
- The approach offers practical insights for scalable digital humanities research by addressing both structured and unstructured data, with emphasis on accuracy, ethics, and modular design.
Smart Assistants for Humanities Databases: LLM-Driven RAG with Hybrid and Semantic Search
This paper presents a retrieval-augmented generation (RAG) system with a multi-source data access architecture, tailored specifically for navigating and extracting value from large, heterogeneous humanities databases containing both free-form text (e.g., diaries) and structured relational metadata. The motivating problem is the inaccessibility and analytic friction of standard humanities data archives, caused by a mismatch between traditional search paradigms and the non-technical needs of domain researchers in history and anthropology.
System Architecture and Methodological Innovations
The assistant is implemented as a web-based, multi-turn chatbot that fuses several advances:
- Hybrid Retrieval: The system combines classical full-text retrieval (BM25/TF-IDF) with dense vector (semantic) retrieval via transformer-based encoders (E5, BGE-M3, OpenAI TE-3-Large). Their scores are normalized to [0, 1] and linearly combined, weighted by a parameter α, allowing a dynamic trade-off between the precision of exact term matching and recall under lexical variability.
- Text-to-SQL Translation: For structured data queries (e.g., biographical metadata, dates), the system uses LLM-generated SQL based on in-context-prompted schema conditioning, with support for few-shot and chain-of-thought prompt patterns for coverage and self-correction.
- Semantic Field Filtering: Short text attributes (e.g., professions, place names) are filtered using embedding similarity rather than lexical equality, addressing synonymy and aliasing (e.g., "Saint Petersburg", "Leningrad", "Petrograd").
- Automated Query Reformulation: LLM-based history-aware query rewriting ensures user prompts lacking explicit referents (e.g., follow-ups) are mapped to effective search queries, mitigating the context-dependence of conversational interactions.
- Source-Linked Generation: All generated responses include explicit hyperlinks to original database entries, supporting both answer verification and traceability—a critical requirement for expert humanities research.
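The α-weighted fusion of lexical and semantic scores can be sketched in a few lines. This is an illustrative reconstruction, not the paper's implementation; the min-max normalization scheme and the example α are assumptions:

```python
import numpy as np

def min_max_normalize(scores: np.ndarray) -> np.ndarray:
    """Scale raw retrieval scores to [0, 1]; a constant vector maps to zeros."""
    lo, hi = scores.min(), scores.max()
    if hi == lo:
        return np.zeros_like(scores)
    return (scores - lo) / (hi - lo)

def hybrid_scores(lexical: np.ndarray, semantic: np.ndarray, alpha: float = 0.5) -> np.ndarray:
    """Linearly combine normalized lexical (BM25/TF-IDF) and dense-retrieval scores.

    alpha = 1.0 ranks purely by semantic similarity; alpha = 0.0 purely lexically.
    """
    return alpha * min_max_normalize(semantic) + (1 - alpha) * min_max_normalize(lexical)

# Rank three candidate fragments by the fused score (toy numbers).
lex = np.array([12.3, 0.0, 4.1])    # e.g., raw BM25 scores
sem = np.array([0.82, 0.74, 0.91])  # e.g., cosine similarities
ranking = np.argsort(-hybrid_scores(lex, sem, alpha=0.6))
```

Note how the fused ranking can differ from either signal alone: the lexically strongest fragment need not win once semantic similarity is weighted in.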
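The semantic field filtering step described above can be illustrated with toy vectors. A real system would obtain embeddings from a transformer encoder such as E5 or BGE-M3; the hand-built vectors and the similarity threshold below are illustrative assumptions only:

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 3-d embeddings standing in for encoder output; the historical aliases of
# one city ("Saint Petersburg" / "Leningrad" / "Petrograd") cluster together.
EMB = {
    "Saint Petersburg": np.array([0.90, 0.10, 0.00]),
    "Leningrad":        np.array([0.88, 0.15, 0.02]),
    "Petrograd":        np.array([0.85, 0.12, 0.05]),
    "Moscow":           np.array([0.10, 0.90, 0.10]),
}

def semantic_filter(query: str, values: list[str], threshold: float = 0.95) -> list[str]:
    """Keep field values whose embedding similarity to the query clears the
    threshold, instead of requiring exact lexical equality."""
    q = EMB[query]
    return [v for v in values if cosine(q, EMB[v]) >= threshold]

matches = semantic_filter("Saint Petersburg", list(EMB))
```

With these toy vectors the filter retains all three aliases of the city and rejects "Moscow", which a lexical equality test on "Saint Petersburg" could not do.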
Experimental Design and Empirical Results
The system was evaluated on a curated subset (125 entries, 50 questions, spanning 25 topics) of the "Prozhito" digital diary archive (60k+ Russian-language diary entries from 1900-1916). Both the retrieval and generation components were assessed by human annotators.
Retrieval models were benchmarked with Precision@5, the proportion of relevant (true-positive) fragments among the top-5 candidates returned per query.
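Precision@k reduces to a few lines; the query/document identifiers below are invented purely for illustration:

```python
def precision_at_k(retrieved: list[list[str]], relevant: list[set[str]], k: int = 5) -> float:
    """Mean fraction of relevant items among each query's top-k retrieved results."""
    per_query = [
        sum(1 for doc in docs[:k] if doc in rel) / k
        for docs, rel in zip(retrieved, relevant)
    ]
    return sum(per_query) / len(per_query)

# Two toy queries: 3/5 and 2/5 relevant fragments in the top five -> mean 0.5.
retrieved = [["d1", "d2", "d3", "d4", "d5"], ["d6", "d7", "d8", "d9", "d10"]]
relevant = [{"d1", "d3", "d5"}, {"d7", "d9"}]
score = precision_at_k(retrieved, relevant)
```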
| Search Model | Precision@5 |
| --- | --- |
| tf-idf | 0.264 |
| e5-large | 0.528 |
| e5-large + tf-idf | 0.556 |
| bge-m3 | 0.568 |
| bge-m3 + tf-idf | 0.556 |
| te-3-large | 0.548 |
| te-3-large + tf-idf | 0.572 |
- The combination of semantic and full-text retrieval generally improved performance (+0.028 for e5-large, +0.024 for te-3-large), but decreased slightly for bge-m3 (-0.012), plausibly due to self-distillation of full-text features into the embedding model during bge-m3’s pretraining.
- Semantic search was clearly dominant; full-text retrieval alone was insufficient for the freeform, non-terminological diary text typical of humanities corpora.
LLMs for answer synthesis were evaluated for factual Accuracy and Ethics (scale: 1–5 per criterion).
- DeepSeek-V3 achieved the highest Accuracy.
- o3-mini scored highest on Ethics, though inter-model variations were marginal.
- Models performed strongly on question relevance and grammaticality, with most substantive errors arising from incorrect analysis of context or spurious inferences ("hallucinations") from source fragments, often involving fabrication, misattribution, or unsupported conclusions.
Ethical Robustness and Safety
A focused evaluation on "provocative" topics (e.g., weapons, drugs, self-harm, assassinations) demonstrated that, despite models flagging such contexts as sensitive, they nevertheless generated informative answers with disclaimers (e.g., referring to historical context). This highlights the persistent vulnerability of contemporary LLMs to "jailbreaking" via temporal (past-tense) reframing of harmful queries, in line with recent findings on LLM refusal robustness.
Implications and Limitations
Practical Implications:
- This architecture demonstrates a viable route for deploying LLM-driven assistants in domain-specific archives requiring nuanced handling of both structured and unstructured data.
- The modular pipeline—hybrid search, LLM question rewriting, text-to-SQL, and semantic field filtering—is generally transferable to other digital humanities, social science, or historical corpora, with only minor adaptation required for schema or language idiosyncrasies.
Theoretical Implications:
- The hybrid search results emphasize the importance of understanding retrieval model training regimes; index redundancy (semantic + lexical) is not universally additive.
- LLMs’ performance on both factual interpretation and ethical adherence is sensitive to both prompt structure and conversational context, with context-aware query rewriting necessary to mitigate context drift in dialogue.
Resource and Scaling Considerations:
- Semantic retrieval at the scale of tens of thousands of diary entries is tractable with modern GPU-accelerated vector databases (e.g., FAISS, Milvus); scaling to order-of-magnitude larger archives will require distributed embedding stores and asynchronous retrieval pipelines.
- LLM inference for question rewriting and answer generation demands high-availability LLM endpoints; open-source models now closely approach or match commercial offerings, particularly for chat and reasoning.
Limitations:
- The Prozhito corpus lacks technical terminology and is morphologically diverse; results may not generalize to specialized or low-resource languages.
- Text-to-SQL evaluation was limited to SELECT retrieval queries; more complex database manipulations were not assessed.
- Annotation protocols for answer accuracy and ethics, while robust, leave open the issues of annotator bias and cross-linguistic generalizability.
Future Directions
- Long-Context RAG: Incorporating LongLM or memory-augmented profiles to enable persistent, session-spanning context tracking.
- Improved Jailbreak Resistance: Introducing response layer safety filtering, LLM fine-tuning with counter-jailbreak datasets, or integrating external refusal engines.
- Dynamic User Modeling: Adapting retrieval/generation strategies conditioned on user expertise or intent.
- Broader Database Support: Extending to multimodal archives (e.g., images, audio) via cross-modal retrieval and generation.
This work constitutes a substantive advance toward inclusive, natural-language-driven access to humanities archives, underlining both the promise and persistent challenges of LLM-based information assistants in high-context, safety-critical domains.