- The paper introduces a novel hybrid retrieval and LLM-driven RAG system that integrates full-text and semantic search to navigate complex humanities databases.
- It demonstrates improved retrieval precision and responsible generation with detailed evaluations, highlighting the benefits of dynamic query rewriting and text-to-SQL translation.
- The approach offers practical insights for scalable digital humanities research by addressing both structured and unstructured data, with emphasis on accuracy, ethics, and modular design.
Smart Assistants for Humanities Databases: LLM-Driven RAG with Hybrid and Semantic Search
This paper presents a retrieval-augmented generation (RAG) system with a multi-source data access architecture, tailored specifically for navigating and extracting value from large, heterogeneous humanities databases containing both free-form text (e.g., diaries) and structured relational metadata. The motivating problem is the inaccessibility and analytic friction of standard humanities data archives, caused by a mismatch between traditional search paradigms and the non-technical needs of domain researchers in history and anthropology.
System Architecture and Methodological Innovations
The assistant is implemented as a web-based, multi-turn chatbot that fuses several advances:
- Hybrid Retrieval: The system combines classical full-text retrieval (BM25/TF-IDF) with dense vector (semantic) retrieval via transformer-based encoders (E5, BGE-M3, OpenAI TE-3-Large). Their scores are normalized to [0, 1] and linearly combined, weighted by a parameter α, allowing a dynamic trade-off between the precision of exact term matching and recall under lexical variability.
- Text-to-SQL Translation: For structured data queries (e.g., biographical metadata, dates), the system uses LLM-generated SQL based on in-context-prompted schema conditioning, with support for few-shot and chain-of-thought prompt patterns for coverage and self-correction.
- Semantic Field Filtering: Short text attributes (e.g., professions, place names) are filtered using embedding similarity rather than lexical equality, addressing synonymy and aliasing (e.g., "Saint Petersburg", "Leningrad", "Petrograd").
- Automated Query Reformulation: LLM-based history-aware query rewriting ensures user prompts lacking explicit referents (e.g., follow-ups) are mapped to effective search queries, mitigating the context-dependence of conversational interactions.
- Source-Linked Generation: All generated responses include explicit hyperlinks to original database entries, supporting both answer verification and traceability—a critical requirement for expert humanities research.
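The α-weighted fusion of lexical and semantic scores can be sketched in a few lines. This is an illustrative reconstruction, not the paper's implementation; the min-max normalization scheme and the example α are assumptions:

```python
import numpy as np

def min_max_normalize(scores: np.ndarray) -> np.ndarray:
    """Scale raw retrieval scores to [0, 1]; a constant vector maps to zeros."""
    lo, hi = scores.min(), scores.max()
    if hi == lo:
        return np.zeros_like(scores)
    return (scores - lo) / (hi - lo)

def hybrid_scores(lexical: np.ndarray, semantic: np.ndarray, alpha: float = 0.5) -> np.ndarray:
    """Linearly combine normalized lexical (BM25/TF-IDF) and dense-retrieval scores.

    alpha = 1.0 ranks purely by semantic similarity; alpha = 0.0 purely lexically.
    """
    return alpha * min_max_normalize(semantic) + (1 - alpha) * min_max_normalize(lexical)

# Rank three candidate fragments by the fused score (toy numbers).
lex = np.array([12.3, 0.0, 4.1])    # e.g., raw BM25 scores
sem = np.array([0.82, 0.74, 0.91])  # e.g., cosine similarities
ranking = np.argsort(-hybrid_scores(lex, sem, alpha=0.6))
```

Note how the fused ranking can differ from either signal alone: the lexically strongest fragment need not win once semantic similarity is weighted in.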
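The semantic field filtering step described above can be illustrated with toy vectors. A real system would obtain embeddings from a transformer encoder such as E5 or BGE-M3; the hand-built vectors and the similarity threshold below are illustrative assumptions only:

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 3-d embeddings standing in for encoder output; the historical aliases of
# one city ("Saint Petersburg" / "Leningrad" / "Petrograd") cluster together.
EMB = {
    "Saint Petersburg": np.array([0.90, 0.10, 0.00]),
    "Leningrad":        np.array([0.88, 0.15, 0.02]),
    "Petrograd":        np.array([0.85, 0.12, 0.05]),
    "Moscow":           np.array([0.10, 0.90, 0.10]),
}

def semantic_filter(query: str, values: list[str], threshold: float = 0.95) -> list[str]:
    """Keep field values whose embedding similarity to the query clears the
    threshold, instead of requiring exact lexical equality."""
    q = EMB[query]
    return [v for v in values if cosine(q, EMB[v]) >= threshold]

matches = semantic_filter("Saint Petersburg", list(EMB))
```

With these toy vectors the filter retains all three aliases of the city and rejects "Moscow", which a lexical equality test on "Saint Petersburg" could not do.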
Experimental Design and Empirical Results
The system was evaluated on a curated subset (125 entries, 50 questions, spanning 25 topics) of the "Prozhito" digital diary archive (60k+ Russian-language diary entries from 1900-1916). Both the retrieval and generation components were assessed by human annotators.
Retrieval models were benchmarked with Precision@5, the proportion of relevant (true-positive) fragments among the top-5 candidates returned per query.
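Precision@k reduces to a few lines; the query/document identifiers below are invented purely for illustration:

```python
def precision_at_k(retrieved: list[list[str]], relevant: list[set[str]], k: int = 5) -> float:
    """Mean fraction of relevant items among each query's top-k retrieved results."""
    per_query = [
        sum(1 for doc in docs[:k] if doc in rel) / k
        for docs, rel in zip(retrieved, relevant)
    ]
    return sum(per_query) / len(per_query)

# Two toy queries: 3/5 and 2/5 relevant fragments in the top five -> mean 0.5.
retrieved = [["d1", "d2", "d3", "d4", "d5"], ["d6", "d7", "d8", "d9", "d10"]]
relevant = [{"d1", "d3", "d5"}, {"d7", "d9"}]
score = precision_at_k(retrieved, relevant)
```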
| Search Model | Precision@5 |
| --- | --- |
| tf-idf | 0.264 |
| e5-large | 0.528 |
| e5-large + tf-idf | 0.556 |
| bge-m3 | 0.568 |
| bge-m3 + tf-idf | 0.556 |
| te-3-large | 0.548 |
| te-3-large + tf-idf | 0.572 |
- The combination of semantic and full-text retrieval generally improved performance (+0.028 for e5-large, +0.024 for te-3-large), but decreased slightly for bge-m3 (-0.012), plausibly due to self-distillation of full-text features into the embedding model during bge-m3’s pretraining.
- Semantic search was clearly dominant; full-text retrieval alone was insufficient for the freeform, non-terminological diary text typical of humanities corpora.
LLMs for answer synthesis were evaluated for factual Accuracy and Ethics (scale: 1–5 per criterion).
- DeepSeek-V3 achieved the highest Accuracy.
- o3-mini scored highest on Ethics, though inter-model variations were marginal.
- Models performed strongly on question relevance and grammaticality, with most substantive errors arising from incorrect analysis of context or spurious inferences ("hallucinations") from source fragments, often involving fabrication, misattribution, or unsupported conclusions.
Ethical Robustness and Safety
A focused evaluation on "provocative" topics (e.g., weapons, drugs, self-harm, assassinations) demonstrated that, despite models flagging such contexts as sensitive, they nevertheless generated informative answers with disclaimers (e.g., referring to historical context). This highlights the persistent vulnerability of contemporary LLMs to "jailbreaking" via temporal (past-tense) reframing of harmful queries, in line with recent findings on LLM refusal robustness.
Implications and Limitations
Practical Implications:
- This architecture demonstrates a viable route for deploying LLM-driven assistants in domain-specific archives requiring nuanced handling of both structured and unstructured data.
- The modular pipeline—hybrid search, LLM question rewriting, text-to-SQL, and semantic field filtering—is generally transferable to other digital humanities, social science, or historical corpora, with only minor adaptation required for schema or language idiosyncrasies.
Theoretical Implications:
- The hybrid search results emphasize the importance of understanding retrieval model training regimes; index redundancy (semantic + lexical) is not universally additive.
- LLMs’ performance on both factual interpretation and ethical adherence is sensitive to both prompt structure and conversational context, with context-aware query rewriting necessary to mitigate context drift in dialogue.
Resource and Scaling Considerations:
- Semantic retrieval at the scale of tens of thousands of diary entries is tractable with modern GPU-accelerated vector databases (e.g., FAISS, Milvus); scaling to order-of-magnitude larger archives will require distributed embedding stores and asynchronous retrieval pipelines.
- LLM inference for question rewriting and answer generation demands high-availability LLM endpoints; open-source models now closely approach or match commercial offerings, particularly for chat and reasoning.
Limitations:
- The Prozhito corpus lacks technical terminology and is morphologically diverse; results may not generalize to specialized or low-resource languages.
- Text-to-SQL evaluation was limited to SELECT retrieval queries; more complex database manipulations were not assessed.
- Annotation protocols for answer accuracy and ethics, while robust, leave open the issues of annotator bias and cross-linguistic generalizability.
Future Directions
- Long-Context RAG: Incorporating LongLM or memory-augmented profiles to enable persistent, session-spanning context tracking.
- Improved Jailbreak Resistance: Introducing response layer safety filtering, LLM fine-tuning with counter-jailbreak datasets, or integrating external refusal engines.
- Dynamic User Modeling: Adapting retrieval/generation strategies conditioned on user expertise or intent.
- Broader Database Support: Extending to multimodal archives (e.g., images, audio) via cross-modal retrieval and generation.
This work constitutes a substantive advance toward inclusive, natural-language-driven access to humanities archives, underlining both the promise and persistent challenges of LLM-based information assistants in high-context, safety-critical domains.