Retrieval-Augmented Question Answering over Scientific Literature for the Electron-Ion Collider

Published 2 Apr 2026 in hep-ex, cs.AI, and physics.ins-det | (2604.02259v1)

Abstract: To harness the power of LLMs in answering domain specific specialized technical questions, Retrieval Augmented Generation (RAG) is been used widely. In this work, we have developed a Q&A application inspired by the Retrieval Augmented Generation (RAG), which is comprised of an in-house database indexed on the arXiv articles related to the Electron-Ion Collider (EIC) experiment - one of the largest international scientific collaboration and incorporated an open-source LLaMA model for answer generation. This is an extension to it's proceeding application built on proprietary model and Cloud-hosted external knowledge-base for the EIC experiment. This locally-deployed RAG-system offers a cost-effective, resource-constraint alternative solution to build a RAG-assisted Q&A application on answering domain-specific queries in the field of experimental nuclear physics. This set-up facilitates data-privacy, avoids sending any pre-publication scientific data and information to public domain. Future improvement will expand the knowledge base to encompass heterogeneous EIC-related publications and reports and upgrade the application pipeline orchestration to the LangGraph framework.

Abstract PDF Upgrade to Chat

Authors (3)

Summary

The paper introduces an on-premises RAG pipeline that grounds LLM responses in curated EIC scientific literature.
It employs fine-grained text chunking, 1024-dimensional embeddings, and cosine similarity with MMR to achieve accurate contextual retrieval.
Evaluations reveal high context recall and faithfulness with challenges in entity recall and answer correctness, guiding future improvements.

Retrieval-Augmented Question Answering System for EIC Scientific Literature

Introduction

This work proposes and evaluates an on-premises, open-source Retrieval-Augmented Generation (RAG) pipeline for question answering over arXiv scientific literature pertaining to the Electron-Ion Collider (EIC) experiment (2604.02259). RAG is leveraged to ground LLM responses in curated scientific contexts, mitigating the typical hallucination errors observed in pretrained LMs. The deployment targets the EIC collaboration—an international experiment producing a substantial corpus of technical and scientific publications—by facilitating accurate, context-aware, and citation-traceable responses to detailed queries. The design prioritizes resource efficiency, privacy (eschewing cloud dependencies and external APIs), and transparent traceability of generated responses to their foundational sources.

System Architecture

The architecture encompasses a modular pipeline consisting of article ingestion, fine-grained document chunking, embedding generation, vector storage and retrieval, prompt formation, and context-conditioned answer generation via LLMs.

Figure 1: Schematic overview of the RAG-based QA system, detailing ingestion, chunking, vectorization, ChromaDB indexing, retrieval, prompt assembly, and response generation.

The system ingests 178 arXiv articles spanning EIC phenomenology, hardware, and computational developments. Texts are segmented with the RecursiveCharacterTextSplitter (120 or 180-character segments, with a 20-character overlap), an approach aligned with prior literature emphasizing the trade-off between information coverage and relevance granularity. These chunks and associated metadata (arXiv ID, authors, publication year) are embedded into 1024-dimensional vectors using the mxbai-embed-large transformer model, achieving strong semantic fidelity and supporting local deployment. Of the multiple vector-store technologies examined (FAISS, Pinecone, LanceDB, ChromaDB), ChromaDB was selected for its privacy, efficiency, and compatibility with LangChain orchestration.

Query retrieval utilizes both cosine similarity and Maximum Marginal Relevance (MMR); top-20 most relevant chunks are incorporated into a synthesized prompt delivered to LLaMA LLM variants (3.2 and 3.3). A guardrail-enforced prompt formulation restricts answer generation strictly to retrieved contexts, with every response undergoing context traceability via LangSmith.

Latency Characterization

The study benchmarks the latency distribution at both the retrieval and generation steps, dissecting the impact of chunk size, retrieval algorithm (Cosine, MMR), and LLaMA model size.

Figure 2: (Left) Retrieval latency comparison between 120 and 180-character chunk sizes; (Right) Latency distribution for Cosine similarity versus MMR retrieval.

The empirical evaluation indicates that retrieval latency is effectively invariant under both chunk size (median ≈0.11s for 120 and 180) and retrieval metric, though wider chunk sizes introduce slightly greater variance. In sharp contrast, generation latency is highly sensitive to model scale:

Figure 3: Distribution of response generation delay for LLaMA 3.2 versus LLaMA 3.3 models.

LLaMA 3.3, while potentially more capable, introduces an order-of-magnitude increase in inference time and greater variance, rendering it less suitable for interactive QA deployment under resource constraints. LLaMA 3.2 provides a stable, narrow latency profile (median 10–20s).

Evaluation: Retrieval-Augmented Generation Assessment

The system’s output is rigorously evaluated against an expert-vetted gold-standard dataset (AI4EIC2023_DATASETS, 51 questions with multi-claim structure). RAG Assessment (RAGAS) metrics—Context Entity Recall, Context Precision, Context Recall, Answer Relevance, Correctness, and Faithfulness—are computed across chunking and retrieval strategies.

Figure 4: RAGAS benchmark distributions for chunk size 120, comparing Cosine and MMR retrieval across six evaluation metrics.

Figure 5: RAGAS scores for chunk size 180; upper panel: Cosine similarity, lower: MMR.

Robust trends are observed:

Context Recall consistently clusters near 1.0, especially for 180-character chunks, indicating high recall of ground-truth claims due to improved context preservation.
Context Precision and Entity Recall remain modest, with broad bimodality and moderate performance. This likely reflects challenges for general-purpose embedding models in capturing domain-specific scientific entities.
Faithfulness and Answer Relevance both improve with chunk size, evidencing more comprehensive factual alignment in generated answers as larger segments encompass more supporting evidence.
Answer Correctness lags behind, with notably low scores, highlighting the inherent difficulty of capturing precise factual details important in the EIC domain and the limitations of lightweight LLaMA variants.

Importantly, MMR fails to produce clear performance gains over cosine retrieval within this domain-specific, moderate-size corpus.

Implications and Future Directions

The demonstrated methodology provides a practical blueprint for domain-grounded scientific QA with controlled data governance. By eschewing proprietary LMs and cloud-hosted vector DBs, it lays a foundation for collaboration-centric, privacy-preserving research assistance. Several salient insights arise:

Model scaling has greater impact on system responsiveness than marginal retrieval design variations, emphasizing the need for careful resource/model size calibration in practical deployments.
Chunk size must balance between recall and precision, with larger chunks advisable given the document structure and QA needs typical in experimental physics literature.
The moderate RAGAS correctness and precision scores signal a need for science-specialized embedding models and more robust entity-aware retrieval architectures.

Future work will focus on expanding the knowledge base to include presentations, wikis, and technical reports, as well as orchestrating the pipeline via LangGraph to enable more flexible control and monitoring. Incorporating open-source LMs with stronger performance on long, technical context and developing advanced entity-aware embedding schemes are promising research avenues.

Conclusion

The work systematically validates open-source, locally deployed RAG QA for EIC scholarly content, achieving high context recall and faithfulness but revealing ongoing bottlenecks in entity recall and factual precision. The results underscore the maturity of RAG for specialized scientific collaborations under privacy and resource constraints, but also delineate clear pathways for enhancing factual grounding and latency. This positions the approach as a scalable foundation for future research assistant tools in large-scale, data-intensive scientific workflows.

Markdown Report Issue