DepsRAG: Dual RAG Systems in Biology & Software

Updated 22 June 2026
  • DepsRAG is the name of two distinct retrieval-augmented generation systems: one supports macromolecular structure deposition, the other software dependency management.
  • In structural biology, it powers the RCSB PDB AI Help Desk, which gives citation-backed, policy-driven guidance through a four-tier architecture.
  • In software engineering, it uses multi-agent reasoning over a knowledge graph to answer dependency queries through structured retrieval.

DepsRAG refers to two distinct systems that leverage Retrieval-Augmented Generation (RAG) architectures in specialized domains: (1) as the internal designation for the RCSB PDB AI Help Desk supporting macromolecular structure deposition in structural biology (Chithari et al., 13 Apr 2026), and (2) as a multi-agent reasoning and planning framework for software dependency management (Alhanahnah et al., 2024). Both implementations share core RAG principles but differ in purpose, architecture, and information retrieval strategies.

1. Definitions and Domains

Structural Biology: RCSB PDB AI Help Desk (“DepsRAG”)

DepsRAG in this context is the internal term for a RAG-based, citation-backed conversational agent deployed by the RCSB Protein Data Bank. It addresses depositor queries using procedural, policy, and validation knowledge extracted from institutional documents, offering guidance to structural biologists involved in deposition workflows (Chithari et al., 13 Apr 2026).

Software Engineering: Multi-Agent Reasoning with RAG

In software dependency management, DepsRAG denotes a multi-agent LLM-powered assistant that builds and operates over a knowledge graph of direct and transitive package dependencies. Developers issue natural-language queries concerning dependency attributes, and DepsRAG orchestrates graph-centric, retrieval-augmented reasoning—integrating facts from the knowledge graph, web sources, and error-driven feedback loops (Alhanahnah et al., 2024).

2. System Architectures and Data Flow

RCSB PDB AI Help Desk

DepsRAG is organized into four logical tiers:

  • Presentation Layer: Web chat interface.
  • Application Layer: FastAPI/uvicorn REST and Server-Sent Events (SSE) endpoints, session management.
  • Inference Layer: RAG pipeline orchestrated by LangChain.
  • Knowledge Base Layer: PostgreSQL with pgvector for dense vector indexing of document chunks.

Key architectural features include:

  • Document embeddings reside in a schema-separated vector store.
  • Session consistency is enforced via a two-phase commit, handling interruptions robustly.
  • Streaming response generation minimizes user latency and supports live interaction (Chithari et al., 13 Apr 2026).

Software Dependency Management

DepsRAG uses a multi-agent architecture based on Langroid, incorporating:

  • DepsRAG Agent (LLM instance): Determines workflow phases and issues queries.
  • KG Retriever: Neo4j-backed component that exposes the knowledge graph schema and executes Cypher queries.
  • Web Search Retriever: Performs external search when the graph does not suffice.
  • Utility Tools: ConstructKGTool builds the dependency knowledge graph using Deps.Dev API; VisualizeKGTool renders visual graphs.

The pipeline is inherently conversational and agentic, dynamically triggering construction and retrieval operations based on user dialogue context (Alhanahnah et al., 2024).

3. Retrieval-Augmented Generation and Knowledge Representation

RCSB PDB AI Help Desk

DepsRAG constructs embeddings from institutionally controlled PDFs, using a two-phase chunking algorithm:

  1. MarkdownTextSplitter: Splits documents on Markdown structural elements to align chunks with coherent sections.
  2. RecursiveCharacterTextSplitter: Further segments text to conform to length and overlap constraints, preserving cross-sentence context.
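
The two phases above can be sketched in plain Python. This is a simplified stand-in, not the LangChain splitters themselves: the function names, the heading regex, and the default `chunk_size` and `overlap` values are assumptions for illustration, and the second phase uses a fixed sliding window where the real recursive splitter prefers natural separators.

```python
import re


def split_on_headings(markdown: str) -> list[str]:
    """Phase 1 (sketch): cut the text just before every Markdown heading line."""
    parts = re.split(r"(?m)^(?=#{1,6} )", markdown)
    return [p.strip() for p in parts if p.strip()]


def split_by_length(section: str, chunk_size: int, overlap: int) -> list[str]:
    """Phase 2 (sketch): fixed-size windows that share `overlap` characters."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    if len(section) <= chunk_size:
        return [section]
    step = chunk_size - overlap
    # Stop once the previous window already reached the end of the section.
    return [section[i:i + chunk_size] for i in range(0, len(section) - overlap, step)]


def two_phase_chunks(markdown: str, chunk_size: int = 200, overlap: int = 40) -> list[str]:
    """Apply the structural split first, then enforce the length limit."""
    chunks: list[str] = []
    for section in split_on_headings(markdown):
        chunks.extend(split_by_length(section, chunk_size, overlap))
    return chunks
```

Splitting on structure first keeps a short section intact as one chunk; only sections longer than `chunk_size` are windowed, and the shared overlap carries context across the cut.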

At query time, a Maximal Marginal Relevance (MMR) strategy selects $k=8$ chunks from a candidate pool of $30$, maximizing both relevance and mutual diversity:

$$D_i = \arg\max_{d \in C \setminus S_{i-1}} \big[ \lambda \cdot \mathrm{Sim}(v_q, v_d) - (1-\lambda) \cdot \max_{d' \in S_{i-1}} \mathrm{Sim}(v_{d'}, v_d) \big]$$

with cosine similarity, $\lambda=0.7$ (Chithari et al., 13 Apr 2026).
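
A minimal sketch of greedy MMR selection follows, assuming embeddings are plain Python lists and candidates are held in a dictionary keyed by chunk id. The names `cosine` and `mmr_select` and the toy vectors are illustrative, not taken from the Help Desk code. For the first pick, when nothing has been selected yet, the redundancy term is taken as zero.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr_select(query_vec, candidates, k=8, lam=0.7):
    """Greedily pick up to k ids from `candidates` (id -> vector) by MMR score."""
    remaining = dict(candidates)
    selected: list[str] = []
    while remaining and len(selected) < k:
        def score(doc_id):
            relevance = cosine(query_vec, remaining[doc_id])
            # Highest similarity to anything already chosen (0.0 on the first pick).
            redundancy = max(
                (cosine(candidates[s], remaining[doc_id]) for s in selected),
                default=0.0,
            )
            return lam * relevance - (1 - lam) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        del remaining[best]
    return selected


docs = {"a": [1.0, 0.0], "a_copy": [1.0, 0.0], "b": [0.0, 1.0]}
print(mmr_select([1.7, 1.0], docs, k=2, lam=0.7))  # ['a', 'b']
print(mmr_select([1.7, 1.0], docs, k=2, lam=1.0))  # ['a', 'a_copy']
```

With $\lambda=0.7$ the exact duplicate `a_copy` is penalized enough that the less relevant but novel chunk `b` is chosen second; with $\lambda=1$ the diversity term vanishes and the duplicate wins.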

Software Dependency Management

DepsRAG builds a knowledge graph with entities $(\text{package name}, \text{version})$ and edges labeled depends_on:

$$KG = \{ (e_i, \text{depends\_on}, e_j) \mid e_i, e_j \in \mathbb{E} \}$$

Natural-language queries are translated into Cypher graph queries via LLM prompt engineering, and results are retrieved via API calls. If the graph alone cannot answer a query, a web search is invoked, and both results are merged for answer generation. Unlike vector-based approaches, retrieval here is structured and does not rely on embeddings (Alhanahnah et al., 2024).
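
To make the triple representation concrete, the sketch below holds the graph in memory as an adjacency dictionary and walks it breadth-first to collect transitive dependencies. It is an illustration only: the real system stores the graph in Neo4j and queries it with Cypher, and the helper names and example packages here are assumptions.

```python
from collections import deque

Package = tuple[str, str]  # (package name, version)


def build_kg(triples):
    """Turn (e_i, 'depends_on', e_j) triples into an adjacency dictionary."""
    kg: dict[Package, set[Package]] = {}
    for src, relation, dst in triples:
        if relation != "depends_on":
            raise ValueError(f"unsupported relation: {relation}")
        kg.setdefault(src, set()).add(dst)
        kg.setdefault(dst, set())  # make sure leaf packages appear as nodes
    return kg


def transitive_dependencies(kg, root: Package) -> set[Package]:
    """Breadth-first walk over depends_on edges, excluding the root itself."""
    seen: set[Package] = set()
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for dep in kg.get(node, ()):
            if dep != root and dep not in seen:
                seen.add(dep)
                queue.append(dep)
    return seen


# Hypothetical example data, not taken from the paper.
triples = [
    (("app", "1.0"), "depends_on", ("requests", "2.31.0")),
    (("requests", "2.31.0"), "depends_on", ("urllib3", "2.0.7")),
    (("requests", "2.31.0"), "depends_on", ("idna", "3.4")),
]
kg = build_kg(triples)
print(len(transitive_dependencies(kg, ("app", "1.0"))))  # 3
```

Because nodes are (name, version) pairs, two versions of the same package are separate entities, and the walk terminates even if the graph contains a cycle.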

4. Guardrails, Prompt Engineering, and Error Mitigation

RCSB PDB AI Help Desk

Several guardrails enforce accuracy, privacy, and domain appropriateness:

  • Topical Filtering: Queries are pre-classified by an LLM for scope compliance, and out-of-domain questions are refused.
  • System Prompt Protections: Prompts explicitly forbid biocurator-only jargon and internal identifiers in outputs. Staff names, emails, internal workflow codes, and other restricted content are also categorically blocked.
  • Citation-Backed Responses: All substantive answers are required to provide inline citations to their source documentation.
  • Dual-LLM Pattern: Separate LLM instances handle query condensation and end-user QA, isolating context reformulation from knowledge-grounded answer generation (Chithari et al., 13 Apr 2026).
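
The ordering of these guardrails can be sketched as a small pipeline. Everything here is hypothetical scaffolding: the four callables stand in for the scope classifier, the condensing LLM, the retriever, and the answering LLM, and the refusal text is invented for illustration.

```python
REFUSAL = "This assistant only answers questions about PDB deposition."  # hypothetical wording


def answer_with_guardrails(question, history, in_scope, condense, retrieve, generate):
    """Run one chat turn: scope check, query condensation, retrieval, grounded answer."""
    # Topical filter: refuse before any retrieval or generation happens.
    if not in_scope(question):
        return REFUSAL
    # First LLM role: rewrite the follow-up into a standalone query using chat history.
    standalone = condense(history, question)
    # Retrieval runs on the condensed query, not on the raw user turn.
    chunks = retrieve(standalone)
    # Second LLM role: answer only from the retrieved chunks, with citations.
    return generate(standalone, chunks)
```

Keeping condensation and answering as separate calls means the answering model sees only the standalone query and the retrieved chunks.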

Software Dependency Management

  • Schema Error Propagation: When LLM-generated Cypher queries fail or return errors, these are surfaced within the workflow, enabling automatic re-querying or schema introspection.
  • No Formal Critic-Agent: The current version relies on error feedback, but future designs anticipate explicit critic agents to cross-validate LLM-generated answers against KG invariants and trigger corrective cycles (Alhanahnah et al., 2024).
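
The error-feedback loop can be sketched as follows. The callables `generate_cypher` and `run_query` are placeholders for the LLM call and the Neo4j execution step, and the `max_retries` default is an assumption, not a documented setting.

```python
def query_with_feedback(question, generate_cypher, run_query, max_retries=2):
    """Generate a query, run it, and feed any error text back into the next attempt.

    Returns (rows, retries_used). Raises RuntimeError if every attempt fails.
    """
    feedback = None
    for attempt in range(max_retries + 1):
        cypher = generate_cypher(question, feedback)
        try:
            return run_query(cypher), attempt
        except Exception as exc:  # broad on purpose in this sketch
            # Surface the failure so the next generation can correct itself.
            feedback = f"Query {cypher!r} failed: {exc}"
    raise RuntimeError(f"no valid query after {max_retries} retries; {feedback}")
```

The second element of the returned tuple is the number of retries used, analogous to the retry counts reported in the evaluation table.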

5. Evaluation and Metrics

RCSB PDB AI Help Desk

  • Knowledge Base Build Time: Less than 2 minutes for ~183 chunks (~92,000 tokens).
  • Deterministic Outputs: Enabled by temperature zero throughout, producing reproducible results.
  • Citation Fidelity: 100% of substantive claims are citation-backed.
  • Operational Metrics: Query-to-first-token latency is on the order of seconds; throughput is capped to 10 requests/min per chat endpoint.
  • Robustness: Embedding success rate is 100%; ingestion pipeline self-recovers from vector-store outages, polling at 10-second intervals (Chithari et al., 13 Apr 2026).
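
One common way to enforce a cap such as 10 requests per minute is a sliding-window limiter. The source does not say which mechanism the Help Desk uses, so the class below is an assumed illustration; the clock is passed in explicitly to keep the example deterministic.

```python
from collections import deque


class SlidingWindowLimiter:
    """Allow at most `max_requests` calls in any `window_seconds` interval."""

    def __init__(self, max_requests: int = 10, window_seconds: float = 60.0):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._stamps: deque[float] = deque()

    def allow(self, now: float) -> bool:
        """Record and accept the request at time `now`, or reject it."""
        # Forget requests that have aged out of the window.
        while self._stamps and now - self._stamps[0] >= self.window_seconds:
            self._stamps.popleft()
        if len(self._stamps) < self.max_requests:
            self._stamps.append(now)
            return True
        return False
```

In a live service `now` would come from a monotonic clock such as `time.monotonic()`, with one limiter instance per chat endpoint.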

Software Dependency Management

Evaluation comprises multi-step reasoning benchmarks on real-world dependency graphs:

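| Reasoning Task   | #Steps | GPT-4-Turbo Correct | Llama-3 Correct              |
|------------------|--------|---------------------|------------------------------|
| Graph depth      | 1      | Yes                 | Yes (1 retry)                |
| Cycle detection  | 1      | Yes                 | Yes                          |
| Path-chain count | 1      | Yes                 | No (incorrect, 2 retries)    |
| Graph density    | 2      | Yes                 | No (used undirected formula) |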

GPT-4-Turbo answered all four tasks correctly; Llama-3 failed two of them, answering the path-chain count incorrectly and applying the undirected formula in the multi-step graph density computation (Alhanahnah et al., 2024).
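
The graph density error is easy to reproduce. For a graph with $|V|$ nodes and $|E|$ edges, the standard directed density is $|E| / (|V|(|V|-1))$, while the undirected form is $2|E| / (|V|(|V|-1))$, so the undirected formula gives double the directed value for the same graph. The sketch below uses illustrative function names and a made-up graph size.

```python
def directed_density(num_nodes: int, num_edges: int) -> float:
    """Density of a directed graph: edges over all ordered node pairs."""
    if num_nodes < 2:
        return 0.0
    return num_edges / (num_nodes * (num_nodes - 1))


def undirected_density(num_nodes: int, num_edges: int) -> float:
    """Density of an undirected graph: edges over all unordered node pairs."""
    if num_nodes < 2:
        return 0.0
    return 2 * num_edges / (num_nodes * (num_nodes - 1))


# A dependency graph is directed, so 5 packages and 4 depends_on edges give:
print(directed_density(5, 4))    # 0.2
print(undirected_density(5, 4))  # 0.4, the value the undirected formula would report
```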

6. Limitations, Open Challenges, and Future Directions

RCSB PDB AI Help Desk

  • Scope: Restricted to static institutional documentation; real-time, record-specific, or personalized queries are deferred to human operators.
  • Knowledge Updates: Updated procedure and policy documents must be re-ingested on an ongoing basis to avoid information gaps.
  • Future Enhancements: Plans include API-based integration for live data, broader corpus coverage across archival domains, AI-powered annotation workflow tools, and active learning via a user feedback loop (Chithari et al., 13 Apr 2026).

Software Dependency Management

  • Knowledge Graph Incompleteness: The Deps.Dev API omits optional/intra-ecosystem dependencies; knowledge graphs may lack relevant nodes/edges.
  • Lack of Vulnerability Metadata: Current KGs do not annotate nodes or edges with known security vulnerabilities; integration of CVE feeds is proposed.
  • Absence of Critic Agents: Without explicit feedback agents for semantic validation, LLM errors may go uncorrected.
  • Elaboration Paths: Future work involves parsing local manifests for finer-grained dependency capture, automated SBOM generation, conflict-free upgrade suggestions, and integration of real-time vulnerability databases (Alhanahnah et al., 2024).

7. Comparative Context and Significance

DepsRAG exemplifies the trend of tailoring RAG architectures to highly specialized domains with complex procedural or structural information needs, using strong guardrail mechanisms and targeted retrieval strategies. In structural biology, this enables scalable, citation-backed support for expert-driven data curation. In software engineering, it supports comprehensive, conversational dependency graph reasoning with selective expansion to external sources and agentic feedback. Both implementations align with broader requirements for trustworthy, grounded LLM output and ongoing operational oversight.

Notably, these architectures contrast with document- or example-centric RAG models such as those leveraging speech timing information in clinical LLM diagnosis (Zhang et al., 16 Feb 2025) or those addressing differentially private synthetic RAG databases (Mori et al., 8 Oct 2025), illustrating the diversity of RAG instantiations across research domains.
