SPARQL-LLM: LLM-Enhanced SPARQL Queries
- SPARQL-LLM is a system that converts natural language questions into executable SPARQL queries by integrating LLMs with knowledge graph schema and metadata.
- It employs retrieval-augmented generation with embedded question/query (Q/Q) examples and schema snippets to minimize hallucinations and ensure query accuracy.
- The approach uses automated validation and federated query decomposition to reliably execute multi-endpoint queries with over 90% F1 performance.
SPARQL-LLM translates natural language questions into executable SPARQL queries over Knowledge Graphs (KGs) by integrating LLMs with knowledge graph schema and metadata. It systematically combines retrieval-augmented generation (RAG), schema and example integration, prompt construction, automated validation, and iterative correction to support accurate and federated query generation, especially over bioinformatics knowledge graphs (Emonet et al., 8 Oct 2024).
1. System Architecture and Workflow
SPARQL-LLM is structured around four primary modules:
- Indexing Module: Onboarding of SPARQL endpoints involves fetching their VoID descriptions (Vocabulary of Interlinked Datasets), extracting example question/query (Q/Q) pairs and schema information, generating embeddings, and constructing a vector database for similarity-based retrieval.
- Retrieval Component: At runtime, the input question is embedded and used to retrieve the most similar example Q/Q pairs and class schema snippets from the vector index, based on cosine similarity between the question and indexed embeddings.
- Generation Component: A prompt is synthesized for the LLM, including the user question, the retrieved examples, and the most relevant schema descriptions. The prompt is designed to enforce strict generation of valid SPARQL and federated queries.
- Validation & Correction Module: Generated SPARQL is parsed and validated against the schemas of each endpoint. Invalid triple patterns—those using predicates or classes not allowed in the corresponding schema—get flagged, and the prompt is augmented with error feedback for regeneration until a valid query is produced or a maximum correction depth is reached.
The dataflow is as follows:
- Onboarding: endpoints → VoID, Q/Q pairs, embeddings, vector index.
- User query: embed, retrieve context, build prompt, call LLM.
- Validation: parse query, type-check triples against schemas, flag errors.
- Correction loop: prompt updated with error feedback, regeneration until success.
- Execution: valid SPARQL with federated SERVICE blocks sent to respective endpoints; results joined and returned.
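A minimal orchestration sketch of this dataflow, with the individual modules injected as callables (the helper names and the correction budget here are hypothetical, not identifiers from the published package):

```python
# Sketch of the end-to-end dataflow; every callable is a hypothetical stand-in
# for the corresponding SPARQL-LLM module described above.
def answer_question(question, retrieve_context, build_prompt, call_llm,
                    validate_query, execute_federated, max_corrections=3):
    examples, schemas = retrieve_context(question)       # vector search over the index
    prompt = build_prompt(question, examples, schemas)   # RAG prompt construction
    for _ in range(max_corrections + 1):
        query = call_llm(prompt)                          # LLM emits a SPARQL candidate
        errors = validate_query(query)                    # type-check triples against schemas
        if not errors:
            return execute_federated(query)               # run SERVICE blocks, join results
        prompt += "\nThe previous query had errors:\n" + "\n".join(errors)
    raise RuntimeError("no valid query within the correction budget")
```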
2. Retrieval-Augmented Generation and Metadata Integration
The RAG pipeline is formalized as:
Embedding and Indexing:
- Example Q/Q pairs from each endpoint are embedded using BAAI/bge-large-en-v1.5.
- Class schemas (human-readable ShEx shapes with labels and comments) are also embedded, with both types indexed in Qdrant.
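A minimal indexing sketch under the stated choices (BAAI/bge-large-en-v1.5 embeddings, Qdrant), using the sentence-transformers and qdrant-client libraries; the collection name, payload fields, and sample documents are illustrative, not the package's actual layout:

```python
# Sketch of the indexing step: embed Q/Q pairs and class descriptions,
# then store them with payloads in a Qdrant collection.
from sentence_transformers import SentenceTransformer
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams

model = SentenceTransformer("BAAI/bge-large-en-v1.5")   # 1024-dimensional embeddings
client = QdrantClient(":memory:")                        # in-memory index for the sketch

# Illustrative documents: an example Q/Q pair and a class schema description.
docs = [
    {"kind": "example", "text": "Which proteins are encoded by gene X?", "query": "SELECT ..."},
    {"kind": "schema",  "text": "Class ex:Protein with properties rdfs:label, ex:organism ..."},
]

client.create_collection(
    collection_name="sparql_llm",
    vectors_config=VectorParams(size=1024, distance=Distance.COSINE),
)
vectors = model.encode([d["text"] for d in docs], normalize_embeddings=True)
client.upsert(
    collection_name="sparql_llm",
    points=[PointStruct(id=i, vector=v.tolist(), payload=d)
            for i, (v, d) in enumerate(zip(vectors, docs))],
)
```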
Retrieval Scoring:
- Given the query embedding $\mathbf{e}_q$, cosine similarity $\mathrm{sim}(\mathbf{e}_q, \mathbf{e}_i) = \frac{\mathbf{e}_q \cdot \mathbf{e}_i}{\lVert \mathbf{e}_q \rVert \, \lVert \mathbf{e}_i \rVert}$ against each indexed embedding $\mathbf{e}_i$ is used to select the top-20 examples and the top-15 class schemas.
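Retrieval can then be sketched as a filtered similarity search over the collection built above (storing both item types in one collection and filtering by a kind payload field is an assumption; separate collections would work equally well):

```python
# Sketch of retrieval at question time, reusing `model` and `client` from the
# indexing sketch; payload-based filtering is an illustrative choice.
from qdrant_client.models import FieldCondition, Filter, MatchValue

def top_payloads(question: str, kind: str, limit: int):
    q_vec = model.encode(question, normalize_embeddings=True).tolist()
    hits = client.search(
        collection_name="sparql_llm",
        query_vector=q_vec,
        query_filter=Filter(must=[FieldCondition(key="kind", match=MatchValue(value=kind))]),
        limit=limit,
    )
    return [hit.payload for hit in hits]

examples = top_payloads("Which human proteins interact with BRCA1?", "example", limit=20)
schemas = top_payloads("Which human proteins interact with BRCA1?", "schema", limit=15)
```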
Prompt Construction:
- User block: "User question: [q]"
- Example block: top-20 Q/Q pairs as "Relevant examples".
- Schema block: top-15 class shapes as "Relevant schemas".
- System instruction: enforce answer style and ban hallucination of predicates/classes.
- Request: "Please write a SPARQL query."
3. Hallucination Mitigation and Query Validation
Prompt Constraints
- The LLM receives strict instructions: only emit federated SPARQL that adheres to schema, do not invent classes or predicates, state "Cannot answer" if impossible.
Validation Module
- Parse the predicted SPARQL; for each triple pattern (s, p, o) in the query:
- Assign (s, p, o) to an endpoint E (via its enclosing SERVICE clause or VoID coverage).
- If s has an explicit type C, check that p is among the predicates allowed for C by the ShEx shape.
- If p is invalid, generate an error message: "In endpoint E, for class C, predicate p is invalid. Allowed: p1, …, pn."
- If errors are detected, the correction prompt is augmented with these messages and the LLM is re-invoked.
- Correction iterates until a valid query is produced or a maximum number of steps is reached.
This automated correction is essential to reach accuracy levels above 90% F1, as illustrated in evaluation (Section 5).
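A simplified validation sketch: it assumes triple patterns have already been extracted and assigned to endpoints (the real module parses the generated SPARQL and uses VoID/ShEx for this), and the schema table below is a toy example with placeholder identifiers:

```python
# Toy schema table: (endpoint, class) -> predicates permitted by the ShEx shape.
ALLOWED_PREDICATES = {
    ("https://sparql.example.org/proteins", "ex:Protein"): {"rdfs:label", "ex:organism"},
}

def validate(patterns_by_endpoint):
    """patterns_by_endpoint: {endpoint: [(subject_class, predicate), ...]}.
    Returns one error message per out-of-schema predicate."""
    errors = []
    for endpoint, patterns in patterns_by_endpoint.items():
        for subject_class, predicate in patterns:
            allowed = ALLOWED_PREDICATES.get((endpoint, subject_class))
            if allowed is not None and predicate not in allowed:
                errors.append(
                    f"In endpoint {endpoint}, for class {subject_class}, "
                    f"predicate {predicate} is invalid. Allowed: {sorted(allowed)}."
                )
    return errors
```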
4. Federated Query Decomposition and Execution
SPARQL-LLM natively supports federated queries across multiple endpoints:
Endpoint Management
- Each endpoint must serve VoID metadata. CLI tools fetch and index VoID patterns.
Query Decomposition
- Given a raw SPARQL query, triple patterns are assigned to the endpoints whose VoID descriptions cover them.
- Patterns are grouped by endpoint and inserted into SERVICE blocks (see the sketch after this list).
- The final query merges SERVICE block outputs for execution.
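A sketch of the rewriting step once patterns have been grouped by endpoint (the endpoint URLs, prefixes, and patterns below are placeholders, and prefix declarations are omitted for brevity):

```python
# Sketch of federated rewriting: wrap each endpoint's triple patterns in a
# SERVICE block and merge them into a single SELECT query.
def build_federated_query(select_vars, patterns_by_endpoint):
    """patterns_by_endpoint: {endpoint_url: [triple-pattern strings]}."""
    services = []
    for endpoint, patterns in patterns_by_endpoint.items():
        body = " .\n    ".join(patterns)
        services.append(f"  SERVICE <{endpoint}> {{\n    {body} .\n  }}")
    return f"SELECT {' '.join(select_vars)} WHERE {{\n" + "\n".join(services) + "\n}"

print(build_federated_query(
    ["?protein", "?expression"],
    {
        "https://sparql.example.org/proteins": ["?protein a ex:Protein"],
        "https://sparql.example.org/expression": ["?expression ex:about ?protein"],
    },
))
```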
Workflow
- Input question → embedding and retrieval → prompt → LLM → (raw) query → validation loop → federated rewriting (if needed) → endpoint execution.
5. Evaluation, Use Cases, and Error Modes
Empirical Results
Experiments on 13 held-out questions requiring reasoning across UniProt, Bgee, and OMA endpoints compared three approaches:
- LLM only: F1 = 0.08 (3 of 39 queries correct).
- RAG, no validation: F1 = 0.85 (33 correct, only 5 not returned, 1 error).
- RAG + validation: F1 = 0.91 (34 correct, 3 plausible but not exact).
Key findings:
- RAG yields large accuracy improvements.
- Validation and correction is critical to achieve >90% F1 (Emonet et al., 8 Oct 2024).
Example
A representative example: for the natural-language question "Which human proteins interact with BRCA1 and are annotated with disease X?", RAG retrieves relevant examples and schemas, the LLM is prompted and generates a federated SPARQL query with SERVICE blocks for each endpoint, and validation ensures type and predicate conformity. Only when validation passes is the query executed; the shape of such a query is sketched below.
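The generated query for such a question would have roughly the following shape (the vocabulary uses a placeholder ex: namespace; the real endpoints' classes and predicates differ):

```python
# Illustrative shape of the federated query the LLM is expected to emit;
# ex: is a placeholder namespace, not the real endpoints' vocabulary.
EXAMPLE_FEDERATED_QUERY = """
PREFIX ex: <https://example.org/vocab/>
SELECT ?protein ?disease WHERE {
  SERVICE <https://sparql.example.org/proteins> {
    ?protein a ex:Protein ;
             ex:interactsWith ex:BRCA1 ;
             ex:organism ex:Human .
  }
  SERVICE <https://sparql.example.org/diseases> {
    ?protein ex:associatedWith ?disease .
    ?disease ex:label "disease X" .
  }
}
"""
```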
Error Reduction
- Most of the improvement stems from barring out-of-schema predicates and from the automated correction messages.
- Both gpt-4o and smaller models such as Mixtral and Llama3-8B benefit from the validation loop.
6. Extensibility and Real-World Deployment
SPARQL-LLM is published as open-source software (the sparql-LLM package on PyPI and GitHub) and is containerized for straightforward deployment:
- Adaptability: The modular architecture (indexing, retrieval, prompting, validation, federated query rewriting) allows extension to KGs in any domain, as long as endpoints expose VoID or equivalent lightweight metadata.
- Deployment: Powers bioinformatics QA at chat.expasy.org, supporting life-science researchers in querying decentralized KGs.
- Integration: The system can be wrapped as a microservice (GET /ask?dataset=...&question=...), returning the query, its results, and a natural language answer; a minimal wrapper is sketched after this list.
- Reusability: Designed to support rapid extension or integration into other KGC ecosystems.
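A minimal sketch of such a wrapper, assuming FastAPI (the actual deployed service may use a different framework); run_pipeline is a placeholder for the retrieval, generation, validation, and execution steps sketched earlier:

```python
# Minimal microservice wrapper sketch; FastAPI is an assumed choice and
# run_pipeline is a placeholder for the full SPARQL-LLM pipeline.
from fastapi import FastAPI

app = FastAPI()

def run_pipeline(dataset: str, question: str):
    """Placeholder: retrieve context, generate and validate SPARQL, execute it."""
    return "SELECT ...", [], "Cannot answer"

@app.get("/ask")
def ask(dataset: str, question: str):
    query, results, answer = run_pipeline(dataset, question)
    return {"query": query, "results": results, "answer": answer}
```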
7. Significance and Best Practices
The SPARQL-LLM approach demonstrates that:
- Embedding-based context selection and hybrid retrieval of schema/examples align LLM generations closely with real KG structure, minimizing hallucination.
- Automated iterative validation and correction are essential for high semantic precision and recall, especially in federated and complex KGQA scenarios.
- Explicit prompt constraints and schema-aware validation mitigate structural and semantic errors that otherwise plague naive LLM-to-SPARQL approaches.
- Federated reasoning is handled natively via endpoint assignment and SERVICE decomposition: the system can support complex, multi-KG queries with automated service routing.
Sustained high accuracy (F1 > 0.9) and reduced operator overhead position SPARQL-LLM as a production-ready baseline for natural language question answering over distributed knowledge graphs (Emonet et al., 8 Oct 2024).