SPARQL-LLM: LLM-Enhanced SPARQL Queries
- SPARQL-LLM is a system that converts natural language questions into executable SPARQL queries by integrating LLMs with knowledge graph schema and metadata.
- It employs retrieval-augmented generation with embedded question/query (Q/Q) examples and schema snippets to minimize hallucinations and ensure query accuracy.
- The approach uses automated validation and federated query decomposition to reliably execute multi-endpoint queries with over 90% F1 performance.
SPARQL-LLM translates natural language questions into executable SPARQL queries over Knowledge Graphs (KGs) by integrating LLMs with knowledge graph schema and metadata. It systematically combines retrieval-augmented generation (RAG), schema and example integration, prompt construction, automated validation, and iterative correction to support accurate and federated query generation, especially over bioinformatics knowledge graphs (Emonet et al., 8 Oct 2024).
1. System Architecture and Workflow
SPARQL-LLM is structured around four primary modules:
- Indexing Module: Onboarding of SPARQL endpoints involves fetching their VoID descriptions (Vocabulary of Interlinked Datasets), extracting example question/query (Q/Q) pairs and schema information, generating embeddings, and constructing a vector database for similarity-based retrieval.
- Retrieval Component: At runtime, the input question is embedded and used to retrieve the most similar example Q/Q pairs and class schema snippets from the vector index, based on cosine similarity between the question and indexed embeddings.
- Generation Component: A prompt is synthesized for the LLM, including the user question, the retrieved examples, and the most relevant schema descriptions. The prompt is designed to enforce strict generation of valid SPARQL and federated queries.
- Validation & Correction Module: Generated SPARQL is parsed and validated against the schemas of each endpoint. Invalid triple patterns—those using predicates or classes not allowed in the corresponding schema—get flagged, and the prompt is augmented with error feedback for regeneration until a valid query is produced or a maximum correction depth is reached.
The dataflow is as follows:
- Onboarding: endpoints → VoID, Q/Q pairs, embeddings, vector index.
- User query: embed, retrieve context, build prompt, call LLM.
- Validation: parse query, type-check triples against schemas, flag errors.
- Correction loop: prompt updated with error feedback, regeneration until success.
- Execution: valid SPARQL with federated SERVICE blocks sent to respective endpoints; results joined and returned.
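A minimal orchestration sketch of this dataflow, with the individual modules injected as callables (the helper names and the correction budget here are hypothetical, not identifiers from the published package):

```python
# Sketch of the end-to-end dataflow; every callable is a hypothetical stand-in
# for the corresponding SPARQL-LLM module described above.
def answer_question(question, retrieve_context, build_prompt, call_llm,
                    validate_query, execute_federated, max_corrections=3):
    examples, schemas = retrieve_context(question)       # vector search over the index
    prompt = build_prompt(question, examples, schemas)   # RAG prompt construction
    for _ in range(max_corrections + 1):
        query = call_llm(prompt)                          # LLM emits a SPARQL candidate
        errors = validate_query(query)                    # type-check triples against schemas
        if not errors:
            return execute_federated(query)               # run SERVICE blocks, join results
        prompt += "\nThe previous query had errors:\n" + "\n".join(errors)
    raise RuntimeError("no valid query within the correction budget")
```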
2. Retrieval-Augmented Generation and Metadata Integration
The RAG pipeline is formalized as:
Embedding and Indexing:
- Example Q/Q pairs from each endpoint are embedded using BAAI/bge-large-en-v1.5.
- Class schemas (human-readable ShEx shapes with labels and comments) are also embedded, with both types indexed in Qdrant.
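A minimal indexing sketch under the stated choices (BAAI/bge-large-en-v1.5 embeddings, Qdrant), using the sentence-transformers and qdrant-client libraries; the collection name, payload fields, and sample documents are illustrative, not the package's actual layout:

```python
# Sketch of the indexing step: embed Q/Q pairs and class descriptions,
# then store them with payloads in a Qdrant collection.
from sentence_transformers import SentenceTransformer
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams

model = SentenceTransformer("BAAI/bge-large-en-v1.5")   # 1024-dimensional embeddings
client = QdrantClient(":memory:")                        # in-memory index for the sketch

# Illustrative documents: an example Q/Q pair and a class schema description.
docs = [
    {"kind": "example", "text": "Which proteins are encoded by gene X?", "query": "SELECT ..."},
    {"kind": "schema",  "text": "Class ex:Protein with properties rdfs:label, ex:organism ..."},
]

client.create_collection(
    collection_name="sparql_llm",
    vectors_config=VectorParams(size=1024, distance=Distance.COSINE),
)
vectors = model.encode([d["text"] for d in docs], normalize_embeddings=True)
client.upsert(
    collection_name="sparql_llm",
    points=[PointStruct(id=i, vector=v.tolist(), payload=d)
            for i, (v, d) in enumerate(zip(vectors, docs))],
)
```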
Retrieval Scoring:
- Given the query embedding $\mathbf{e}_q$, cosine similarity $\mathrm{sim}(\mathbf{e}_q, \mathbf{e}_i) = \frac{\mathbf{e}_q \cdot \mathbf{e}_i}{\lVert \mathbf{e}_q \rVert \, \lVert \mathbf{e}_i \rVert}$ against each indexed embedding $\mathbf{e}_i$ is used to select the top-20 examples and the top-15 class schemas.
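Retrieval can then be sketched as a filtered similarity search over the collection built above (storing both item types in one collection and filtering by a kind payload field is an assumption; separate collections would work equally well):

```python
# Sketch of retrieval at question time, reusing `model` and `client` from the
# indexing sketch; payload-based filtering is an illustrative choice.
from qdrant_client.models import FieldCondition, Filter, MatchValue

def top_payloads(question: str, kind: str, limit: int):
    q_vec = model.encode(question, normalize_embeddings=True).tolist()
    hits = client.search(
        collection_name="sparql_llm",
        query_vector=q_vec,
        query_filter=Filter(must=[FieldCondition(key="kind", match=MatchValue(value=kind))]),
        limit=limit,
    )
    return [hit.payload for hit in hits]

examples = top_payloads("Which human proteins interact with BRCA1?", "example", limit=20)
schemas = top_payloads("Which human proteins interact with BRCA1?", "schema", limit=15)
```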
Prompt Construction:
- User block: "User question: [q]"
- Example block: top-20 Q/Q pairs as "Relevant examples".
- Schema block: top-15 class shapes as "Relevant schemas".
- System instruction: enforce answer style and ban hallucination of predicates/classes.
- Request: "Please write a SPARQL query."
3. Hallucination Mitigation and Query Validation
Prompt Constraints
- The LLM receives strict instructions: only emit federated SPARQL that adheres to schema, do not invent classes or predicates, state "Cannot answer" if impossible.
Validation Module
- Parse the predicted SPARQL; for each triple pattern (s, p, o) in the query:
- Assign (s, p, o) to an endpoint E (via its enclosing SERVICE clause or VoID coverage).
- If s has an explicit type C, check that p is among the predicates allowed for C by the ShEx shape.
- If p is invalid, generate an error message: "In endpoint E, for class C, predicate p is invalid. Allowed: p1, …, pn."
- If errors are detected, the correction prompt is augmented with these messages and the LLM is re-invoked.
- Correction iterates until a valid query is produced or a maximum number of steps is reached.
This automated correction is essential to reach accuracy levels above 90% F1, as illustrated in evaluation (Section 5).
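A simplified validation sketch: it assumes triple patterns have already been extracted and assigned to endpoints (the real module parses the generated SPARQL and uses VoID/ShEx for this), and the schema table below is a toy example with placeholder identifiers:

```python
# Toy schema table: (endpoint, class) -> predicates permitted by the ShEx shape.
ALLOWED_PREDICATES = {
    ("https://sparql.example.org/proteins", "ex:Protein"): {"rdfs:label", "ex:organism"},
}

def validate(patterns_by_endpoint):
    """patterns_by_endpoint: {endpoint: [(subject_class, predicate), ...]}.
    Returns one error message per out-of-schema predicate."""
    errors = []
    for endpoint, patterns in patterns_by_endpoint.items():
        for subject_class, predicate in patterns:
            allowed = ALLOWED_PREDICATES.get((endpoint, subject_class))
            if allowed is not None and predicate not in allowed:
                errors.append(
                    f"In endpoint {endpoint}, for class {subject_class}, "
                    f"predicate {predicate} is invalid. Allowed: {sorted(allowed)}."
                )
    return errors
```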
4. Federated Query Decomposition and Execution
SPARQL-LLM natively supports federated queries across multiple endpoints:
Endpoint Management
- Each endpoint must serve VoID metadata. CLI tools fetch and index VoID patterns.
Query Decomposition
- Given a raw SPARQL query, triple patterns are assigned to the endpoints whose VoID descriptions cover them.
- Patterns are grouped by endpoint and inserted into SERVICE blocks (see the sketch after this list).
- The final query merges SERVICE block outputs for execution.
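A sketch of the rewriting step once patterns have been grouped by endpoint (the endpoint URLs, prefixes, and patterns below are placeholders, and prefix declarations are omitted for brevity):

```python
# Sketch of federated rewriting: wrap each endpoint's triple patterns in a
# SERVICE block and merge them into a single SELECT query.
def build_federated_query(select_vars, patterns_by_endpoint):
    """patterns_by_endpoint: {endpoint_url: [triple-pattern strings]}."""
    services = []
    for endpoint, patterns in patterns_by_endpoint.items():
        body = " .\n    ".join(patterns)
        services.append(f"  SERVICE <{endpoint}> {{\n    {body} .\n  }}")
    return f"SELECT {' '.join(select_vars)} WHERE {{\n" + "\n".join(services) + "\n}"

print(build_federated_query(
    ["?protein", "?expression"],
    {
        "https://sparql.example.org/proteins": ["?protein a ex:Protein"],
        "https://sparql.example.org/expression": ["?expression ex:about ?protein"],
    },
))
```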
Workflow
- Input question → embedding and retrieval → prompt → LLM → (raw) query → validation loop → federated rewriting (if needed) → endpoint execution.
5. Evaluation, Use Cases, and Error Modes
Empirical Results
Experiments on 13 held-out questions requiring reasoning across UniProt, Bgee, and OMA endpoints compared three approaches:
- LLM only: F1 = 0.08 (3 of 39 queries correct).
- RAG, no validation: F1 = 0.85 (33 correct, only 5 not returned, 1 error).
- RAG + validation: F1 = 0.91 (34 correct, 3 plausible but not exact).
Key findings:
- RAG yields large accuracy improvements.
- Validation and correction is critical to achieve >90% F1 (Emonet et al., 8 Oct 2024).
Example
A representative example: for the natural-language question "Which human proteins interact with BRCA1 and are annotated with disease X?", RAG retrieves relevant examples and schemas, the LLM is prompted and generates a federated SPARQL query with SERVICE blocks for each endpoint, and validation ensures type and predicate conformity. Only when validation passes is the query executed; the shape of such a query is sketched below.
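The generated query for such a question would have roughly the following shape (the vocabulary uses a placeholder ex: namespace; the real endpoints' classes and predicates differ):

```python
# Illustrative shape of the federated query the LLM is expected to emit;
# ex: is a placeholder namespace, not the real endpoints' vocabulary.
EXAMPLE_FEDERATED_QUERY = """
PREFIX ex: <https://example.org/vocab/>
SELECT ?protein ?disease WHERE {
  SERVICE <https://sparql.example.org/proteins> {
    ?protein a ex:Protein ;
             ex:interactsWith ex:BRCA1 ;
             ex:organism ex:Human .
  }
  SERVICE <https://sparql.example.org/diseases> {
    ?protein ex:associatedWith ?disease .
    ?disease ex:label "disease X" .
  }
}
"""
```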
Error Reduction
- Most of the improvement stems from barring out-of-schema predicates and from the automated correction messages.
- Both gpt-4o and smaller models such as Mixtral and Llama3-8B benefit from the validation loop.
6. Extensibility and Real-World Deployment
SPARQL-LLM is published as open-source software (the sparql-LLM package on PyPI and GitHub) and is containerized for straightforward deployment:
- Adaptability: The modular architecture (indexing, retrieval, prompting, validation, federated query rewriting) allows extension to KGs in any domain, as long as endpoints expose VoID or equivalent lightweight metadata.
- Deployment: Powers bioinformatics QA at chat.expasy.org, supporting life-science researchers in querying decentralized KGs.
- Integration: The system can be wrapped as a microservice (GET /ask?dataset=...&question=...), returning the query, its results, and a natural language answer; a minimal wrapper is sketched after this list.
- Reusability: Designed to support rapid extension or integration into other KGC ecosystems.
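A minimal sketch of such a wrapper, assuming FastAPI (the actual deployed service may use a different framework); run_pipeline is a placeholder for the retrieval, generation, validation, and execution steps sketched earlier:

```python
# Minimal microservice wrapper sketch; FastAPI is an assumed choice and
# run_pipeline is a placeholder for the full SPARQL-LLM pipeline.
from fastapi import FastAPI

app = FastAPI()

def run_pipeline(dataset: str, question: str):
    """Placeholder: retrieve context, generate and validate SPARQL, execute it."""
    return "SELECT ...", [], "Cannot answer"

@app.get("/ask")
def ask(dataset: str, question: str):
    query, results, answer = run_pipeline(dataset, question)
    return {"query": query, "results": results, "answer": answer}
```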
7. Significance and Best Practices
The SPARQL-LLM approach demonstrates that:
- Embedding-based context selection and hybrid retrieval of schema/examples align LLM generations closely with real KG structure, minimizing hallucination.
- Automated iterative validation and correction are essential for high semantic precision and recall, especially in federated and complex KGQA scenarios.
- Explicit prompt constraints and schema-aware validation mitigate structural and semantic errors that otherwise plague naive LLM-to-SPARQL approaches.
- Federated reasoning is handled natively via endpoint assignment and SERVICE decomposition: the system can support complex, multi-KG queries with automated service routing.
Sustained high accuracy (F1 > 0.9) and reduced operator overhead position SPARQL-LLM as a production-ready baseline for natural language question answering over distributed knowledge graphs (Emonet et al., 8 Oct 2024).