
DataChat: Graph-Driven Conversational Data Search

Updated 19 November 2025
  • DataChat is a hybrid conversational system that combines graph representation with large language models to translate natural language into executable Cypher queries.
  • It integrates a Neo4j-based ICPSR repository, a dialog manager with prompt engineering, and a Streamlit interface to support both chat and interactive visualizations.
  • Its property-based ranking and multi-hop graph traversal enable efficient exploration of complex research datasets while addressing scalability and transparency challenges.

DataChat is a prototype conversational system designed to facilitate dataset search and exploration through natural language interaction, integrating a graph-based knowledge representation with LLM-driven Cypher query generation. Built atop a subset of the Inter-university Consortium for Political and Social Research (ICPSR) Scientific Knowledge Graph (ICPSR-SKG), the system leverages a Neo4j graph backend, an OpenAI GPT-3.5-turbo LLM module, and a Streamlit-based user interface to present chat-style answers or interactive visualizations. DataChat exemplifies a hybrid approach marrying structured graph representation and state-of-the-art LLMs for enhanced research data discovery (Fan et al., 2023).

1. System Components and Architectural Overview

DataChat comprises five primary components: (1) the Neo4j ICPSR-SKG graph database, (2) a conversational agent module implemented with GPT-3.5-turbo (accessed via the OpenAI API), (3) a dialog manager with prompt engineering logic, (4) a search and ranking pipeline that manages Cypher query generation, execution, and result ordering, and (5) a visualization and interface layer based on Streamlit and streamlit-agraph.

The system workflow can be summarized as:

  1. User submits a natural language (NL) query via the Streamlit web UI.
  2. Streamlit forwards the NL query to the backend dialog manager.
  3. The dialog manager constructs a Cypher prompt by concatenating fixed prompt examples and the user’s question, transmitting this to GPT-3.5-turbo.
  4. The LLM output is parsed as a Cypher query for the ICPSR-SKG schema.
  5. Neo4j executes the Cypher, returning matched nodes and edges encoded as JSON.
  6. The frontend either displays the answer in chat format (with Cypher query visible) or visualizes a subgraph interactively.

This modular pipeline supports both information retrieval and multi-hop graph queries over a heterogeneous research data ecosystem (Fan et al., 2023).

2. Component Functionality and Interactions

2.1 Streamlit-Focused Front End

The interface exposes two principal modalities: a "DataChatBot" tab facilitating Q&A in a chat format (including the underlying Cypher query) and a "DataChatViz" tab rendering interactive subgraphs (typically constrained to 3–5 datasets for clarity). User queries, LLM-generated code, and graph-based outputs are presented to enhance interpretability and support iterative refinement.

2.2 Dialog Manager: Prompt Engineering and Control Flow

The dialog manager maintains a small bank of NL-to-Cypher exemplars, which constitute the fixed-prompt template. On each turn, it synthesizes the input prompt as follows:

Translate the following natural language question into a Cypher query for the ICPSR-SKG.
Examples:
Q: 'What are the top 5 most cited datasets not owned by ICPSR?'
A: MATCH (a:Dataset) WHERE a.owner <> 'ICPSR' RETURN ... ORDER BY a.dataRefCount DESC LIMIT 5
Now translate: Q: '<user question>' A:

The augmented prompt is transmitted to the OpenAI API. The resultant Cypher code is then parsed and executed against the database; errors trigger user-facing diagnostic messages (Fan et al., 2023).
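The prompt-assembly step described above can be sketched as follows. `EXAMPLES` and `build_prompt` are illustrative names, not identifiers from DataChat itself; the exemplar mirrors the one shown in the template.

```python
# Minimal sketch of fixed-prompt assembly, assuming a small bank of
# (question, Cypher) exemplars. Names are illustrative, not from the paper.
EXAMPLES = [
    ("What are the top 5 most cited datasets not owned by ICPSR?",
     "MATCH (a:Dataset) WHERE a.owner <> 'ICPSR' "
     "RETURN a.name ORDER BY a.dataRefCount DESC LIMIT 5"),
]

def build_prompt(examples, nl_question):
    """Concatenate the fixed instruction, exemplars, and the user question."""
    lines = ["Translate the following natural language question into a "
             "Cypher query for the ICPSR-SKG.",
             "Examples:"]
    for q, a in examples:
        lines.append(f"Q: '{q}'")
        lines.append(f"A: {a}")
    lines.append(f"Now translate: Q: '{nl_question}' A:")
    return "\n".join(lines)
```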

2.3 LLM Module

GPT-3.5-turbo processes each prompt anew, conditioned on the provided exemplars and the current NL query, with no session-wide context or long-term memory. LLM outputs are strictly single-turn, so stateful dialogue (e.g., complex clarifications and follow-ups) is not natively supported.

2.4 Neo4j Graph Database and ICPSR-SKG

The ICPSR-SKG graph contains 1,642 Dataset nodes and six related node types. Queries are served via the Bolt driver or the Neo4j HTTP API, with results returned as JSON. The database encodes nodes for Dataset, Publication, Term, Owner, Funder, Series, and Location; key relationship types include:

  • (Dataset) –[:HAS_TERM]→ (Term)
  • (Dataset) –[:HAS_OWNER]→ (Owner)
  • (Dataset) –[:HAS_FUNDER]→ (Funder)
  • (Dataset) –[:HAS_SERIES]→ (Series)
  • (Dataset) –[:HAS_LOCATION]→ (Location)
  • (Publication) –[:CITED_BY]→ (Dataset)

Indexes support efficient lookup on node identifiers and common name attributes. Full-text search can be optionally enabled on name fields (Fan et al., 2023).
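The six relationship types above can be encoded as (source label, relationship, target label) triples, from which one-hop MATCH patterns follow mechanically. This is a hedged sketch: `SCHEMA` and `one_hop_match` are illustrative names, not part of DataChat.

```python
# Schema triples mirroring the relationship listing above.
SCHEMA = [
    ("Dataset", "HAS_TERM", "Term"),
    ("Dataset", "HAS_OWNER", "Owner"),
    ("Dataset", "HAS_FUNDER", "Funder"),
    ("Dataset", "HAS_SERIES", "Series"),
    ("Dataset", "HAS_LOCATION", "Location"),
    ("Publication", "CITED_BY", "Dataset"),
]

def one_hop_match(rel_type):
    """Emit a one-hop MATCH clause for a schema relationship type."""
    for src, rel, dst in SCHEMA:
        if rel == rel_type:
            return f"MATCH (a:{src})-[:{rel}]->(b:{dst}) RETURN a, b"
    raise ValueError(f"unknown relationship: {rel_type}")
```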

3. Graph Schema and Data Representation

3.1 Node and Edge Semantics

The ICPSR-SKG schema is characterized by semantically typed nodes capturing both core datasets and contextual metadata. Dataset nodes are richly annotated with properties such as:

  • id: string
  • name: string
  • date: YYYY or YYYY–YYYY
  • url: DOI
  • totalUserCount: integer
  • dataUserCount: integer
  • dataRefCount: integer

Other entities (Publication, Term, Owner, Funder, Series, Location) follow similar labeling conventions, providing attributes for display, filtering, and ranking.
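As a typed record, the Dataset property list above might look like the following. The field names follow the document; the class itself and the sample values are assumptions for exposition.

```python
from dataclasses import dataclass

# Illustrative record for the Dataset node properties listed above.
@dataclass
class DatasetNode:
    id: str
    name: str
    date: str            # "YYYY" or "YYYY-YYYY"
    url: str             # DOI
    totalUserCount: int
    dataUserCount: int
    dataRefCount: int
```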

3.2 Indexing, Search, and Expansion

Neo4j property indexing is established on high-utility attributes (Dataset.id, Term.name, Owner.name, etc.) to accelerate direct and substring queries. Although vector embeddings and advanced similarity search are not employed, the graph structure enables Cypher-based topological queries and multi-hop traversals for complex information needs.
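Index creation on those attributes could be expressed with standard Neo4j DDL along these lines; the index names are assumptions, not taken from the deployed system.

```python
# Illustrative index DDL (Neo4j 4.x+ syntax) for the high-utility
# attributes named above; index names are assumed.
INDEX_STATEMENTS = [
    "CREATE INDEX dataset_id IF NOT EXISTS FOR (d:Dataset) ON (d.id)",
    "CREATE INDEX term_name IF NOT EXISTS FOR (t:Term) ON (t.name)",
    "CREATE INDEX owner_name IF NOT EXISTS FOR (o:Owner) ON (o.name)",
]
```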

3.3 Visualization

Results can be rendered as node-edge lists and visualized using streamlit-agraph, with node color encoding type and edges reflecting semantic relationships. Interactive features permit node drag, attribute highlighting, and shared attribute centering, supporting both individual object inspection and local neighborhood expansion.
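Preparing a result set for that kind of rendering amounts to mapping node types to colors and edges to labeled links. A minimal sketch, with `TYPE_COLORS` and `to_render_graph` as illustrative names and assumed colors:

```python
# Node color encodes type; edges carry the relationship label,
# matching the visualization conventions described above.
TYPE_COLORS = {"Dataset": "#1f77b4", "Term": "#2ca02c", "Owner": "#ff7f0e"}

def to_render_graph(nodes, edges):
    """nodes: (id, label, type) triples; edges: (src, rel, dst) triples."""
    render_nodes = [
        {"id": nid, "label": label,
         "color": TYPE_COLORS.get(ntype, "#999999")}
        for nid, label, ntype in nodes
    ]
    render_edges = [
        {"source": s, "label": rel, "target": t} for s, rel, t in edges
    ]
    return render_nodes, render_edges
```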

4. Retrieval, Ranking, and Query Translation Algorithmics

4.1 Direct Property-Based Ranking

DataChat’s retrieval and ranking are strictly property-driven, eschewing learned similarity models. Results are ordered with Cypher’s ORDER BY operator using properties such as dataRefCount (citation count), dataUserCount (downloads), and date (recency).

Citation-based ranking is represented as

$$\text{Score}_{\text{citation}}(D_i) = \text{dataRefCount}(D_i)$$

Normalized ranking:

$$\text{Rank}(D_i) = \frac{\text{dataRefCount}(D_i)}{\max_j \text{dataRefCount}(D_j)}$$

A convex combination of recency and citations:

$$\text{Score}(D_i) = \alpha \cdot \frac{\text{dataRefCount}(D_i)}{\max_j \text{dataRefCount}(D_j)} + (1-\alpha)\cdot \frac{\text{currentYear} - \text{year}(D_i)}{\max_j\left(\text{currentYear}-\text{year}(D_j)\right)}$$
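The combined score above translates directly into a few lines of Python. A minimal sketch, assuming each dataset is a dict with `dataRefCount` and `year` keys (illustrative field names); it mirrors the formula term by term.

```python
# Convex combination of normalized citation count and the age term,
# exactly as in the formula above.
def combined_score(datasets, alpha, current_year):
    max_cites = max(d["dataRefCount"] for d in datasets) or 1
    max_age = max(current_year - d["year"] for d in datasets) or 1
    return [
        alpha * d["dataRefCount"] / max_cites
        + (1 - alpha) * (current_year - d["year"]) / max_age
        for d in datasets
    ]
```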

4.2 Graph Traversal Strategies

Multi-hop and composite queries, such as identifying datasets linked to publications with a specific attribute or neighborhood expansion, utilize chained Cypher MATCH statements. At present, no sophisticated measures such as PageRank are applied, though Cypher’s algorithmic library (e.g., algo.pageRank.stream) could be integrated in future extensions.
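A chained-MATCH query of the kind described, linking datasets to their citing publications and filtering by term, might look like the following. The query text is an assumption consistent with the schema in Section 2.4, not taken from DataChat itself.

```python
# Two-hop Cypher: publications citing datasets tagged with a given term.
# Note: string interpolation is used for brevity; a real deployment
# should pass term_name as a query parameter.
def two_hop_query(term_name):
    return (
        "MATCH (p:Publication)-[:CITED_BY]->(d:Dataset)"
        "-[:HAS_TERM]->(t:Term) "
        f"WHERE t.name = '{term_name}' "
        "RETURN d.name, count(p) AS citingPubs ORDER BY citingPubs DESC"
    )
```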

4.3 Query Translation Pipeline

The query flow in DataChat can be represented as the following pseudocode:

from neo4j.exceptions import CypherSyntaxError

def handle_user_query(nl_question, view="bot"):
    # Assemble the fixed-exemplar prompt and request a Cypher translation
    prompt = build_prompt(examples, nl_question)
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    cypher = response.choices[0].message.content
    try:
        result = neo4j_session.run(cypher)
        if view == "bot":  # DataChatBot tab: chat-style answer
            text = [format_row(r) for r in result]
            return render_chat(text, cypher)
        else:  # DataChatViz tab: interactive subgraph
            graph = extract_graph(result)
            return render_graph(graph)
    except CypherSyntaxError as e:
        # Surface syntax errors so the user can refine the question
        return render_error(e, cypher)

A textual flowchart sequence is:

  1. User inputs NL query in Streamlit
  2. Streamlit passes query to Backend Dialog Manager
  3. Dialog Manager assembles prompt and calls LLM API
  4. GPT-3.5-turbo returns Cypher query string
  5. Backend executes Cypher on Neo4j
  6. Neo4j returns result set (nodes/edges as JSON)
  7. Backend formats response for chat or visualization
  8. Streamlit updates UI with result

5. LLM Integration and Interaction Paradigms

The LLM interface is deliberately prompt-sparse: fixed in-context exemplars plus the current user question, with no session memory or explicit conversational grounding. This approach provides a direct, inspectable translation of NL intent to executable Cypher, offering transparency to the user (visible Cypher in the chat tab) and a lightweight educational effect.

System failures (e.g., Cypher syntax errors) are surfaced to the user, who can then iteratively refine their input. Prominent characteristics include the absence of ongoing dialogue state, the unidirectional prompt-response logic, and the exposure of intermediate code artifacts for transparency and explainability (Fan et al., 2023).

6. Scalability, Usability, and Extensibility Considerations

6.1 Scalability Constraints

The deployed prototype is based on a 1,642-dataset sample of the ICPSR repository; full deployment (exceeding 11,000 datasets) would necessitate graph sharding, more robust Neo4j cluster resources, and paginated visualizations. LLM inference latency (approximately 200–500 ms per turn) could rise with expanded prompt contexts or longer output.
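One route to the paginated visualizations mentioned above is Cypher's SKIP/LIMIT windowing. A minimal sketch; the function name and default page size are illustrative assumptions.

```python
# Append ordering and a SKIP/LIMIT window to a base MATCH clause,
# so results can be paged rather than rendered all at once.
def paginated_query(base_match, page, page_size=25):
    skip = page * page_size
    return (f"{base_match} RETURN d ORDER BY d.dataRefCount DESC "
            f"SKIP {skip} LIMIT {page_size}")
```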

6.2 User Experience and Interface Trade-Offs

Presenting both NL-to-Cypher translation and underlying graph structures enhances transparency and supports the CEVI (Context, Efficiency, Visibility, Interactivity) framework. However, exposing full Cypher code increases cognitive load, and large graph visualizations may become unwieldy.

Natural language input streamlines exploration compared to traditional multi-dropdown interfaces, while graph visualizations promote result visibility and interactivity, with explicit design constraints limiting graph rendering to a manageable subset of nodes.

6.3 Extensibility Pathways

Potential expansion vectors include enriching funder attributes, incorporating research fields, and adding topic models for publications. Optional integration of semantic vector search and graph-based ranking algorithms such as PageRank could further diversify retrieval capacities. These extensibility provisions are informed by observed limitations in current property-based ranking and fixed prompt structures.

7. Significance and Future Directions

DataChat demonstrates an architecture that bridges LLM-based NL understanding and graph-structured research data, providing a blueprint for conversational dataset discovery tools. Design trade-offs orient towards transparency, modular extensibility, and interactive user experience, while current limitations—absence of conversational memory, property-only ranking, limited scalability—highlight areas for future work. The explicit separation of LLM-driven translation, graph-structured storage, and multimodal presentation suggests a broader applicability to other domains requiring conversational access to heterogeneous research repositories (Fan et al., 2023).
