
Entity-Centric Knowledge Store

Updated 30 January 2026
  • Entity-centric knowledge stores are repositories that structure multimodal data around central entities using star-shaped graph topologies.
  • They integrate diverse data sources like text, images, and structured records, employing LLMs and embedding methods for precise entity matching.
  • Advanced querying and retrieval engines, leveraging dynamic ontologies and GNNs, enable efficient analytics, multi-hop reasoning, and scalable design.

An entity-centric knowledge store is a structured repository in which representations, relations, and retrieval mechanisms are centered around entities (such as persons, organizations, artifacts, or core concepts) rather than isolated mentions, fragments, or generic schema. Unlike traditional knowledge bases or graphs that emphasize global connectivity or universal schemas, entity-centric approaches prioritize the extraction, organization, enrichment, and utilization of data with respect to a central entity, often resulting in star-shaped or activity-centric graph topologies. These repositories underpin a wide spectrum of advanced applications including contextual analytics, retrieval-augmented generation, visual question answering, enterprise intelligence, and predictive modeling. They commonly integrate heterogeneous data modalities (text, image, structured databases), leverage LLMs, and provide robust reasoning and querying capabilities tailored to complex, multi-hop, and dynamic inference.

1. Core Entity-Centric Schema Designs

Entity-centric knowledge graphs (ECKGs) employ ontologies and schemas explicitly structured around a central node, typically an individual (e.g., patient, user) or object of interest. For instance, the Health & Social Person-Centric Ontology (HSPO) represents each patient as the central node connected via RDF/OWL triples to various facets: Disease, Intervention, SocialContext, and Demographics. Each facet is joined to Person via one property (e.g., hasDisease, hasIntervention, hasSocialContext). The schema enforces that every triple is organized so the subject or object is either the central entity or one of its direct facets, producing a star-shaped subgraph for each entity (Theodoropoulos et al., 2023).
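The star topology described above can be sketched in a few lines; class and predicate names are simplified stand-ins for the HSPO vocabulary, and the facet codes are purely illustrative:

```python
from dataclasses import dataclass, field

# Minimal sketch of an HSPO-style star-shaped patient subgraph: every triple
# keeps the central Person as its subject, linked to exactly one facet node.
@dataclass
class PersonGraph:
    person_id: str
    triples: list = field(default_factory=list)  # (subject, predicate, object)

    def add_facet(self, predicate: str, facet_value: str) -> None:
        # Enforce the star topology: the subject is always the central entity.
        self.triples.append((self.person_id, predicate, facet_value))

p = PersonGraph("patient:001")
p.add_facet("hasDisease", "icd:428")             # Disease facet
p.add_facet("hasIntervention", "cpt:33533")      # Intervention facet
p.add_facet("hasSocialContext", "umls:C0682323") # SocialContext facet
```

Because each entity owns its own mini-graph, subgraphs can be built and queried independently, which is what makes the design parallelizable.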

Enterprise-focused entity-centric schemas extend this paradigm: the knowledge graph centers on a User node, with activity nodes (Task, Meeting, Document, Skill, Location) radiating out via typed edges (e.g., User–ATTENDS–Meeting, Person–WORKS_ON–Project). Semantic enrichment is performed by mapping internal entity representations to external ontologies (DBpedia, Wikipedia) for hierarchical classification and disambiguation (Kumar et al., 11 Mar 2025). In visual domains, an entity catalog is organized into semantic buckets (e.g., 22 categories spanning 7,568 entities), supporting fine-grained retrieval and reasoning over richly annotated image datasets (Qiu et al., 2024).

2. Data Ingestion, Extraction, and Normalization

Entity-centric stores process raw data from diverse sources—EHRs, corporate documents, images, code repositories—with ingestion layers tailored for each modality. In the healthcare domain, raw ICU and clinical data are grouped and normalized to patient-centric records; ICD codes are collapsed into families to reduce sparseness, and unstructured notes are parsed with UMLS concept extraction for social context facets (Theodoropoulos et al., 2023). Enterprise systems employ connectors and crawlers for emails, calendars, logs, and documents, extracting de-identified text and entity metadata.

Entity extraction uses LLM-based summarization and prompt engineering for high-fidelity entity and relation identification. Outputs are further normalized via embedding-based matching: each candidate entity mention is embedded (sentence transformers for text, CLIP for images) and matched against stored representations in a vector database (e.g., FAISS) by cosine similarity, with a threshold (τ ≈ 0.8) balancing precision and recall. Mentions that fail to meet the threshold are instantiated as new entities (Kumar et al., 11 Mar 2025, Qiu et al., 2024). In code-centric systems, parsers extract AST nodes, commit metadata, and ticket references, constructing heterogeneous graphs that represent the entire repository's structure (Rao et al., 13 Oct 2025).
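The threshold-based match-or-create step can be sketched as follows; the store layout and entity-id scheme are hypothetical, standing in for a FAISS-backed index:

```python
import numpy as np

TAU = 0.8  # cosine-similarity threshold from the text

def normalize(v: np.ndarray) -> np.ndarray:
    return v / np.linalg.norm(v)

def match_or_create(mention_vec, store):
    """store: dict entity_id -> unit vector. Returns (entity_id, created)."""
    m = normalize(mention_vec)
    best_id, best_sim = None, -1.0
    for eid, vec in store.items():
        sim = float(m @ vec)  # cosine similarity of unit vectors
        if sim > best_sim:
            best_id, best_sim = eid, sim
    if best_sim >= TAU:
        return best_id, False          # link mention to an existing entity
    new_id = f"entity:{len(store)}"
    store[new_id] = m                  # instantiate a new entity
    return new_id, True

store = {"entity:0": normalize(np.array([1.0, 0.0, 0.0]))}
eid, created = match_or_create(np.array([0.99, 0.05, 0.0]), store)
# near-duplicate mention -> linked to "entity:0", created is False
```

In production this linear scan would be replaced by an approximate-nearest-neighbor lookup, but the threshold logic is the same.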

3. Relation Inference, Embedding, and Graph Construction

Relation extraction proceeds via LLM inference, context triples, or graph-based analysis. Entity Context Graphs (ECGs) eschew schema-restricted KG triplets, instead employing ⟨e_p, c, e_s⟩ context triples, where c is a free-text context segment linking a primary and a secondary entity. Relations are encoded via convolutional neural networks, and embeddings are learned with a margin-ranking loss analogous to TransE:

S(h, c, t) = -\|\widehat{\mathbf{h}} + \widehat{\mathrm{Enc}(c)} - \widehat{\mathbf{t}}\|_p

Empirical results demonstrate that ECG embeddings (Hits@10 ≈ 70%) often outperform traditional KG-only embeddings (Hits@10 ≈ 47%) and can further enhance KG embeddings when trained jointly (Gunaratna et al., 2021).
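The scoring function and margin-ranking objective above can be sketched numerically; the CNN context encoder is replaced here by a placeholder vector, and all hatted quantities are L2-normalized as in the notation:

```python
import numpy as np

def unit(v: np.ndarray) -> np.ndarray:
    return v / np.linalg.norm(v)

def score(h, c_enc, t, p=2):
    # S(h, c, t) = -|| h_hat + Enc(c)_hat - t_hat ||_p  (higher = better)
    return -np.linalg.norm(unit(h) + unit(c_enc) - unit(t), ord=p)

def margin_ranking_loss(pos, neg, margin=1.0):
    # Push the true triple's score above the corrupted triple's by `margin`.
    return max(0.0, margin - pos + neg)

rng = np.random.default_rng(0)
h, c_enc, t, t_corrupt = (rng.normal(size=8) for _ in range(4))
loss = margin_ranking_loss(score(h, c_enc, t), score(h, c_enc, t_corrupt))
```

Training would backpropagate this loss through the encoder; the sketch only shows how the translational score and ranking margin interact.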

In knowledge graph construction, relationships are matched via embedding-based scoring functions and mapped to dynamic ontologies. Enterprise-centric graphs apply TransE-style scoring for (e₁, r, e₂) triples:

\phi(e_1, r, e_2) = -\|v_{e_1} + v_r - v_{e_2}\|_2

Multi-granular ANN indices and minimal engram representations (id, name, type, description, source) ensure token-efficient storage (up to 94% reduction) and support dynamic associative retrieval (Liao, 10 Oct 2025).
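A minimal engram record, following the five fields named above, might look like this; the serialization format is an assumption, but it illustrates why storing only the tuple (rather than full source text) yields the reported token savings:

```python
from dataclasses import dataclass, asdict
import json

# Sketch of a minimal "engram" entity record: only five core fields are
# kept, instead of the full source passage the entity was extracted from.
@dataclass(frozen=True)
class Engram:
    id: str
    name: str
    type: str
    description: str
    source: str

e = Engram(
    id="e1",
    name="GraphSAGE",
    type="Model",
    description="Inductive GNN with neighborhood mean aggregation",
    source="doc:123",
)
payload = json.dumps(asdict(e))  # compact metadata attached to the ANN index
```

New engrams can be appended incrementally, which is what enables the continual "lifetime learning" updates described below.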

Graph neural networks (GNNs) such as GraphSAGE and GAT serve as core representation engines. Node embeddings are updated via message passing, attention, and mean aggregation to support predictive tasks (e.g., 30-day ICU readmission). Multi-relation support via basis decomposition optimizes for schema heterogeneity.
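One GraphSAGE-style layer with mean aggregation can be sketched in numpy; weight shapes and the toy star graph are illustrative, not taken from any cited system:

```python
import numpy as np

def sage_layer(x, adj, w_self, w_neigh):
    """One mean-aggregation message-passing layer.
    x: (n, d) node features; adj: dict node -> neighbor list;
    w_self, w_neigh: (d, d_out) weight matrices."""
    out = np.zeros((x.shape[0], w_self.shape[1]))
    for v, neighbors in adj.items():
        agg = x[neighbors].mean(axis=0) if neighbors else np.zeros(x.shape[1])
        h = x[v] @ w_self + agg @ w_neigh   # combine self and neighborhood
        out[v] = np.maximum(h, 0.0)          # ReLU nonlinearity
    # L2-normalize embeddings, as in the original GraphSAGE formulation.
    norms = np.linalg.norm(out, axis=1, keepdims=True)
    return out / np.clip(norms, 1e-12, None)

x = np.eye(3)                        # 3 nodes with one-hot features
adj = {0: [1, 2], 1: [0], 2: [0]}    # star graph centered on node 0
rng = np.random.default_rng(1)
emb = sage_layer(x, adj, rng.normal(size=(3, 4)), rng.normal(size=(3, 4)))
```

Stacking such layers lets the central entity's embedding absorb multi-hop facet information before a downstream classifier (e.g., for readmission prediction) is applied.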

4. Reasoning, Retrieval, and Analytics Engines

Entity-centric stores offer sophisticated reasoning mechanisms and retrieval engines for complex queries:

  • Hybrid retrieval orchestration routes queries to KBLam (multi-hop, aggregation), DeepGraph (single-hop), or embedding-based backends, as classified by a small LLM. KBLam fuses query and node embeddings via rectangular multi-head attention, maximizing answer relevance and minimizing latency (Rao et al., 13 Oct 2025).
  • Retrieval-augmented generation (RAG): EcphoryRAG activates cue entity embeddings and performs multi-hop associative walks via weighted centroids, recursively expanding and re-ranking candidates. This approach uncovers latent bridging entities and achieves substantial exact match and F1 improvements on multi-hop QA benchmarks (Liao, 10 Oct 2025).
  • Multimodal systems: SnapNTell employs region-level entity detection (GLIP), CLIP-based image-to-text retrieval, and knowledge aggregation (Wikipedia, KGs) to construct an entity “dossier” used in multimodal LLM generation. Prompt-augmentation and modality adapters enable answer synthesis tightly coupled to the entity (Qiu et al., 2024).
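The routing step in hybrid retrieval orchestration can be illustrated with a toy classifier; the real system uses a small LLM, so the keyword heuristic and routing rules here are hypothetical stand-ins, with only the backend names taken from the text:

```python
def route_query(query: str) -> str:
    """Pick a retrieval backend for a query (heuristic stand-in for the
    small LLM classifier; rules are illustrative only)."""
    q = query.lower()
    if any(k in q for k in ("how many", "count", "total", "average")):
        return "KBLam"        # aggregation -> multi-hop reasoning backend
    if " and " in q or "related to" in q:
        return "KBLam"        # conjunctive queries tend to be multi-hop
    if q.startswith(("who", "what", "which", "where")):
        return "DeepGraph"    # simple factoid -> single-hop graph lookup
    return "embedding"        # fallback: dense vector retrieval

backend = route_query("How many meetings did the user attend last week?")
# aggregation phrasing -> routed to "KBLam"
```

The point of the router is latency: cheap single-hop lookups never pay the cost of the multi-hop reasoning path.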

Advanced querying interfaces translate natural language to graph queries (Cypher, SPARQL). Analytics modules support expertise discovery, task prioritization, and custom metric aggregation directly over the entity store (Kumar et al., 11 Mar 2025).

5. Scalability, Maintenance, and Robustness

Entity-centric knowledge stores are architected for scalability, efficiency, and resilience to data heterogeneity and missingness:

  • Star-shaped mini-graphs per entity are naturally scalable: one graph per entity, readily parallelizable, and robust to missing facets. Ablation on healthcare ECKGs with up to 97% missing data demonstrates only marginal drops in accuracy and F1 due to central facets’ dominance (Theodoropoulos et al., 2023).
  • Token-efficient indexing: Engram-based memory systems store only core entity tuples, achieving up to 94% reduction in token usage versus full-text KG-RAGs, facilitating continual updates and “lifetime learning” through incremental indexation (Liao, 10 Oct 2025).
  • Dynamic maintenance modules: Systems integrate delta detectors for incremental graph and index updates, support sharding/partitioning by organizational unit, and cache hot subgraphs for low-latency analytics (Kumar et al., 11 Mar 2025, Rao et al., 13 Oct 2025).
  • Modality-agnostic ingestion: Pipelines can handle structured, semi-structured, and unstructured data, including images, text, code, and logs. Embedding techniques enable unified vector search and cross-modal retrieval (CLIP, BioBERT, sentence transformers).
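The delta-detection pattern for incremental maintenance reduces to an upsert loop; the function and field names below are hypothetical, sketching the idea rather than any cited system's API:

```python
def apply_delta(index, store, changed_records, embed):
    """Re-embed and upsert only the changed records, instead of rebuilding
    the whole index. index: id -> vector; store: id -> record."""
    for rec in changed_records:
        vec = embed(rec["text"])
        store[rec["id"]] = rec
        index[rec["id"]] = vec      # upsert into the (toy) ANN index
    return index

index, store = {}, {}
apply_delta(index, store,
            [{"id": "doc1", "text": "Q3 report"}],
            embed=lambda t: [float(len(t))])  # placeholder embedder
```

Combined with sharding by organizational unit, this keeps update cost proportional to the size of the delta rather than the size of the store.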

6. Empirical Performance and Applications

Quantitative evaluation indicates entity-centric knowledge stores outperform conventional baselines across domains and tasks:

| Application Domain | Relevant Metric(s) | Observed Performance |
| --- | --- | --- |
| Enterprise Analytics | Entity extraction accuracy; relation extraction F₁; NDCG | Acc. 92%; F₁ 89%; sub-second queries; 80% relative improvement |
| Healthcare | Readmission prediction F₁ | GNN-based PKGSage ≈ 68% (vs. SVM ≈ 64%) |
| QA (multi-hop) | Exact match (EM); token savings | EcphoryRAG: EM 0.475 (+0.083 over SOTA) |
| VQA | BLEURT (semantic score) | SnapNTell: 0.55 (+66.5% over baseline) |

Entity-centric designs support applications including diagnosis prediction, expertise discovery, contextual search, visual question answering, code repository exploration, and personalized recommendation.

7. Lessons Learned, Challenges, and Extensions

Empirical studies highlight several best practices and limitations:

  • Schema design (central node, facet connectivity, relation heterogeneity) critically impacts downstream performance and trainability (Theodoropoulos et al., 2023).
  • Robustness to missing or sparse data is achieved via modular graph construction and resilient GNN architectures.
  • Associative multi-hop retrieval over entity-centric graphs allows dynamic reasoning without exhaustive pre-enumeration of all relations, but requires careful tuning (breadth, depth, seed weights) to manage noise and semantic drift (Liao, 10 Oct 2025).
  • Multimodal fusion and adapter-based architectures economize compute/memory while enabling flexible reasoning over images and text (Qiu et al., 2024).
  • Open-source frameworks (e.g., HSPO, PyTorch Geometric modules) facilitate rapid domain adaptation from healthcare to retail, manufacturing, or bibliometrics.

A plausible implication is that entity-centric knowledge stores are readily generalizable to any domain where the unit of interest is naturally describable via a set of attribute types and relations, provided that ontology design accommodates task-specific requirements and the ingestion pipeline is tailored to available data modalities (Theodoropoulos et al., 2023).
