Data-Centric Knowledge Injection
- Data-centric knowledge injection is a method that integrates external, structured domain knowledge into model training, ensuring compliance with defined constraints.
- It employs discrete models, entity mapping, and synthetic templating to convert external knowledge into actionable training data without modifying core architectures.
- Empirical studies demonstrate its impact through improved accuracy, retention, and privacy in federated learning, NLP, and recommender systems.
Data-centric knowledge injection refers to strategies that enrich machine learning models (especially deep neural networks, large-scale federated models, and LLMs) by encoding, generating, and systematically incorporating external, structured, or domain knowledge into the training data, loss functions, or optimization flow, with minimal dependence on architectural modifications. In contrast to model-centric approaches (which alter network topologies or introduce explicit knowledge-network layers), data-centric pipelines prepare, formalize, and integrate knowledge representations as part of the data and objective, enabling practical, extensible, and often privacy-preserving knowledge integration across a broad spectrum of applications.
1. Formal Representations and Knowledge Encoding
Data-centric knowledge injection begins by specifying, extracting, or generating structured knowledge elements suitable for integration into the model training process.
- Discrete Knowledge Models (KMs): In federated learning, each data owner formalizes domain expertise as one of two types: prediction KMs, which yield a deterministic label assignment $\mathrm{KM}^{\text{pred}}(x)$, and range KMs, which define a label support set $\mathrm{KM}^{\text{range}}(x)$ (e.g., a set of allowed classes). Consistency requires $\mathrm{KM}^{\text{pred}}(x) \in \mathrm{KM}^{\text{range}}(x)$ for all inputs $x$ (Fan et al., 2022); a minimal code sketch follows this list.
- Entity-centric Knowledge: In document-level NLP, entity spans are algorithmically detected and mapped to external knowledge graphs (e.g., Wikidata, Wikipedia2Vec), producing dense knowledge representations associated with input tokens (Wang et al., 2022, Garcia-Olano et al., 2021).
- Instructional and Paraphrastic Data: For LLMs, freshly curated factual data are extracted, deduplicated, and paraphrased into multiple surface forms. Synthetic instruction–response pairs, multi-hop QA chains, and diverse question templates encapsulate factual knowledge in formats that facilitate stronger model assimilation (Ovadia et al., 8 Apr 2025, Zhang et al., 12 Oct 2024, Zhao et al., 7 Mar 2025).
- Hybrid Ternary Knowledge: In large-scale recommenders, historic user–item–context interaction triples are encoded into explicit, retrievable vectors, forming an indexed knowledge base that decouples knowledge capacity from parameter limits (Qin et al., 21 Jan 2024).
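To make the KM formalization above concrete, the following sketch encodes a prediction KM and a range KM as plain callables and checks the consistency constraint on a toy sample set. The `PredictionKM`/`RangeKM` types, rule bodies, and class labels are illustrative assumptions, not constructs from Fan et al. (2022).

```python
from dataclasses import dataclass
from typing import Callable, Set

# Hypothetical input type: a dict of named features.
Example = dict

@dataclass
class PredictionKM:
    """Prediction knowledge model: maps an input to a single deterministic label."""
    rule: Callable[[Example], int]

    def __call__(self, x: Example) -> int:
        return self.rule(x)

@dataclass
class RangeKM:
    """Range knowledge model: maps an input to the set of admissible labels."""
    rule: Callable[[Example], Set[int]]

    def __call__(self, x: Example) -> Set[int]:
        return self.rule(x)

def consistent(pred_km: PredictionKM, range_km: RangeKM, samples) -> bool:
    """Consistency constraint: the predicted label must lie in the admissible
    set for every sample on which both KMs are defined."""
    return all(pred_km(x) in range_km(x) for x in samples)

# Toy domain rules (purely illustrative): label 0 = "low", 1 = "medium", 2 = "high".
pred_km = PredictionKM(rule=lambda x: 2 if x["temperature"] > 900 else 1)
range_km = RangeKM(rule=lambda x: {1, 2} if x["temperature"] > 700 else {0, 1})

samples = [{"temperature": 950}, {"temperature": 800}]
print(consistent(pred_km, range_km, samples))  # True for these samples
```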
2. Knowledge Injection Objectives and Transformation Operators
Given encoded knowledge, data-centric injection defines mathematical and algorithmic transformations that guarantee the target model meaningfully utilizes these elements within standard learning or federated optimization protocols.
- Function Transformation in FL: Federated learning applies a personalized transformation $T_k$ to each client $k$'s copy of the global model $f$, combining an infinite mask with a trust weight $\lambda$:

$$T_k\big(f(x)\big)_y \;\propto\; \exp\!\Big(f(x)_y + \lambda\,\mathbb{1}\big[y = \mathrm{KM}^{\text{pred}}_k(x)\big] - \infty\cdot\mathbb{1}\big[y \notin \mathrm{KM}^{\text{range}}_k(x)\big]\Big)$$

This infinite masking ensures strict range constraints (labels outside $\mathrm{KM}^{\text{range}}_k(x)$ receive zero probability), while $\lambda$ controls the degree of trust in point predictions (Fan et al., 2022); a code sketch of this transformation follows this list.
- Knowledge-Infilling and Masked Objectives: For encoder–decoder LMs, knowledge spans are inserted and masked using structured templates, with loss terms computed exclusively on masked spans for robust knowledge reconstruction (e.g., in KILM) (Xu et al., 2023).
- Adapter Modules and Bottleneck Compression: Domain KB facts are embedded via adapter modules (small MLPs) atop each LM layer, trained to memorize declarative fact templates. Downstream fine-tuning fuses these with the base model via learnable gate weights, without affecting the backbone (Emelin et al., 2022).
- Contrastive and Multi-similarity Losses: In domain retrieval or QA contexts, contrastive training aligns chunks/sentences with their knowledge-augmented variants (e.g., in DR.EHR) or ensures representations of knowledge-enriched and vanilla documents are closely matched in latent space (Zhao et al., 24 Jul 2025).
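The snippet below is a minimal NumPy sketch of the transformation $T_k$ above, operating on a single logit vector: labels outside the allowed set are masked to $-\infty$ and the prediction-KM label is boosted by $\lambda$ before renormalization. The function name and arguments are illustrative, and the exact parameterization in Fan et al. (2022) may differ.

```python
import numpy as np

def km_transform(logits: np.ndarray,
                 pred_label: int,
                 allowed: set,
                 lam: float = 1.0) -> np.ndarray:
    """Personalized transformation of global-model logits (sketch).

    - labels outside `allowed` receive a -inf mask, so they get zero probability;
    - the prediction-KM label `pred_label` is boosted by `lam`, controlling how
      much trust is placed in the point prediction.
    """
    masked = logits.astype(float)
    for y in range(len(masked)):
        if y not in allowed:
            masked[y] = -np.inf          # strict range constraint
    masked[pred_label] += lam            # trust in the point prediction
    # softmax over the masked logits; -inf entries become exactly 0
    z = masked - masked[np.isfinite(masked)].max()
    p = np.where(np.isfinite(z), np.exp(z), 0.0)
    return p / p.sum()

logits = np.array([2.0, 0.5, -1.0, 0.0])
probs = km_transform(logits, pred_label=1, allowed={0, 1}, lam=2.0)
print(probs)  # classes 2 and 3 get exactly zero mass; class 1 is boosted
```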
3. End-to-End Algorithmic Workflows
Data-centric knowledge injection is operationalized in multi-step learning protocols that balance knowledge integration, privacy, and scalability.
- Distributed Federated Protocols: Each client applies its personalized, KM-transformed model to its private data and knowledge, optimizes locally for a fixed number of steps, and shares only the updated weights, never KM parameters or raw data. The server averages the updates, and every local model's outputs fully respect the injected knowledge (the infinite mask enforces strict support) (Fan et al., 2022); a protocol sketch follows this list.
- Layer-wise Distributed Injection and Supervision: In both NLP and vision tasks, external entity knowledge is transformed to the backbone embedding space (e.g., via a learned linear map to BERT’s domain), concatenated with token/region features, and fused into model representations prior to classification (Wang et al., 2022, Garcia-Olano et al., 2021).
- Synthetic Data Pipeline: Extraction–deduplication–paraphrasing–instructional templating forms a robust pipeline. For example, Knowledge-Instruct iteratively extracts facts and paraphrases from limited-domain corpora, converts all to instruction–response pairs, and jointly fine-tunes models with a mix of these and original SFT data to prevent catastrophic forgetting (Ovadia et al., 8 Apr 2025).
- Retrievable Knowledge-Base Augmentation: In D2K, all interaction triplets in historical data are encoded and stored; at inference, the target sample retrieves relevant vectors, adapts them via personalized networks, and injects the knowledge at the input or intermediate stages of arbitrary recommenders (Qin et al., 21 Jan 2024).
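The following sketch, assuming PyTorch and toy data, illustrates the federated workflow from the first bullet: each simulated client applies a KM-based mask and boost to its logits, trains locally for a few steps, and returns only a weight dictionary, which the server averages FedAvg-style. Client counts, KM rules, and hyperparameters are placeholders, not values from the cited protocol.

```python
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical setup: 3 clients, 4 classes, a shared linear model. Each client
# holds private KMs (plain functions/tensors here) that are never transmitted;
# only model weights leave the client.

def masked_log_probs(logits, pred_labels, allowed_mask, lam=2.0):
    """KM transformation before the loss (sketch): -inf mask outside the
    allowed label set, +lam boost on the KM-predicted label."""
    boost = F.one_hot(pred_labels, logits.size(-1)).float() * lam
    masked = (logits + boost).masked_fill(~allowed_mask, float("-inf"))
    return F.log_softmax(masked, dim=-1)

def client_update(global_model, x, y, pred_labels, allowed_mask, steps=5, lr=0.1):
    model = copy.deepcopy(global_model)
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    for _ in range(steps):
        log_p = masked_log_probs(model(x), pred_labels, allowed_mask)
        loss = F.nll_loss(log_p, y)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return model.state_dict()            # only weights leave the client

def fedavg(states):
    avg = copy.deepcopy(states[0])
    for k in avg:
        avg[k] = torch.stack([s[k] for s in states]).mean(dim=0)
    return avg

torch.manual_seed(0)
global_model = nn.Linear(8, 4)
for _ in range(3):                        # a few federated rounds
    states = []
    for _ in range(3):                    # each (simulated) client
        x = torch.randn(16, 8)
        y = torch.randint(0, 2, (16,))              # true labels within {0, 1}
        pred_labels = y.clone()                     # toy prediction KM
        allowed_mask = torch.zeros(16, 4, dtype=torch.bool)
        allowed_mask[:, :2] = True                  # toy range KM: labels {0, 1}
        states.append(client_update(global_model, x, y, pred_labels, allowed_mask))
    global_model.load_state_dict(fedavg(states))
```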
4. Theoretical Guarantees, Evaluation Metrics, and Analytical Insights
Data-centric approaches yield precise theoretical and empirical properties:
- Output Validity and Constraint Satisfaction: For FL models with infinite masking, outputs are always valid distributions ($\sum_y T_k(f(x))_y = 1$), guarantee a minimum probability mass on the KM-predicted class (increasing in the trust weight $\lambda$), and strictly respect range constraints ($T_k(f(x))_y = 0$ for all $y \notin \mathrm{KM}^{\text{range}}_k(x)$) (Fan et al., 2022); the sketch after this list checks these properties.
- Privacy: Only per-batch gradients are shared. No explicit parameters or formulas representing proprietary knowledge models are ever exposed, thus maintaining privacy at the same level as classical federated learning (Fan et al., 2022).
- Retention and Generalization: In LLMs, empirical studies demonstrate that QA-style data-centric injection produces up to 48% retention of injected facts, whereas mapping-style (translation, JSON) formats achieve only 17–20%. Scaling laws indicate that retention increases monotonically with model size but gaps between comprehension-oriented and mapping tasks persist (Jan et al., 22 May 2025).
- Catastrophic Forgetting: Mixing general instruction-tuning data or original pretraining samples with new-knowledge data is essential to prevent significant drops in existing capabilities during injection (Ovadia et al., 8 Apr 2025).
- Evaluation Protocols: Knowledge injection is assessed via direct probe questions, indirect (generic) probes for semantic integration, and, in federated regimes, via accuracy and constraint violation rates over client test sets (e.g., test accuracy, % of range-KM violations) (Fan et al., 2022, Jan et al., 22 May 2025).
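As a worked illustration of the validity properties and constraint-violation metric above, the helper below (a sketch with hypothetical names) checks that KM-transformed probability vectors sum to one, reports the smallest mass placed on the predicted class, verifies zero mass outside the allowed set, and computes a range-violation rate over argmax predictions.

```python
import numpy as np

def constraint_report(probs: np.ndarray, pred_labels, allowed_masks):
    """Check the properties claimed for KM-transformed outputs (sketch):
    valid distributions, mass on the KM-predicted label, zero mass outside
    the allowed set, and the range-violation rate of argmax predictions."""
    sums_to_one = np.allclose(probs.sum(axis=1), 1.0)
    mass_on_pred = probs[np.arange(len(probs)), pred_labels]
    zero_outside = np.all(probs[~allowed_masks] == 0.0)
    violations = np.mean([probs[i].argmax() not in np.flatnonzero(allowed_masks[i])
                          for i in range(len(probs))])
    return {
        "valid_distribution": bool(sums_to_one),
        "min_mass_on_predicted": float(mass_on_pred.min()),
        "respects_range": bool(zero_outside),
        "range_violation_rate": float(violations),
    }

# Toy check on two samples with 4 classes (labels 2 and 3 disallowed).
probs = np.array([[0.7, 0.3, 0.0, 0.0],
                  [0.2, 0.8, 0.0, 0.0]])
allowed = np.array([[True, True, False, False]] * 2)
print(constraint_report(probs, pred_labels=[0, 1], allowed_masks=allowed))
```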
5. Empirical Results and Benchmark Gains
Extensive empirical validation underpins data-centric knowledge injection.
- Federated Learning: With four industry clients (coal-mixing task), FL with KM injection led to the highest test accuracy in 3/4 subpopulations, strictly zero range-violation rate, and consistently outperformed FL without KMs and local models. On public datasets (Covtype, FMNIST), benefits were most pronounced at low (1–5%) data volumes (Fan et al., 2022).
- Encoder–Decoder LMs: KILM improved zero-shot entity-disambiguation F1 from 42.7% (BART-base) to 75.3% (+76%), halved hallucination rates in appositive generation (e.g., Wiki-ORG Not-Hallucinated: 49.7%→61.0%), and preserved or improved general NLU and summarization scores (Xu et al., 2023).
- Dialogue Systems: Slotting domain knowledge via adapters increased knowledge-probing accuracy on MultiWOZ 2.2 from 76.5% (BART) to 85.0% (adapter), and improved response generation success rate by 8.1 points (Emelin et al., 2022).
- Recommender Systems: D2K delivered AUC improvements of 1.4–2.4% on production-scale CTR benchmarks and maintained performance as the knowledge-base scaled with historical volume (Qin et al., 21 Jan 2024).
- Generalization and Retention: Instruction-based data-centric injection (Knowledge-Instruct, Ski) achieved 76–81% accuracy on completely unseen knowledge with minimal loss on general capabilities, outperforming continual pretraining and standard SFT by 12–20 points (Ovadia et al., 8 Apr 2025, Zhang et al., 12 Oct 2024).
Empirical ablations reinforce that comprehension-aligned instruction or QA templates are essential for successful injection—document or mapping formats (translation, text-to-JSON) yield far lower retention and poor transfer to unseen prompt patterns (Zhao et al., 7 Mar 2025, Jan et al., 22 May 2025). Diversity in surface realization via paraphrasing and explicit multi-form augmentation (e.g., Ski framework’s fine-grained QA and assembly) further boosts integration.
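The sketch below illustrates the kind of surface-form diversification these ablations point to: a single atomic fact is expanded into several instruction–response pairs via question and answer templates. The templates, relation names, and entities are fabricated for illustration; real pipelines such as Knowledge-Instruct generate paraphrases and questions with an LLM rather than hand-written patterns.

```python
from dataclasses import dataclass
from itertools import product

@dataclass
class AtomicFact:
    subject: str
    relation: str
    object: str

# Hypothetical question templates per relation (hand-written for the sketch).
QUESTION_TEMPLATES = {
    "headquartered_in": [
        "Where is {subject} headquartered?",
        "In which city are {subject}'s headquarters located?",
        "{subject} has its main offices in which city?",
    ],
}
ANSWER_TEMPLATES = [
    "{object}",
    "{subject} is headquartered in {object}.",
]

def to_instruction_pairs(fact: AtomicFact):
    """Expand one atomic fact into several instruction-response surface forms."""
    questions = QUESTION_TEMPLATES[fact.relation]
    pairs = []
    for q_tpl, a_tpl in product(questions, ANSWER_TEMPLATES):
        pairs.append({
            "instruction": q_tpl.format(subject=fact.subject),
            "response": a_tpl.format(subject=fact.subject, object=fact.object),
        })
    return pairs

fact = AtomicFact("Acme Robotics", "headquartered_in", "Lisbon")
for pair in to_instruction_pairs(fact):
    print(pair)
# Six surface realizations of the same fact; mixing such pairs with general SFT
# data is what the cited studies report as mitigating catastrophic forgetting.
```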
6. Limitations, Open Questions, and Best Practice Guidelines
While data-centric knowledge injection yields strong privacy, compliance, and scalability properties, limitations remain.
- Knowledge Type Coverage: Current techniques support point (deterministic) predictions and hard range constraints; richer logical or constraint-based knowledge (e.g., first-order rules, multi-hop relations) requires extending the KM formalism (Fan et al., 2022).
- Alignment and Conflict: Participant KMs must be mutually non-conflicting; verifying this automatically for complex rules remains an open problem (Fan et al., 2022).
- Scaling Paraphrase Diversity: Sample diversity for paraphrase-based injection is bounded by external LLM/public API costs, motivating latent-level augmentation approaches (LaPael) that amortize paraphrase generation (Kang et al., 1 Nov 2024).
- Semantic Integration: Even with high direct probe retention, transfer to unseen, generic contexts remains partial; injected knowledge is often shallowly encoded. Strategies to consolidate and integrate new facts more deeply into the model’s world representation remain an active topic (Jan et al., 22 May 2025).
- Instruction Design and Data Efficiency: Empirical results confirm that QA and blank-filling tasks yield higher knowledge transfer than translation or JSON mapping, regardless of model size. Best practices include curating atomic facts, maximizing prompt diversity, interleaving comprehension and mapping tasks, mixing in general training data, and favoring small learning rates with early stopping (Jan et al., 22 May 2025, Ovadia et al., 8 Apr 2025); a configuration sketch follows this list.
- Downstream Integration: In recommender and federated settings, not all model architectures naturally support explicit knowledge-injection vectors; careful interface and adaptation function engineering remain necessary (Qin et al., 21 Jan 2024).
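To make the best-practice guidance concrete, the following sketch shows one way to encode a mixing-and-tuning configuration and to interleave new-knowledge pairs with general instruction data; every numeric value is an illustrative assumption, not a setting reported in the cited studies.

```python
from dataclasses import dataclass
import random

@dataclass
class InjectionFinetuneConfig:
    """Illustrative hyperparameters reflecting the best-practice guidance above;
    the specific values are assumptions, not settings from the cited work."""
    new_knowledge_fraction: float = 0.3   # mix new-knowledge pairs with general SFT data
    learning_rate: float = 1e-5           # favor small learning rates
    max_epochs: int = 3
    early_stopping_patience: int = 1      # stop when general-capability score drops

def build_mixed_dataset(new_pairs, general_pairs, cfg, size=1000, seed=0):
    """Interleave comprehension-style new-knowledge pairs with general
    instruction data to limit catastrophic forgetting."""
    rng = random.Random(seed)
    n_new = int(size * cfg.new_knowledge_fraction)
    mixed = rng.choices(new_pairs, k=n_new) + rng.choices(general_pairs, k=size - n_new)
    rng.shuffle(mixed)
    return mixed

cfg = InjectionFinetuneConfig()
new_pairs = [{"instruction": "Where is Acme Robotics headquartered?", "response": "Lisbon"}]
general_pairs = [{"instruction": "Summarize the paragraph below.", "response": "..."}]
dataset = build_mixed_dataset(new_pairs, general_pairs, cfg, size=10)
print(len(dataset), cfg.learning_rate)
```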
Data-centric knowledge injection is thus both a robust and practical paradigm, with demonstrated benefits across privacy-sensitive, federated, low-data, and instruction-heavy domains. It restructures the knowledge-integration landscape by shifting focus from model surgery to principled data and loss design, aligning injected expertise with intended usage, and unlocking reliable, scalable, and privacy-preserving model updates in diverse settings.