
Bottom-up Domain-specific Superintelligence: A Reliable Knowledge Graph is What We Need (2507.13966v1)

Published 18 Jul 2025 in cs.CL and cs.AI

Abstract: LLMs traditionally used for cross-domain generalization have recently demonstrated task-specific reasoning. However, their top-down training approach on general corpora is insufficient for acquiring abstractions needed for deep domain expertise. This may require a bottom-up approach that acquires expertise by learning to compose simple domain concepts into more complex ones. A knowledge graph (KG) provides this compositional structure, where domain primitives are represented as head-relation-tail edges and their paths encode higher-level concepts. We present a task generation pipeline that synthesizes tasks directly from KG primitives, enabling models to acquire and compose them for reasoning. We fine-tune LLMs on the resultant KG-grounded curriculum to demonstrate domain-specific superintelligence. While broadly applicable, we validate our approach in medicine, where reliable KGs exist. Using a medical KG, we curate 24,000 reasoning tasks paired with thinking traces derived from diverse medical primitives. We fine-tune the QwQ-32B model on this curriculum to obtain QwQ-Med-3 that takes a step towards medical superintelligence. We also introduce ICD-Bench, an evaluation suite to quantify reasoning abilities across 15 medical domains. Our experiments demonstrate that QwQ-Med-3 significantly outperforms state-of-the-art reasoning models on ICD-Bench categories. Further analysis reveals that QwQ-Med-3 utilizes acquired primitives to widen the performance gap on the hardest tasks of ICD-Bench. Finally, evaluation on medical question-answer benchmarks shows that QwQ-Med-3 transfers acquired expertise to enhance the base model's performance. While the industry's approach to artificial general intelligence (AGI) emphasizes broad expertise, we envision a future in which AGI emerges from the composable interaction of efficient domain-specific superintelligent agents.

Summary

  • The paper presents a pipeline that generates reasoning tasks from multi-hop KG paths to instill deep, domain-specific expertise in language models.
  • It demonstrates that curriculum-tuned models, especially QwQ-Med-3, achieve up to 20% higher accuracy and enhanced compute efficiency on ICD-Bench.
  • The study highlights that grounding LLMs in KG-derived primitives enables robust compositional reasoning and strong transfer to external medical benchmarks.

Bottom-Up Domain-Specific Superintelligence via Knowledge Graph-Grounded Curricula

This paper presents a systematic approach to eliciting domain-specific superintelligence in LLMs by leveraging bottom-up curricula derived from reliable knowledge graphs (KGs). The authors argue that current LLMs, trained in a top-down fashion on broad, unstructured corpora, lack the compositional, axiomatic understanding required for deep expertise in specialized domains. They propose a pipeline that synthesizes reasoning tasks directly from KG paths, enabling explicit acquisition and composition of domain primitives. The methodology is instantiated in the medical domain using the UMLS KG, and the resulting models are evaluated on a new benchmark, ICD-Bench, designed to probe compositional reasoning across 15 medical sub-specialties.

Motivation and Theoretical Framework

The central thesis is that superintelligence in a domain is characterized by depth—outperforming human experts in specialized reasoning—rather than breadth alone. The authors critique the limitations of top-down LLM training, which tends to capture surface-level regularities but fails to internalize the structured, compositional abstractions necessary for expert-level reasoning. They posit that a bottom-up approach, grounded in explicit domain primitives and their relations (as encoded in KGs), is essential for developing such expertise.

The paper formalizes the use of KGs as scaffolds for curriculum generation. Each KG triple (head, relation, tail) represents a domain primitive, and multi-hop paths encode higher-order concepts. By traversing these paths, the pipeline generates reasoning tasks that require the model to compose primitives into complex inferences.
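This path structure can be sketched with a toy graph; the entity and relation names below are invented for illustration and are not taken from UMLS:

```python
from collections import defaultdict

# Hypothetical domain primitives: each (head, relation, tail) triple is one KG edge.
triples = [
    ("metformin", "treats", "type_2_diabetes"),
    ("type_2_diabetes", "causes", "peripheral_neuropathy"),
    ("peripheral_neuropathy", "presents_as", "numbness_in_feet"),
]

# Adjacency index: head entity -> list of outgoing (relation, tail) edges
graph = defaultdict(list)
for head, rel, tail in triples:
    graph[head].append((rel, tail))

def walk(start, hops):
    """Enumerate all paths of exactly `hops` edges starting at `start`.

    A 1-hop path is a single primitive; longer paths compose primitives
    into the higher-order concepts the curriculum targets.
    """
    if hops == 0:
        return [[start]]
    paths = []
    for rel, tail in graph.get(start, []):
        for rest in walk(tail, hops - 1):
            paths.append([start, rel] + rest)
    return paths
```

Here `walk("metformin", 3)` yields the full three-hop chain from drug to symptom, the kind of path a multi-step reasoning task would be synthesized from.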

Task Generation Pipeline

The pipeline consists of the following stages:

  1. Path Sampling: Multi-hop paths are sampled from the KG, with explicit control over path length (complexity) and node diversity to ensure broad coverage and steerable reasoning depth.
  2. QA Task Synthesis: Each path is mapped to a multiple-choice question using a backend LLM, with the question requiring reasoning along the entire path. Distractor options are generated to enforce discriminative reasoning.
  3. Thinking Trace Generation: For each QA pair, a detailed step-by-step reasoning trace is distilled from a strong LLM, grounded in the KG path.
  4. Quality and Correctness Filtering: Multi-stage filtering, including dual independent LLM graders, ensures that only high-quality, factually aligned (question, trace, answer) triplets are retained.
  5. Curriculum Assembly: The resulting dataset forms a bottom-up curriculum, with explicit annotation of reasoning complexity and domain coverage.

The pipeline is instantiated on the UMLS KG, yielding a curriculum of 24,000 high-quality medical reasoning tasks with associated thinking traces.
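The five stages can be sketched as a minimal skeleton; the LLM-backed steps (QA synthesis, trace distillation, dual grading) are replaced here with trivial placeholder templates, and all function names are hypothetical rather than the authors' implementation:

```python
import random

def synthesize_qa(path):
    """Stage 2 stub: in the paper a backend LLM maps a KG path to a
    multiple-choice question; here we merely template it."""
    question = f"Which finding can follow from {path[0]} via {' -> '.join(path[1::2])}?"
    answer = path[-1]
    distractors = ["distractor_a", "distractor_b", "distractor_c"]  # LLM-generated in the paper
    options = distractors + [answer]
    random.shuffle(options)
    return {"question": question, "options": options, "answer": answer}

def generate_trace(path):
    """Stage 3 stub: the paper distills a step-by-step trace from a strong LLM,
    grounded in the KG path; here we just spell out the path's edges."""
    return "\n".join(
        f"{path[i]} --{path[i + 1]}--> {path[i + 2]}"
        for i in range(0, len(path) - 2, 2)
    )

def passes_filters(qa, trace):
    """Stage 4 stub: the paper uses dual independent LLM graders;
    this placeholder only checks structural sanity."""
    return qa["answer"] in qa["options"] and len(trace) > 0

def build_curriculum(paths):
    """Stage 5: assemble (question, trace, answer) triplets,
    annotated with reasoning complexity (hop count)."""
    curriculum = []
    for path in paths:
        qa = synthesize_qa(path)
        trace = generate_trace(path)
        if passes_filters(qa, trace):
            qa["trace"] = trace
            qa["hops"] = (len(path) - 1) // 2  # edges in the path
            curriculum.append(qa)
    return curriculum
```

The hop-count annotation is what later allows curricula of increasing depth (1-hop, 1-2 hop, 1-3 hop) to be assembled from the same pool.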

Model Training and Inference

The QwQ-32B model is fine-tuned on the KG-grounded curriculum using supervised next-token prediction, with thinking traces enclosed in explicit delimiters so that the reasoning process itself is supervised. Three curriculum-tuned variants are produced, each incorporating increasing path depth and diversity (QwQ-Med-1: 1-hop, QwQ-Med-2: 1-2 hops, QwQ-Med-3: 1-3 hops).
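Formatting one curriculum item for supervised fine-tuning might look as follows; the `<think>` delimiter tokens are an assumption for illustration, as the summary states only that traces are explicitly delimited:

```python
def to_training_example(item):
    """Serialize a (question, trace, answer) triplet into a prompt/completion
    pair for next-token-prediction fine-tuning. The <think>...</think> tags
    are an assumed delimiter convention, not confirmed by the paper."""
    prompt = item["question"] + "\n" + "\n".join(item["options"])
    completion = f"<think>\n{item['trace']}\n</think>\n{item['answer']}"
    return {"prompt": prompt, "completion": completion}
```

Supervising the delimited span teaches the model to emit its composition of primitives before committing to an answer.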

At inference, two scaling strategies are explored:

  • Parallel Sampling: Multiple independent reasoning traces are generated in parallel, with majority voting for answer selection.
  • Iterative Refinement: Reasoning traces are extended via iterative self-reflection, prompting the model to re-examine its reasoning before finalizing an answer.
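The parallel-sampling strategy reduces to a majority vote over independent rollouts, which can be sketched as follows (`sample_fn` is a hypothetical stand-in for one stochastic generation from the model):

```python
from collections import Counter

def majority_vote(sample_fn, question, k=8):
    """Parallel inference-time scaling: draw k independent answers
    from the model and return the most common one."""
    answers = [sample_fn(question) for _ in range(k)]
    return Counter(answers).most_common(1)[0][0]
```

Because the k rollouts are independent, this strategy parallelizes trivially, whereas iterative refinement must extend a single trace sequentially.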

ICD-Bench: Evaluation Suite

ICD-Bench is introduced as a rigorous benchmark for domain-specific reasoning, comprising 3,675 QA items stratified across 15 ICD-10 medical categories. Each item is grounded in a multi-hop KG path, with complexity controlled by path length (2-5 hops). The benchmark is designed to probe both breadth (across sub-specialties) and depth (compositional reasoning).

Experimental Results

Inference-Time Scaling

  • Parallel scaling yields superior gains for curriculum-tuned models, especially as curriculum depth increases. Iterative refinement offers diminishing returns beyond shallow curricula.
  • Compute-optimality is achieved: Models trained on deeper, more diverse curricula reach higher accuracy with lower inference-time compute, indicating more efficient reasoning.

Domain-Specific Reasoning

  • QwQ-Med-3 outperforms all baselines (including proprietary models such as o3 and Gemini-2.5-Pro) across all ICD-Bench categories, with 10-20% higher accuracy in most domains.
  • Performance gains are most pronounced in less prevalent sub-specialties, where generalist models underperform due to limited representation in pretraining corpora.

Robustness to Task Difficulty

  • Curriculum-tuned models exhibit greater robustness across the full spectrum of task difficulty, with the performance gap over baselines widening on the hardest tasks.
  • Marginal gains from curriculum depth are concentrated on challenging tasks: Deep multi-hop exposure is essential for bridging the gap between recall and compositional reasoning.

Ablation: Depth vs. Diversity

  • Both path depth and diversity are critical: Deeper paths yield substantial gains, but balanced complexity sampling and broad entity coverage further improve performance and generalization.
  • Optimal curriculum depth is task-dependent: Shallow paths suffice for easy tasks, while deep multi-hop traces are necessary for the most complex reasoning.

Recall vs. Reasoning

  • Curriculum-tuned models effectively recall and utilize KG primitives in their reasoning traces, achieving high alignment with ground-truth paths.
  • Base models can recall but fail to compose: They retrieve relevant facts but cannot integrate them into coherent multi-step reasoning, highlighting the necessity of explicit compositional supervision.

Generalization to External Benchmarks

  • QwQ-Med-3 demonstrates strong transfer to external medical QA benchmarks (MedQA, MedMCQA, MMLU-Med, PubMedQA), matching or exceeding the performance of other open-source models.
  • Curriculum-tuned models generalize beyond the original KG, indicating that bottom-up acquisition of domain primitives supports broader medical reasoning.

Implications and Future Directions

Practical Implications

  • Domain-specific superintelligence can be achieved with relatively small models (e.g., 32B parameters) via bottom-up curriculum tuning, offering substantial reductions in training and inference energy costs compared to monolithic AGI-scale models.
  • KG-grounded curricula provide a scalable, verifiable framework for synthesizing high-quality reasoning data in domains with reliable KGs (e.g., medicine, chemistry, law).

Theoretical Implications

  • Bottom-up acquisition of compositional primitives is essential for deep expertise, challenging the sufficiency of top-down, generalist pretraining for superintelligent performance.
  • A modular, compositional model of AGI is proposed, wherein interacting domain-specific superintelligent agents, each grounded in their own KG-derived curricula, collectively achieve general intelligence via recursive composition and communication.

Limitations and Open Problems

  • Dependence on high-quality KGs: The approach is currently limited to domains with reliable, dense KGs. Extending to domains lacking such resources remains an open challenge.
  • Closed-ended task focus: The current pipeline generates multiple-choice questions; generating open-ended, real-world tasks from KGs is non-trivial.
  • Difficulty estimation: The use of oracle-based difficulty heuristics may not generalize; model-based difficulty metrics without ground-truth answers are needed.

Future Work

  • Integration with reinforcement learning: KG primitives can serve as verifiable rewards for RL-based curriculum optimization, enabling dense, simulatable training environments.
  • Process reward models (PRMs): Stepwise verification using KG primitives can further improve inference-time scaling and trace fidelity.
  • Expansion to other domains: Systematic curation of KGs and domain abstractions in fields beyond medicine is a key direction for generalizing the methodology.
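The first direction can be made concrete: since the gold KG path behind each task is known, a verifiable reward could score how many of its primitives a generated trace recovers. The sketch below is a hypothetical illustration of that idea, not the authors' proposal in detail:

```python
def kg_reward(trace, gold_path):
    """Fraction of gold KG primitives (head, relation, tail) whose
    components all appear in the model's reasoning trace.
    Substring matching is a deliberately crude placeholder for a
    real verifier (e.g., entity linking against the KG)."""
    edges = [tuple(gold_path[i:i + 3]) for i in range(0, len(gold_path) - 2, 2)]
    hits = sum(all(part in trace for part in edge) for edge in edges)
    return hits / len(edges)
```

A dense reward of this form is exactly what makes KG-grounded tasks attractive as simulatable RL environments: every intermediate step is checkable, not just the final answer.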

Conclusion

The paper demonstrates that explicit, bottom-up curriculum tuning using KG-derived reasoning tasks enables the emergence of domain-specific superintelligence in LLMs. The approach yields models that not only outperform state-of-the-art baselines on compositional reasoning benchmarks but also generalize to external tasks, all while being more compute- and energy-efficient. The results support a paradigm shift toward modular, compositional AGI architectures grounded in verifiable domain primitives, with KGs serving as a foundational abstraction for scalable, reliable expertise acquisition.
