KOCO-BENCH: Domain Specialization Benchmark
- KOCO-BENCH is a comprehensive benchmark that defines novel evaluation tasks and curated corpora for assessing LLMs' acquisition and application of domain knowledge.
- It integrates six emerging software domains, 11 frameworks, and 25 real-world projects to support both code generation and domain knowledge understanding tasks.
- Empirical results reveal that advanced LLMs struggle with API utilization and rule adherence, highlighting the need for improved domain specialization techniques.
KOCO-BENCH is a comprehensive benchmark designed to rigorously evaluate domain specialization methods for LLMs in real-world software development contexts. Unlike prior efforts that primarily assess “what” domain-specific knowledge LLMs already possess, KOCO-BENCH introduces curated knowledge corpora and evaluation tasks to measure both acquisition and application of new, previously unseen domain knowledge. It features six emerging software domains, 11 frameworks, and 25 real-world projects, incorporating both multi-granularity code-generation and domain knowledge understanding tasks, all supported by scalable, executable test suites. Empirical evaluation reveals that state-of-the-art LLMs and leading domain specialization techniques face substantial challenges, particularly in utilizing APIs, adhering to domain-specific rules, and handling project-scale integration, thereby highlighting urgent needs for methodological advancement (Jiang et al., 19 Jan 2026).
1. Coverage of Domains, Frameworks, and Projects
KOCO-BENCH is structured across six representative and emerging domains in contemporary software development:
| Domain | Frameworks | Projects (Count) |
|---|---|---|
| Reinforcement Learning (RL) | VeRL, Open-R1 | 12 (prime, PURE, ARES, etc.) |
| Agent | Smolagents | 4 (DeepSearchAgents, etc.) |
| Retrieval-Augmented Generation (RAG) | RAG-anything | 5 (BookWorm, etc.) |
| Model Optimization (MO) | TensorRT | 4 (FlagScale, Nemo, etc.) |
| Embodied AI | VSLAM-LAB, cosmos-rl, robocasa, trackerLab | Covered via Q&A only |
| Ascend Ecosystem | ascend-transformer-boost (Python), triton-ascend (C++) | Q&A only |
A total of 11 Python and C++ frameworks are included, anchoring 25 curated real-world projects. The RL domain alone contributes seven projects on VeRL and five on Open-R1, while uniquely complex spaces, such as the Ascend Ecosystem and Embodied AI, are assessed via knowledge-understanding tasks only. This breadth enables robust evaluation of domain knowledge transfer and specialization at both the framework and project levels (Jiang et al., 19 Jan 2026).
2. Knowledge Corpus Structure and Domain Knowledge Representation
Each framework in KOCO-BENCH is accompanied by an explicit, curated knowledge corpus consisting of:
- Official framework documentation (Markdown, HTML)
- Framework source code (Python or C++ classes and APIs)
- Usage examples (tutorials, scripts)
The median corpus size is 77,000 lines, with some exceeding 400,000 lines. The corpora cover explicit API signatures, documentation, implicit operational rules (such as data-format conventions), and recurrent software patterns (e.g., RL training/inference loops, RAG indexing workflows). These knowledge corpora simulate the “new” domain information that LLMs are expected to acquire through methods such as supervised fine-tuning (SFT) or retrieval-augmented generation (RAG), rather than relying solely on prior training exposure (Jiang et al., 19 Jan 2026).
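Preparing such a corpus for retrieval-based specialization typically involves splitting documentation, source files, and examples into retrievable chunks. The sketch below is a minimal illustration of that preprocessing step; the file layout, extensions, and chunking scheme are assumptions for illustration, not the benchmark's own tooling.

```python
from pathlib import Path

def load_corpus_chunks(corpus_dir, chunk_lines=40):
    """Split a framework corpus (docs + source + examples) into
    fixed-size line chunks for retrieval. Layout is hypothetical."""
    chunks = []
    for path in sorted(Path(corpus_dir).rglob("*")):
        if not path.is_file():
            continue
        if path.suffix not in {".md", ".html", ".py", ".cpp", ".hpp"}:
            continue
        lines = path.read_text(errors="ignore").splitlines()
        for start in range(0, len(lines), chunk_lines):
            body = "\n".join(lines[start:start + chunk_lines])
            if body.strip():
                chunks.append({"source": str(path), "text": body})
    return chunks
```

Line-based chunking is the simplest choice; a real pipeline might instead split on headings or function boundaries to keep API signatures intact within a chunk.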
3. Evaluation Tasks: Code Generation and Knowledge Understanding
KOCO-BENCH provides two complementary task types:
A. Domain Code Generation
- Input: Project or module descriptions, module division specifications, core function signatures and natural language descriptions.
- Output: Generated code implementing the required function, module, or entire project-level pipeline.
- Verification: Automated unit tests (∼8.6 per core function, branch coverage ensured), integration tests (∼2.3 per project), and Docker-based execution environments ensure functional correctness across granularities:
- Function-level
- Module-level
- Project-level (end-to-end pipelines)
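A function-level check with branch coverage might look like the sketch below. The target function (`normalize_rewards`) and its specification are invented here for illustration; the benchmark's actual tests are validated against reference code.

```python
# Hypothetical function-level unit tests in the style described above:
# one test per typical path, one per degenerate branch.

def normalize_rewards(rewards):
    """Reference behaviour (assumed): scale rewards to zero mean, unit
    std; a constant list maps to all zeros (the degenerate branch)."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    if var == 0:                       # degenerate branch
        return [0.0] * len(rewards)
    std = var ** 0.5
    return [(r - mean) / std for r in rewards]

def test_typical_case():
    out = normalize_rewards([1.0, 2.0, 3.0])
    assert abs(sum(out)) < 1e-9        # zero mean

def test_constant_branch():
    assert normalize_rewards([5.0, 5.0]) == [0.0, 0.0]  # covers var == 0
```

Running such tests under `coverage.py` with branch tracking enabled is how full branch coverage, as the benchmark requires, would be verified.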
B. Domain Knowledge Understanding
- Focused on frameworks or domains without direct code-generation candidates (e.g., Embodied AI, Ascend Ecosystem).
- Atomic, verifiable, single- or multi-choice Q&A (on average 3.5 single-choice and 14.3 multi-choice questions per framework; 107 questions total).
- Answers are directly traceable to the knowledge corpus.
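Grading such items reduces to exact-match scoring over answer sets, with multi-choice questions requiring the full set. The items below are invented for illustration; only the single-/multi-choice shapes follow the benchmark's description.

```python
# Hypothetical knowledge-understanding items (contents invented).
qa_items = [
    {"type": "single", "options": "ABCD",  "answer": {"B"}},
    {"type": "multi",  "options": "ABCDE", "answer": {"A", "C", "D"}},
]

def grade(items, predictions):
    """ACC: proportion of exactly matched answers; a multi-choice item
    counts as correct only if the predicted set equals the gold set."""
    correct = sum(set(p) == item["answer"]
                  for item, p in zip(items, predictions))
    return correct / len(items)
```

For example, predicting `{"B"}` and `{"A", "C"}` scores 0.5: the partial multi-choice answer earns no credit under exact-match.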
Unlike prior benchmarks, KOCO-BENCH requires LLMs to explicitly extract, learn, and apply framework-level definitions, API constraints, and tacit domain rules from the supplied corpora, generating code or answering domain-targeted questions as required (Jiang et al., 19 Jan 2026).
4. Evaluation Methodology and Metrics
Rigorous evaluation protocols are central to KOCO-BENCH:
- Test Suite Construction: All unit and integration tests are hand-crafted or agent-generated, validated against reference code, achieving full branch coverage (via coverage.py).
- Metric Definitions:
- Pass@1: Fraction of tasks where the first model output passes all test cases.
- AvgPassRate (APR): Average proportion of tests passed per generated code sample.
- Pass@any: Probability that at least one out of k sampled generations passes all tests.
- Accuracy (ACC): Proportion of correctly answered Q&A items.
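These metrics can be computed directly from per-task test outcomes. The sketch below implements Pass@1 and APR as defined above, and estimates Pass@any with the standard unbiased pass@k estimator (1 − C(n−c, k)/C(n, k)); treating Pass@any this way is an assumption, since the source does not spell out its estimator.

```python
from math import comb

def pass_at_1(first_sample_passed):
    """Pass@1: fraction of tasks whose first generation passes all tests."""
    return sum(first_sample_passed) / len(first_sample_passed)

def avg_pass_rate(per_sample_test_fractions):
    """APR: mean fraction of test cases passed per generated sample."""
    return sum(per_sample_test_fractions) / len(per_sample_test_fractions)

def pass_at_any(n, c, k):
    """Probability that at least one of k samples passes, estimated from
    n generated samples of which c passed (unbiased pass@k estimator)."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

With n = 2 samples of which c = 1 passed, `pass_at_any(2, 1, 1)` gives 0.5, matching the intuition that a single draw succeeds half the time.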
Retrieval-Augmented Generation (RAG) formalism: let $q$ denote the natural language requirement and $\mathcal{K}$ the knowledge corpus. Retrieval is performed using BM25, selecting the top-$k$ chunks $\{c_1, \dots, c_k\} \subseteq \mathcal{K}$. The generation is:

$$y \sim p_\theta(y \mid q, c_1, \dots, c_k)$$

kNN-LM smoothing blends model and retrieved-neighbor probabilities:

$$p(y_t \mid x) = \lambda\, p_{\mathrm{kNN}}(y_t \mid x) + (1 - \lambda)\, p_{\mathrm{LM}}(y_t \mid x),$$

where $p_{\mathrm{kNN}}$ is computed with a similarity kernel over hidden states.
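The kNN-LM smoothing step can be sketched numerically as follows. The Gaussian kernel over L2 distances stands in for the similarity kernel over hidden states, and the datastore, shapes, and interpolation weight are illustrative assumptions, not the benchmark's configuration.

```python
import numpy as np

def knn_lm_probs(p_lm, hidden, datastore_keys, datastore_vals,
                 vocab_size, lam=0.25, temperature=1.0):
    """Blend base-LM next-token probabilities with a kNN distribution
    built from (hidden-state key, next-token value) pairs.
    p_lm: (vocab_size,) base model distribution.
    hidden: (d,) current hidden state; datastore_keys: (N, d)."""
    d2 = np.sum((datastore_keys - hidden) ** 2, axis=1)  # squared L2
    w = np.exp(-d2 / temperature)                        # kernel weights
    p_knn = np.zeros(vocab_size)
    np.add.at(p_knn, datastore_vals, w)                  # scatter-add per token
    p_knn /= p_knn.sum()
    return lam * p_knn + (1.0 - lam) * p_lm
```

The effect is that tokens recorded near the current hidden state in the datastore get boosted relative to the base distribution, without any parameter updates to the model.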
All submitted generations must execute successfully within isolated Docker environments to guarantee reproducibility and environment fidelity (Jiang et al., 19 Jan 2026).
5. Experimental Results and Observed Challenges
Out-of-the-box LLMs: Leading proprietary LLMs exhibit low Pass@1 (≤8.5%) and moderate APR (≤25.3%) on code generation. Q&A accuracy peaks at 53%. All models fail completely on the RAG domain (0% correct API calls). Key figures:
| Model | Avg Pass@1 | APR | ACC (Q&A) |
|---|---|---|---|
| Gemini 2.5 Pro | 8.5% | 16.8% | ≈53% |
| Claude Sonnet 4.5 | 6.1% | 21.2% | n/a |
| GPT-5 Mini | 6.6% | 25.3% | n/a |
Domain Specialization Methods: SFT, LoRA, RAG, and kNN-LM yield marginal improvements. Notably, RAG yields Pass@1 of 7.4% and Q&A accuracy of 30.3%, outperforming SFT and LoRA, but overall gains remain minor and inconsistent. Increasing corpus size unexpectedly hinders SFT/LoRA performance but leaves RAG largely unaffected.
Agentic Systems: Claude Code achieves a higher Pass@1 at 34.2% (APR 49.3%), and 62.5% Pass@1 on RAG tasks, but at a token cost of ≈620,000 per sample. Open-source agents (SWE-Agent, OpenHands) average 4–4.5% Pass@1.
Persistent Failure Modes:
- Hallucinated/invalid API calls (~33% of errors)
- Violations of data-format or range constraints
- Attribute, type, or key errors; undefined variables
- Cross-module interface mismatches at project level
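Failure modes like these are typically tallied by triaging execution logs. The sketch below classifies tracebacks into rough buckets matching the categories above; the regex patterns and labels are illustrative assumptions, not the benchmark's own analysis tooling.

```python
import re

# Hypothetical triage of execution logs into the failure modes listed
# above; first matching pattern wins.
FAILURE_PATTERNS = [
    ("hallucinated_api", re.compile(r"has no attribute|cannot import name")),
    ("name_error",       re.compile(r"NameError")),
    ("type_key_error",   re.compile(r"TypeError|KeyError")),
    ("constraint",       re.compile(r"ValueError|out of range")),
]

def classify_failure(traceback_text):
    for label, pattern in FAILURE_PATTERNS:
        if pattern.search(traceback_text):
            return label
    return "other"
```

Pattern order matters: an `AttributeError` from calling a nonexistent API is counted as a hallucinated call before the generic type/key bucket is consulted.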
A plausible implication is that LLMs trained for generic programming do not reliably extract or generalize domain-specific operational constraints from unfamiliar corpora, especially when explicit schema or symbolic reasoning is required (Jiang et al., 19 Jan 2026).
6. Insights, Methodological Limitations, and Recommendations
KOCO-BENCH results demonstrate that SOTA LLMs fail to reliably acquire and apply new domain knowledge, as measured by pass rates on unseen real-world code tasks. Conventional domain specialization mechanisms—SFT, RAG, kNN-LM—yield limited progress; RAG assists with factual retrieval but does not resolve failures involving implicit rules, data-format conventions, or function-level constraints.
Agentic systems, while promising, incur substantial inference cost and continue to misapply APIs and violate domain rules. Continual learning across frameworks within one domain is effective for retention, whereas cross-domain sequential fine-tuning introduces catastrophic forgetting. Notably, larger corpora degrade performance with SFT/LoRA, suggesting sensitivity to corpus pruning and distillation.
Recommended future directions:
- Develop code-centric domain specialization techniques with explicit API contracts, constraint modeling, and multi-file interface handling.
- Integrate symbolic reasoning and type-checkers during code generation to enforce correctness.
- Employ continual-learning algorithms with rehearsal or adapters to mitigate cross-domain forgetting.
- Explore hybrid retrieval-generation architectures combining extended context windows with structured API schemata.
By providing tightly-aligned knowledge corpora, multi-granularity executable tasks, and corpus-grounded Q&A, KOCO-BENCH establishes a robust, principled framework for evaluating not only what LLMs already know but how they learn and operationalize new domain knowledge in software development (Jiang et al., 19 Jan 2026).