Code Digital Twin: Empowering LLMs with Tacit Knowledge for Complex Software Development (2510.16395v2)

Published 18 Oct 2025 in cs.SE

Abstract: Recent advances in LLMs have demonstrated strong capabilities in software engineering tasks, raising expectations of revolutionary productivity gains. However, enterprise software development is largely driven by incremental evolution, where challenges extend far beyond routine coding and depend critically on tacit knowledge, including design decisions at different levels and historical trade-offs. To achieve effective AI-powered support for complex software development, we should align emerging AI capabilities with the practical realities of enterprise development. To this end, we systematically identify challenges from both software and LLM perspectives. Alongside these challenges, we outline opportunities where AI and structured knowledge frameworks can enhance decision-making in tasks such as issue localization and impact analysis. To address these needs, we propose the Code Digital Twin, a living framework that models both the physical and conceptual layers of software, preserves tacit knowledge, and co-evolves with the codebase. By integrating hybrid knowledge representations, multi-stage extraction pipelines, incremental updates, LLM-empowered applications, and human-in-the-loop feedback, the Code Digital Twin transforms fragmented knowledge into explicit and actionable representations. Our vision positions it as a bridge between AI advancements and enterprise software realities, providing a concrete roadmap toward sustainable, intelligent, and resilient development and evolution of ultra-complex systems.

Summary

The paper introduces the Code Digital Twin framework to augment LLMs by integrating tacit software knowledge for complex development tasks.
It employs a multi-stage extraction pipeline to convert undocumented design rationales and historical decisions into explicit, actionable resources.
Preliminary case studies show improved accuracy in issue localization and more efficient application generation compared to traditional LLM-agent frameworks.

Empowering LLMs with Tacit Knowledge for Complex Software Development

The paper "Code Digital Twin: Empowering LLMs with Tacit Knowledge for Complex Software Development" (2510.16395) addresses the challenges faced in leveraging LLMs for complex software engineering tasks. The authors propose a new framework called the Code Digital Twin, aimed at bridging the gap between AI capabilities and enterprise software development, where tacit knowledge plays a crucial role.

Introduction

Recent advances in LLMs have delivered productivity gains in some software engineering (SE) tasks, such as code completion and bug detection. However, many enterprise software development challenges stem from incremental evolution and from a dependence on tacit knowledge, such as design decisions and historical trade-offs, that LLMs alone cannot supply. Supporting complex software development therefore requires aligning AI capabilities with the practical realities of enterprise development, acknowledging both the limitations and the opportunities involved.

Challenges and Opportunities

The paper identifies 11 key challenges in complex software development from both software and LLM perspectives. From the software perspective, challenges include system complexity, lack of explicit conceptual representation, historical knowledge loss, and socio-technical dependencies. From the LLM perspective, challenges involve task formalization, context-aware reasoning, trust and human oversight, and intelligent assistant design. The authors pair these challenges with actionable opportunities, such as maintaining explicit dependency graphs, capturing undocumented knowledge, and employing LLMs as augmented intelligence rather than as fully autonomous agents.
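To make the "explicit dependency graph" opportunity concrete, the following is a minimal sketch of impact analysis as reverse reachability over such a graph. It is an illustration only, not the paper's implementation; the module names and the helper functions `build_reverse_graph` and `impacted_by` are hypothetical.

```python
from collections import defaultdict, deque


def build_reverse_graph(depends_on):
    """Invert a 'module -> modules it depends on' mapping."""
    dependents = defaultdict(set)
    for module, deps in depends_on.items():
        for dep in deps:
            dependents[dep].add(module)
    return dependents


def impacted_by(changed, depends_on):
    """Return every module that directly or transitively depends on `changed`."""
    dependents = build_reverse_graph(depends_on)
    seen = set()
    queue = deque([changed])
    while queue:
        current = queue.popleft()
        for parent in dependents.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen


# Hypothetical module names, for illustration only.
DEPENDS_ON = {
    "billing": ["db", "auth"],
    "auth": ["db"],
    "reports": ["billing"],
    "db": [],
}

print(sorted(impacted_by("db", DEPENDS_ON)))
```

With the graph kept explicit, a change to a low-level module can be traced to everything that may need review, which is the kind of question that is hard to answer from source text alone.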

Code Digital Twin Framework

The proposed Code Digital Twin framework models both the physical and conceptual layers of software systems and is designed to evolve with the codebase. It integrates source code, design rationales, dependencies, and more in a unified structure. By employing hybrid knowledge representations and using extraction pipelines, the framework aims to transform tacit and fragmented knowledge into explicit, actionable resources. The framework also supports human-in-the-loop feedback and LLM-empowered applications, providing developers with context-aware collaboration and enhancing decision-making for tasks such as issue localization and impact analysis.
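One way to picture a hybrid representation that links a physical layer to a conceptual layer is sketched below. This is an assumption-laden illustration rather than the data model used in the paper: the class names `CodeElement`, `Concept`, and `TwinSketch`, and the example file and rationale, are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class CodeElement:
    """Physical layer: a concrete artifact in the repository."""
    path: str
    symbol: str


@dataclass
class Concept:
    """Conceptual layer: a named idea plus its free-text design rationale."""
    name: str
    rationale: str
    implemented_by: list = field(default_factory=list)  # structured links


class TwinSketch:
    """Hypothetical container tying concepts to the code that realizes them."""

    def __init__(self):
        self.concepts = {}

    def add_concept(self, name, rationale):
        self.concepts[name] = Concept(name, rationale)

    def link(self, concept_name, element):
        self.concepts[concept_name].implemented_by.append(element)

    def context_for(self, path):
        """Collect rationale text for every concept linked to a file."""
        lines = []
        for concept in sorted(self.concepts.values(), key=lambda c: c.name):
            if any(e.path == path for e in concept.implemented_by):
                lines.append(f"{concept.name}: {concept.rationale}")
        return "\n".join(lines)


# Illustrative data only.
twin = TwinSketch()
twin.add_concept("token caching", "Tokens are cached in memory to avoid slow DB lookups.")
twin.link("token caching", CodeElement("auth/token.py", "TokenCache"))
print(twin.context_for("auth/token.py"))
```

The text returned by `context_for` could, for instance, be prepended to an LLM prompt so that the model sees the recorded rationale alongside the code it is asked to modify.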

Methodology

The Code Digital Twin is constructed through a multi-stage extraction pipeline. The pipeline combines structured representations, such as knowledge graphs, with unstructured text drawn from source files, commit messages, and developer discussions. The twin is updated incrementally so that it co-evolves with the software system, staying aligned with the changing codebase and enabling practical applications such as context-aware development support and autonomous system maintenance.
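The idea of staged extraction with incremental updates can be sketched as follows. This is a simplified stand-in, not the paper's pipeline: a keyword heuristic takes the place of whatever extraction the authors use, and the commit record layout, `RATIONALE_CUES`, `extract_rationales`, and `update_store` are all assumptions made for illustration.

```python
import re

# Assumed cue phrases; a real pipeline would likely use an LLM or richer analysis.
RATIONALE_CUES = re.compile(
    r"\b(because|so that|in order to|instead of|trade-?off|workaround)\b",
    re.IGNORECASE,
)


def extract_rationales(commits):
    """Extraction stage: keep commits whose message carries a rationale cue."""
    for commit in commits:
        if RATIONALE_CUES.search(commit["message"]):
            yield commit


def update_store(store, commits, processed):
    """Merge stage: add rationale from commits not seen before, keyed by file."""
    new_commits = (c for c in commits if c["id"] not in processed)
    for commit in extract_rationales(new_commits):
        for path in commit["files"]:
            store.setdefault(path, []).append(commit["message"])
    processed.update(c["id"] for c in commits)
    return store


# Hypothetical commit records.
COMMITS = [
    {"id": "a1", "message": "Cache tokens in memory because the DB lookup was too slow",
     "files": ["auth/token.py"]},
    {"id": "b2", "message": "Bump version", "files": ["setup.py"]},
]

store, processed = {}, set()
update_store(store, COMMITS, processed)
update_store(store, COMMITS, processed)  # second run adds nothing new
print(store)
```

Tracking processed commit identifiers is what keeps the update incremental: rerunning the pipeline over the full history only touches commits that arrived since the last run.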

Preliminary Results

The framework's utility was demonstrated through two case studies: issue localization and application generation. For issue localization, experiments on large repositories showed that incorporating concept-functionality knowledge improved LLMs' accuracy in identifying defective code elements. For application generation, structuring dependencies and providing coherent guidance across features made the development process more efficient, and the approach outperformed existing LLM-agent frameworks.
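As a rough illustration of how concept-functionality knowledge might narrow issue localization, the sketch below ranks files by word overlap between an issue report and a per-file concept description. It is a toy stand-in for the LLM-based approach summarized above, not a reproduction of it; `tokens`, `rank_candidates`, and the index contents are hypothetical.

```python
import re


def tokens(text):
    """Lowercase alphabetic words in a piece of text."""
    return set(re.findall(r"[a-z]+", text.lower()))


def rank_candidates(issue, concept_index):
    """Rank files by overlap between the issue text and concept descriptions.

    concept_index maps a file path to a short description of the concepts
    and functionality that file implements.
    """
    issue_tokens = tokens(issue)
    scored = []
    for path, description in concept_index.items():
        overlap = len(issue_tokens & tokens(description))
        scored.append((overlap, path))
    # Highest overlap first; ties broken by path for determinism.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [path for overlap, path in scored if overlap > 0]


# Hypothetical concept index.
CONCEPT_INDEX = {
    "auth/token.py": "session token refresh and expiry handling",
    "billing/invoice.py": "invoice totals and tax rounding",
    "ui/theme.py": "color theme settings",
}

ISSUE = "Users are logged out early: token expiry seems wrong after refresh"
print(rank_candidates(ISSUE, CONCEPT_INDEX))
```

The point of the sketch is the data flow: issue text is matched against conceptual descriptions rather than raw code, so only a shortlist of candidates needs closer inspection.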

Conclusion

This research introduces an approach for augmenting LLM capabilities with the tacit knowledge essential for complex software development. With the Code Digital Twin, the paper aims to bridge the gap between AI advancements and the practical realities of enterprise software systems. Future research should further explore the scalability, adaptability, and long-term evolution of the framework, strengthening AI's role in knowledge-centric software development.