
SOAEsV2-7B/72B: Full-Pipeline Optimization for State-Owned Enterprise LLMs via Continual Pre-Training, Domain-Progressive SFT and Distillation-Enhanced Speculative Decoding (2505.04723v1)

Published 7 May 2025 in cs.CL

Abstract: This study addresses key challenges in developing domain-specific LLMs for Chinese state-owned assets and enterprises (SOAEs), where current approaches face three limitations: 1) constrained model capacity that limits knowledge integration and cross-task adaptability; 2) excessive reliance on domain-specific supervised fine-tuning (SFT) data, which neglects the broader applicability of general language patterns; and 3) inefficient inference acceleration for large models processing long contexts. In this work, we propose SOAEsV2-7B/72B, a specialized LLM series developed via a three-phase framework: 1) continual pre-training integrates domain knowledge while retaining base capabilities; 2) domain-progressive SFT employs curriculum-based learning strategy, transitioning from weakly relevant conversational data to expert-annotated SOAEs datasets to optimize domain-specific tasks; 3) distillation-enhanced speculative decoding accelerates inference via logit distillation between 72B target and 7B draft models, achieving 1.39-1.52$\times$ speedup without quality loss. Experimental results demonstrate that our domain-specific pre-training phase maintains 99.8% of original general language capabilities while significantly improving domain performance, resulting in a 1.08$\times$ improvement in Rouge-1 score and a 1.17$\times$ enhancement in BLEU-4 score. Ablation studies further show that domain-progressive SFT outperforms single-stage training, achieving 1.02$\times$ improvement in Rouge-1 and 1.06$\times$ in BLEU-4. Our work introduces a comprehensive, full-pipeline approach for optimizing SOAEs LLMs, bridging the gap between general language capabilities and domain-specific expertise.


Summary

SOAEsV2-7B/72B: A Full-Pipeline Optimization for Domain-Specific LLMs

The development of domain-specific LLMs presents unique challenges and opportunities, particularly for Chinese state-owned assets and enterprises (SOAEs). These models must balance intricate domain-specific knowledge with the general language proficiency required for versatile applications such as intelligent decision support and professional report generation. This paper introduces the SOAEsV2-7B/72B model series, which addresses both requirements through a comprehensive three-phase framework.

Continual Pre-Training

Continual pre-training forms the bedrock of this approach, integrating domain-specific knowledge while preserving general language capabilities. Using a carefully curated 17B-token dataset derived from SOAEs-DataSuite, the pre-training corpus achieves a high degree of domain relevance through rigorous filtering and expert-guided augmentation. Crucially, this phase does not erode foundational language abilities: SOAEsV2-72B retains 99.8% of the general-language performance of its Qwen2.5-72B base model.
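The paper does not release its pre-training pipeline, but the core idea of interleaving domain-specific documents with replayed general-domain text, so that base capabilities are not overwritten, can be sketched as follows. The mixing ratio of 0.8 and the helper name are illustrative assumptions, not values taken from the paper.

```python
import random

def mixed_pretraining_stream(domain_docs, general_docs, domain_ratio=0.8, seed=0):
    """Yield a continual pre-training stream that interleaves domain documents
    with replayed general-domain documents.

    The 0.8 domain ratio is an illustrative assumption; the paper describes a
    curated 17B-token SOAEs corpus but does not publish an exact mixing ratio.
    The stream stops as soon as the sampled corpus is exhausted.
    """
    rng = random.Random(seed)
    domain_iter, general_iter = iter(domain_docs), iter(general_docs)
    while True:
        source = domain_iter if rng.random() < domain_ratio else general_iter
        try:
            yield next(source)
        except StopIteration:
            return

# Toy usage with placeholder documents
domain = ["SOAEs policy brief ...", "state-owned asset report ..."]
general = ["general web text ...", "encyclopedia article ..."]
for doc in mixed_pretraining_stream(domain, general):
    print(doc[:40])
```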

Domain-Progressive Supervised Fine-Tuning (SFT)

The second phase, domain-progressive SFT, employs a curriculum-based strategy that methodically transitions from weakly relevant conversational data to expert-annotated SOAEs datasets. Unlike single-stage fine-tuning, this progressive approach mitigates the abrupt distribution shifts typical of domain tuning and improves adaptation. After both stages, the models reach 43.08 Rouge-1 and 25.11 BLEU-4, and the ablation studies show the progressive curriculum outperforming single-stage training (a 1.02× improvement in Rouge-1 and 1.06× in BLEU-4).
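As a rough illustration of the two-stage curriculum, the progression can be expressed as an ordered schedule whose output checkpoint feeds the next stage. The stage names, file paths, and hyperparameters below are hypothetical placeholders; the paper does not publish them.

```python
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class SFTStage:
    name: str
    dataset_path: str      # hypothetical paths; not released with the paper
    learning_rate: float   # illustrative values, not reported hyperparameters
    epochs: int

# Curriculum mirroring the paper's progression from weakly relevant
# conversational data to expert-annotated SOAEs data.
CURRICULUM = [
    SFTStage("stage1_general_dialogue", "data/general_conversations.jsonl", 2e-5, 1),
    SFTStage("stage2_soaes_expert", "data/soaes_expert_annotations.jsonl", 1e-5, 2),
]

def run_domain_progressive_sft(model: Any, train_fn: Callable[[Any, SFTStage], Any]) -> Any:
    """Apply each SFT stage in order, carrying the weights forward so stage 2
    starts from the conversationally adapted checkpoint, not the base model."""
    for stage in CURRICULUM:
        model = train_fn(model, stage)
    return model

# Toy usage: in practice train_fn would wrap an actual SFT trainer.
if __name__ == "__main__":
    dummy_train = lambda m, s: m + [s.name]
    print(run_domain_progressive_sft([], dummy_train))
```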

Distillation-Enhanced Speculative Decoding

To address the inference inefficiencies of very large models, the authors introduce a distillation-enhanced speculative decoding technique that substantially accelerates generation. A lighter 7B draft model, distilled from the 72B target, proposes tokens that the target then verifies; this cooperative inference yields a 1.39 to 1.52× speedup without sacrificing output quality. The key ingredient is logit-level distillation, which pulls the draft model's output distribution toward the target's so that drafted tokens are accepted at a high rate.
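A minimal sketch of a logit-level distillation objective, assuming a standard KL divergence between the 72B teacher's and 7B draft's next-token distributions; the temperature value is illustrative, and the paper's exact loss formulation and hyperparameters are not reproduced here.

```python
import math

def softmax(logits, temperature=1.0):
    """Convert raw logits to a probability distribution."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def logit_distillation_loss(draft_logits, target_logits, temperature=2.0):
    """KL(target || draft) over the vocabulary at a single position.

    Minimizing this pulls the 7B draft model's next-token distribution toward
    the 72B target's, which raises the acceptance rate during speculative
    decoding. The temperature of 2.0 is an assumed illustrative value.
    """
    p = softmax(target_logits, temperature)  # teacher (72B target)
    q = softmax(draft_logits, temperature)   # student (7B draft)
    eps = 1e-12
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

# Toy example over a 4-token vocabulary
print(logit_distillation_loss([1.0, 0.5, -0.2, 0.1], [1.2, 0.4, -0.3, 0.0]))
```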

Implications and Future Work

The methodological advances presented are of notable significance, integrating both practical and theoretical improvements in LLM optimization. Practically, the framework provides SOAEs practitioners with tools for deploying LLMs that align with industry needs for knowledge depth while maintaining operational efficiency. Theoretically, the comprehensive pipeline approach redefines strategies for domain specificity in LLM training, establishing a viable path for future applications in other specialized domains such as finance and healthcare.

As LLMs progressively underpin more aspects of industrial intelligence transformation, this paper offers a substantive contribution to bridging the gap between broad language capabilities and domain-centric expertise. Future developments could explore extending these methods across various verticals and potentially integrating them with emerging developments in AI interoperability and cross-domain knowledge synthesis.
