Knowledge Distillation of Domain-adapted LLMs for Question-Answering in Telecom (2504.20000v1)

Published 28 Apr 2025 in cs.CL, cs.IR, and cs.LG

Abstract: Knowledge Distillation (KD) is one of the approaches to reduce the size of LLMs. An LLM with a smaller number of parameters (the student) is trained to mimic the performance of a larger LLM (the teacher) on a specific task. For domain-specific tasks, it is not clear whether the teacher, the student, or both must be considered for domain adaptation. In this work, we study this problem from the perspective of the telecom-domain Question-Answering (QA) task. We systematically experiment with Supervised Fine-tuning (SFT) of the teacher only, SFT of the student only, and SFT of both prior to KD. We design experiments to study the impact of vocabulary (same and different) and KD algorithms (vanilla KD and Dual Space KD, DSKD) on the distilled model. Multi-faceted evaluation of the distillation using 14 different metrics (N-gram, embedding, and LLM-based metrics) is considered. Experimental results show that SFT of the teacher improves the performance of the distilled model when both models have the same vocabulary, irrespective of algorithm and metric. Overall, SFT of both teacher and student results in better performance across all metrics, although the statistical significance of this result depends on the vocabulary of the teacher models.

Essay on Knowledge Distillation of Domain-Adapted LLMs for Question-Answering in Telecom

The paper "Knowledge Distillation of Domain-adapted LLMs for Question-Answering in Telecom" investigates the nuanced application of Knowledge Distillation (KD) for refining LLMs, specifically tailored for the telecommunications domain, within a question-answering framework. KD serves as a pragmatic approach to compress the size of LLMs while preserving task-specific performance, presenting a critical tool in enhancing model efficiency for specialized domains.

The research centers on the KD methodology in which a smaller "student" model is trained to emulate the competencies of a larger "teacher" model. This process is examined through the lens of telecom domain adaptation, a field where the intricacies of technical language demand careful fine-tuning. The authors designed experiments to analyze the influence of Supervised Fine-tuning (SFT) applied to the teacher model only, the student model only, or both prior to KD. In addition, the impact of vocabulary overlap between the two models (same versus different vocabularies) and of the KD algorithm (vanilla KD versus Dual Space KD, DSKD) is evaluated.
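To make the vanilla KD objective concrete, the sketch below shows a typical soft-label distillation loss in PyTorch: a temperature-scaled KL divergence between the teacher's and student's token distributions, mixed with the standard cross-entropy on the ground-truth answer tokens. The function, temperature, and mixing weight `alpha` are illustrative assumptions rather than the paper's exact implementation, and DSKD's projection of teacher and student representations into a shared space is not shown.

```python
import torch
import torch.nn.functional as F

def vanilla_kd_loss(student_logits, teacher_logits, labels,
                    temperature=2.0, alpha=0.5, ignore_index=-100):
    """Illustrative vanilla KD loss (a sketch, not the paper's code).

    student_logits, teacher_logits: [batch, seq_len, vocab] -- assumes a
    shared vocabulary so the logit dimensions align.
    labels: [batch, seq_len] ground-truth token ids for the QA response.
    """
    # Soft targets: temperature-scaled KL between teacher and student.
    t = temperature
    kd = F.kl_div(
        F.log_softmax(student_logits / t, dim=-1),
        F.softmax(teacher_logits / t, dim=-1),
        reduction="batchmean",
    ) * (t * t)

    # Hard targets: standard next-token cross-entropy on the labels.
    ce = F.cross_entropy(
        student_logits.view(-1, student_logits.size(-1)),
        labels.view(-1),
        ignore_index=ignore_index,
    )

    # Mix soft and hard losses; alpha is a tunable assumption.
    return alpha * kd + (1.0 - alpha) * ce
```

Note that this soft-label formulation presumes the teacher and student share a vocabulary so their output distributions are directly comparable; handling mismatched vocabularies is precisely where algorithms such as DSKD depart from vanilla KD.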

The paper's evaluation is multi-dimensional, employing 14 distinct metrics spanning N-gram metrics, embedding-based metrics, and LLM-based metrics. This comprehensive strategy ensures a robust analysis of the effects of distillation on model performance and uncovers how domain adaptation through SFT affects the distilled model.
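As an illustration of how the first two metric families can be computed, the sketch below implements a simple unigram-overlap F1 (an N-gram-style metric) and an embedding-based cosine similarity using the sentence-transformers library. The embedding model name is an illustrative assumption; the paper's exact metric suite and LLM-based judging setup are not reproduced here.

```python
from collections import Counter

from sentence_transformers import SentenceTransformer, util  # embedding metric


def unigram_f1(prediction: str, reference: str) -> float:
    """N-gram-family metric: unigram F1 between answer strings."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def embedding_similarity(prediction: str, reference: str,
                         model_name: str = "all-MiniLM-L6-v2") -> float:
    """Embedding-family metric: cosine similarity of sentence embeddings.

    The model name is an illustrative choice, not the one used in the paper.
    """
    model = SentenceTransformer(model_name)
    emb = model.encode([prediction, reference], convert_to_tensor=True)
    return util.cos_sim(emb[0], emb[1]).item()


if __name__ == "__main__":
    ref = "5G NR uses OFDM with flexible numerology in the downlink."
    pred = "The 5G NR downlink is based on OFDM with a flexible numerology."
    print("unigram F1:", unigram_f1(pred, ref))
    print("embedding similarity:", embedding_similarity(pred, ref))
```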

The key findings indicate that SFT of the teacher model improves the distilled model's performance when the teacher and student share the same vocabulary, regardless of the KD algorithm or evaluation metric used. Moreover, applying SFT to both teacher and student yields the best performance across all metrics, although the statistical significance of this improvement depends on the vocabulary of the teacher model. The statistical analyses provided reinforce these outcomes and underline the importance of choosing where to apply SFT in the KD pipeline.

The implications of this research are twofold. Practically, the paper paves the way for more efficient deployment of domain-specific LLMs, particularly in settings where computational resources are limited. Theoretically, it opens avenues for refining KD methods, potentially influencing future work on scalability and effectiveness across diverse domains.

For future work, the paper suggests exploring larger teacher models, integration with Mixture of Experts models, and application to domains beyond telecom, such as code generation and complex agent-driven interactions. These directions encourage further investigation into optimizing LLMs for specialized, resource-constrained environments.

In conclusion, the research offers a detailed exploration of KD for domain-specific language models, yielding practical insights into model adaptation strategies and evaluation. This work contributes to the ongoing development of efficient, specialized AI applications and serves as a useful reference for researchers and practitioners aiming to distill LLMs for technical domains.

Authors (5)
  1. Rishika Sen (1 paper)
  2. Sujoy Roychowdhury (9 papers)
  3. Sumit Soman (18 papers)
  4. H. G. Ranjani (16 papers)
  5. Srikhetra Mohanty (1 paper)