Efficient Knowledge Distillation Through Low-Rank Clone: A Technical Overview
The paper "A Token is Worth over 1,000 Tokens: Efficient Knowledge Distillation through Low-Rank Clone" addresses the challenge of training Small LLMs (SLMs) that maintain performance comparable to larger teacher models while significantly reducing computational and memory demands. By leveraging a novel technique called Low-Rank Clone (LRC), the authors propose a method to achieve high training efficiency, which can surpass state-of-the-art models using substantially fewer training data and resources.
Innovations of Low-Rank Clone
The LRC method tackles three major limitations observed in traditional SLM training approaches:
- Information Loss in Hard Pruning: Traditional methods prune teacher weights outright (hard pruning), discarding informative activations and knowledge embedded in the teacher model. LRC avoids this through a soft pruning mechanism built on trainable low-rank projection matrices (illustrated in the sketch after this list).
- Inefficient Representation Alignment: Existing distillation techniques typically insert extra alignment modules to map student activations into the teacher's representation space, adding parameters and reducing training efficiency. LRC aligns activations and weights between teacher and student directly, without these added modules.
- Underutilization of Informative Activations: Objectives that focus on attention scores overlook the high-dimensional activations of the feed-forward networks (FFNs), which carry much of the teacher's knowledge. LRC makes full use of activation signals from both the attention and FFN layers.
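To make the contrast between hard and soft pruning concrete, the following PyTorch sketch compares dropping teacher rows outright with generating student weights through a learnable low-rank projection. The shapes, the row-norm selection rule, and the projection matrix P are illustrative assumptions, not details taken from the paper.

```python
# Illustrative sketch (not from the paper): hard pruning vs. low-rank
# "soft pruning" of a teacher weight matrix. All shapes are hypothetical.
import torch

d_teacher, d_student, d_in = 4096, 1024, 4096
W_teacher = torch.randn(d_teacher, d_in)

# Hard pruning: keep only the top-d_student rows by norm, discard the rest.
keep = W_teacher.norm(dim=1).topk(d_student).indices
W_hard = W_teacher[keep]              # (d_student, d_in); dropped rows are lost for good

# Soft pruning via a learnable low-rank projection: every teacher row can
# contribute to every student row, so no pathway is discarded outright.
P = torch.nn.Parameter(torch.randn(d_student, d_teacher) / d_teacher ** 0.5)
W_soft = P @ W_teacher                # (d_student, d_in); a learned mixture of teacher rows
```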
Methodology
The core of LRC's methodology revolves around two primary components:
- Low-Rank Projection: Trainable low-rank projection matrices compress the teacher's weights into the student's lower-dimensional parameter space. Because the student's weights are generated directly from the teacher's through these projections, information loss from the teacher's original weights is kept to a minimum.
- Activation Clone: LRC aligns the student's internal activations with the teacher's, including the informative FFN signals, so that the student closely mimics the teacher's behavior and the knowledge transfer remains high-fidelity (a hedged loss sketch follows this list).
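The function below is a minimal sketch of what an activation-clone loss could look like, assuming teacher activations are mapped into the student's hidden width by per-signal low-rank projections. The signal names, shapes, and the use of plain MSE are assumptions for illustration, not the paper's exact formulation.

```python
# Hedged sketch of an activation-clone style loss; names and shapes are
# illustrative assumptions, not the paper's exact objective.
import torch.nn.functional as F

def activation_clone_loss(student_acts, teacher_acts, projections):
    """Sum of MSE losses between student activations and teacher activations
    projected into the student's hidden width.

    student_acts / teacher_acts: dicts keyed by signal name (e.g. 'attn_out',
        'ffn_hidden') with tensors of shape (batch, seq, d_student) and
        (batch, seq, d_teacher) respectively.
    projections: dict of (d_student, d_teacher) matrices, one per signal.
    """
    loss = 0.0
    for name, s_act in student_acts.items():
        t_proj = teacher_acts[name] @ projections[name].T  # teacher width -> student width
        loss = loss + F.mse_loss(s_act, t_proj)
    return loss
```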
The effective combination of these components yields an alignment-free design that significantly simplifies the distillation process without sacrificing the quality of knowledge transfer.
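As a rough illustration of how the pieces could fit together, the sketch below combines a logit-level distillation term with the activation_clone_loss defined above. The loss weights, the temperature, and the presence of a KL term on logits are assumptions rather than claims about the paper's exact objective; because the student's weights are generated from the teacher via the low-rank projections, optimizing such a combined loss updates those projections directly, with no separate alignment modules.

```python
# Hedged sketch of a combined distillation objective (loss weights, the
# temperature, and the KL term on logits are illustrative assumptions).
import torch.nn.functional as F

def lrc_objective(student_logits, teacher_logits, student_acts, teacher_acts,
                  projections, alpha=1.0, beta=1.0, temperature=2.0):
    # Logit-level knowledge distillation (soft-label KL divergence).
    kd = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2
    # Activation cloning term from the sketch above.
    clone = activation_clone_loss(student_acts, teacher_acts, projections)
    return alpha * kd + beta * clone
```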
Empirical Results
Extensive experiments show that LRC matches or exceeds the performance of comparable SLM approaches while training on far fewer tokens: roughly 20B, versus the trillions used by leading open-source models, an efficiency gain of more than 1,000x. LRC-trained students were evaluated against strong baselines such as Qwen2.5 and Llama3.2, demonstrating rapid and resource-efficient training.
The paper also reports results across benchmarks covering logical reasoning, commonsense understanding, and world knowledge. LRC performs consistently well across these categories, reflecting its robustness and versatility across diverse NLP tasks.
Implications and Future Directions
LRC represents a meaningful advance in reducing the resource requirements of language model training without sacrificing model quality. Practically, it can make high-performing small models far cheaper to produce, enabling deployment in environments where computational resources are limited, such as edge devices.
Theoretically, the principles underpinning LRC may stimulate further research into finer-grained approaches to model compression and distillation. Future work could include larger-scale evaluations to probe the upper bounds of LRC's efficiency, as well as refinements that extend its applicability across varied model architectures.
In summary, Low-Rank Clone is a compelling strategy for more efficient SLM training, achieving strong performance at a fraction of the usual training cost and paving the way for broader use of capable language models across different sectors.