In-Context Knowledge Distillation

Updated 3 January 2026
  • In-Context Knowledge Distillation is a framework that trains student models using complete context blocks—demonstrations plus queries—to compress and transfer nuanced reasoning abilities.
  • IC-KD employs diverse methodologies, including self-distillation, meta in-context tuning, and feature-space alignment, to achieve significant gains in few-shot learning and model compression.
  • Empirical and theoretical analyses show that IC-KD enhances robustness and efficiency by aligning teacher outputs with context-dependent student learning, reducing computational costs and latency.

In-Context Knowledge Distillation (IC-KD) denotes a family of training and theoretical frameworks that generalize classical knowledge distillation to settings where the student model learns not only from teacher outputs on isolated samples, but also from the structured, context-dependent phenomena that arise from processing sets of "in-context" examples. IC-KD methods have been proposed for both NLP and vision settings, encompassing distillation at training time (to compress in-context learning ability into smaller models), at inference time (as a perspective on attention and parameter adaptation), and across retrieved neighborhoods in representation space. Recent instantiations of IC-KD have demonstrated substantial empirical gains in few-shot generalization and robustness, while also yielding theoretical insights into the mechanisms underlying in-context learning in LLMs.

1. Formal Definitions and Distillation Objectives

IC-KD generalizes standard teacher-student knowledge distillation by treating the entire in-context data block—demonstrations plus queries—as the distillation unit. Let $\mathcal{T}$ be the teacher and $\mathcal{S}_\theta$ the student. Given a pool $\mathcal{D} = \{d_1,\ldots,d_n\}$ of demonstrations and a query $x$, the teacher output $r = g(f_\mathcal{T}(\mathcal{D}, x))$ (where $g$ extracts the rationale and answer) is used as the target. The student, presented with $k \leq n$ demonstrations $\mathbf{d} = (d_{i_1}, \ldots, d_{i_k})$ concatenated with $x$ (the prefix), is trained to maximize the sequence-level likelihood of $r$:

$$\mathcal{L}_\mathrm{KD} = -\frac{1}{N}\sum_{j=1}^{N} \sum_{i=1}^{L^{(j)}} \log S_\theta\big(r^{(j)}_i \mid \mathrm{pre}^{(j)},\, r^{(j)}_{<i}\big),$$

where $S_\theta$ is the student token distribution. Alternative IC-KD objectives augment this form with additional soft-label (teacher-driven) or hard-label (ground-truth) losses, and may be extended to include language modeling or representation-level objectives (Wang et al., 2024, Huang et al., 2022, Duan et al., 2024, Zhu et al., 13 Jan 2025).
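
The objective above reduces to a token-level cross-entropy restricted to the teacher's rationale-and-answer tokens. Below is a minimal PyTorch sketch, assuming a Hugging Face-style causal LM interface (an object with a `.logits` output) and a pre-tokenized prefix/response split; it is an illustration, not the released code of any cited method.

```python
import torch
import torch.nn.functional as F

def ickd_sequence_loss(student, prefix_ids, response_ids):
    """Sequence-level KD loss: negative log-likelihood of the teacher's
    response r (rationale + answer) under the student, conditioned on the
    k-shot prefix. `student` is assumed to be a causal LM whose forward
    returns an object with `.logits` of shape [B, T, V]."""
    # Concatenate the prefix (demonstrations + query) with the teacher response.
    input_ids = torch.cat([prefix_ids, response_ids], dim=1)   # [B, T]
    logits = student(input_ids).logits                         # [B, T, V]

    # Predict token t+1 from positions <= t.
    shift_logits = logits[:, :-1, :]
    shift_labels = input_ids[:, 1:].clone()

    # Mask out prefix positions so only response tokens contribute.
    prefix_len = prefix_ids.size(1)
    shift_labels[:, : prefix_len - 1] = -100                   # ignore index

    return F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
        ignore_index=-100,
    )
```

Averaging this loss over a batch of (prefix, teacher response) pairs recovers $\mathcal{L}_\mathrm{KD}$ up to per-sequence length normalization.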

2. Methodologies of IC-KD: Pipeline and Variations

IC-KD implementations fall into several categories:

  • Self-distillation with "heavily prompted" teachers: Methods such as SeCoKD use an 8-shot teacher prompt to generate rationales and answers, then fine-tune the student on k-shot (often $k=1$) prefixes by matching the teacher's output sequence at the token level. This compresses multi-example in-context reasoning into the student's parameters, enabling few-shot or even zero-shot generalization (Wang et al., 2024); a minimal sketch of this construction appears after this list.
  • Meta In-Context Tuning (Meta-ICT) and Multitask-ICT: IC-KD can be used in a meta-training regime over many tasks (Meta-ICT, training only on in-context inputs) or in an adaptation regime directly on the target few-shot tasks (Multitask-ICT, fine-tuning with all in-context loss components). Both regimes combine in-context learning objectives and language modeling objectives, often via an interpolated loss with teacher soft-labels and ground truth (Huang et al., 2022).
  • Feature-space in-context distillation: In vision settings, IC-KD incorporates relationships across in-context samples (neighbors in feature space) with positive and negative distillation terms—Positive In-Context Distillation (PICD) aligns the student outputs with those of same-class teacher neighbors, while Negative In-Context Distillation (NICD) enforces separation from different-class neighbors (Zhu et al., 13 Jan 2025).
  • Context distillation for efficiency: Typically, the distillation process involves encoding the teacher’s dependency on in-context examples into the student during training. At inference time, the student can operate without the original context, significantly reducing latency and memory requirements while preserving in-context generalization (Duan et al., 2024).
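
To make the self-distillation pipeline in the first bullet concrete, the sketch below constructs the (prefix, target) pairs consumed by the loss in Section 1: the teacher is prompted with 8 demonstrations, its rationale-and-answer output becomes the target, and the student prefix keeps only $k$ of those demonstrations. The `teacher_generate` interface, prompt format, and demonstration fields are hypothetical stand-ins, not SeCoKD's exact implementation.

```python
import random

def build_ickd_pairs(teacher_generate, demos, queries, k_teacher=8, k_student=1):
    """Construct (student_prefix, teacher_response) training pairs.

    teacher_generate: callable(prompt: str) -> str, e.g. a wrapper around a
        large model's text generation (hypothetical interface).
    demos: list of dicts with 'question' and 'answer' fields (assumed format).
    queries: list of question strings to distill on.
    """
    pairs = []
    for query in queries:
        sampled = random.sample(demos, k_teacher)

        # Teacher sees the full 8-shot context and produces rationale + answer.
        teacher_prompt = "\n\n".join(
            f"Q: {d['question']}\nA: {d['answer']}" for d in sampled
        ) + f"\n\nQ: {query}\nA:"
        teacher_response = teacher_generate(teacher_prompt)

        # Student prefix keeps only k_student of the same demonstrations, so
        # matching the teacher's response compresses the dropped context.
        student_prefix = "\n\n".join(
            f"Q: {d['question']}\nA: {d['answer']}" for d in sampled[:k_student]
        ) + f"\n\nQ: {query}\nA:"
        pairs.append((student_prefix, teacher_response))
    return pairs
```

Tokenizing each pair and plugging it into the sequence-level loss from Section 1 completes the training loop.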

3. Theoretical Analyses: Attention as Distillation and Generalization Bounds

IC-KD offers a theoretical unification of several perspectives on in-context learning:

  • Attention as implicit KD: A single softmax-attention pass over a prompt $X = [X_D, X_Q]$ implements a weight initialization $W_0$ determined by the demonstrations $X_D$, which can be viewed as a one-step knowledge distillation from the teacher mapping $f_T(x; \theta_\mathrm{LLM}) = W^V x$ into a student $f_S(\cdot\,; W) = W\,\varphi(W^K x)$. The corresponding loss

$$\mathcal{L}_\mathrm{KD}(W) = \frac{1}{N} \sum_{i=1}^{N} \big\| W\,\varphi(W^K x_i) - W^V x_i \big\|_2^2$$

is minimized by $W_0$, matching the structure of the reference model instantiated by the prompt (Li et al., 13 Jun 2025); the closed-form minimizer is given after this list.

  • Generalization via Rademacher-complexity bounds: The excess distillation error on the true target distribution can be bounded by the empirical KD loss on the prompt plus a model-capacity-dependent term scaling as $1/\sqrt{N}$, where $N$ is the number of demonstrations. This quantifies the impact of prompt length and weight norms on the generalization of the implicit KD (Li et al., 13 Jun 2025).
  • Bias from prompt-target distribution mismatch: The bias in the distilled weights grows linearly with the maximum mean discrepancy (MMD) between the prompt and target distributions. Thus, prompt demonstration selection directly modulates distillation fidelity; well-aligned prompts minimize KD error (Li et al., 13 Jun 2025).
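
Since the features $\varphi(W^K x_i)$ and targets $W^V x_i$ are fixed during this implicit distillation step, the quadratic loss above admits the standard least-squares solution. The closed form below is a restatement of ordinary linear regression under the assumption that the feature Gram matrix is invertible, not a result specific to (Li et al., 13 Jun 2025):

$$W_0 = \Big(\sum_{i=1}^{N} W^V x_i\, \varphi(W^K x_i)^{\top}\Big)\Big(\sum_{i=1}^{N} \varphi(W^K x_i)\, \varphi(W^K x_i)^{\top}\Big)^{-1}.$$

Under this view, the bias described in the last bullet arises because demonstrations drawn far from the target distribution shift both sums, and hence $W_0$, away from the target-optimal weights.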

4. Empirical Results and Benchmark Comparisons

IC-KD outperforms pointwise supervised fine-tuning (SFT) and vanilla KD across a diverse set of reasoning, classification, and representation learning benchmarks.

  • Few-shot reasoning (SeCoKD) on Llama 3-8B:

| Method   | 0-shot | 1-shot |
|----------|--------|--------|
| Base     | 40%    | 60%    |
| SFT      | 65%    | 70%    |
| SeCoKD-S | 85%    | 78%    |
| SeCoKD-M | 86%    | 79%    |

Absolute gains over the base model are +45 percentage points (pp) in 0-shot and +18 pp in 1-shot; over SFT, the gains are +20 pp and +8 pp, respectively. Notably, both SeCoKD-S and SeCoKD-M reach near-optimal performance with even a single demonstration, underscoring the efficiency of context compression via KD (Wang et al., 2024).

  • Robustness and improvement score (IS):

SeCoKD achieves higher IS values (more queries become "easy") than SFT, e.g., IS = 3.56 on GSM8K (SeCoKD) vs. 1.08 (SFT) (Wang et al., 2024).

  • Cross-task transfer: IC-KD yields robust improvement when evaluating on tasks unseen during fine-tuning, in contrast to SFT, which often degrades generalization (Wang et al., 2024).
  • Vision: CIFAR-100 and ImageNet: IC-KD surpasses representative contrastive and offline KD baselines by 0.1–1.9 pp across homogeneous and heterogeneous teacher-student pairs; similar trends hold for semantic segmentation (Cityscapes) (Zhu et al., 13 Jan 2025).
  • Model compression and efficiency: In context distillation for NLI, compressing OPT-1.3B into OPT-125M reduces model size from 2.5 GB to 0.25 GB and memory footprint by approximately 60%, while boosting out-of-domain accuracy by 50% over ICL and 20% over conventional pattern-based fine-tuning. Training time is reduced by more than 8× relative to pattern-based fine-tuning as context length increases (Duan et al., 2024).

5. Algorithmic and Practical Considerations

IC-KD encompasses diverse strategies:

  • Loss mixing and hyperparameters: In NLP, combining teacher-driven cross-entropy with hard supervision and an auxiliary LM loss yields maximal transfer, especially under multitask adaptation. Loss interpolation parameters ($\alpha$, $\beta$) and temperature scaling affect the balance of fit versus regularization (Huang et al., 2022, Duan et al., 2024); a loss-mixing sketch appears after this list.
  • Feature memory bank and retrieval: In vision applications, a memory bank storing teacher features enables retrieval of in-context samples for each query, supporting both offline and online KD variants. Positive/negative sample balancing, similarity normalization, and trade-off parameters impact efficacy (Zhu et al., 13 Jan 2025).
  • Optimization: LoRA low-rank adaptation (rank 32, $\alpha = 64$, dropout 0.05) is leveraged for parameter-efficient fine-tuning in SeCoKD; a configuration sketch appears after this list. Student optimization typically uses AdamW with task-dependent learning rates and batch sizes (Wang et al., 2024, Duan et al., 2024).
  • Demonstration selection: Prompt-target alignment (minimizing MMD) directly controls initialization bias. Automated retrieval in feature space and prompt (re-)weighting enable improved deterministic selection versus random sampling (Li et al., 13 Jun 2025); an MMD-based selection sketch appears after this list.
  • Context scaling: Empirically, context distillation approaches (e.g., SeCoKD and IC-KD) achieve robust performance up to moderate numbers of demonstrations (e.g., $k=4$), with diminishing returns for longer prompts and little additional compute overhead (Wang et al., 2024, Duan et al., 2024).
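
As referenced in the loss-mixing bullet, a common instantiation interpolates soft-label (teacher), hard-label (ground-truth), and language-modeling terms. The sketch below shows one such combination with temperature scaling; the specific weighting scheme is illustrative, not the exact objective of (Huang et al., 2022) or (Duan et al., 2024).

```python
import torch.nn.functional as F

def mixed_ickd_loss(student_logits, teacher_logits, labels, lm_loss,
                    alpha=0.5, beta=0.1, temperature=2.0):
    """Interpolated IC-KD objective (illustrative weighting):
    alpha * soft-label KD + (1 - alpha) * hard-label CE + beta * auxiliary LM loss.
    Logits are [N, V]; labels are [N]."""
    # Soft-label term: KL between temperature-scaled teacher and student distributions.
    kd = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2

    # Hard-label term: standard cross-entropy against ground-truth labels.
    ce = F.cross_entropy(student_logits, labels)

    return alpha * kd + (1.0 - alpha) * ce + beta * lm_loss
```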
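The LoRA hyperparameters in the optimization bullet map directly onto a parameter-efficient fine-tuning configuration. A sketch using the Hugging Face `peft` library is shown below; the base checkpoint and target modules are assumptions, since the bullet only specifies rank, $\alpha$, and dropout.

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Base student checkpoint is illustrative; Section 4 reports SeCoKD on Llama 3-8B.
student = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B")

lora_config = LoraConfig(
    r=32,                                  # rank reported above
    lora_alpha=64,                         # LoRA scaling alpha
    lora_dropout=0.05,                     # dropout on the LoRA branch
    target_modules=["q_proj", "v_proj"],   # assumed; attention projections are a common choice
    task_type="CAUSAL_LM",
)

student = get_peft_model(student, lora_config)
student.print_trainable_parameters()       # only the low-rank adapters are updated
```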
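The demonstration-selection bullet states that minimizing the MMD between prompt and target distributions controls initialization bias. The sketch below scores candidate demonstration sets by an RBF-kernel MMD against held-out target embeddings and keeps the best set; the embedding function and candidate pools are placeholders, not a procedure prescribed by the cited work.

```python
import torch

def rbf_mmd2(x, y, sigma=1.0):
    """Squared MMD between samples x [n, d] and y [m, d] under an RBF kernel.
    Uses the simple biased estimator, which is adequate for ranking candidates."""
    def k(a, b):
        d2 = torch.cdist(a, b) ** 2
        return torch.exp(-d2 / (2 * sigma ** 2))
    return k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean()

def select_demonstrations(candidate_sets, embed, target_embeddings):
    """Pick the candidate demonstration set whose embeddings are closest
    (in MMD) to the target distribution's embeddings.

    candidate_sets: list of lists of demonstration strings.
    embed: callable(list[str]) -> tensor [n, d] (placeholder encoder).
    target_embeddings: tensor [m, d] of embedded target-domain inputs.
    """
    scores = [rbf_mmd2(embed(s), target_embeddings) for s in candidate_sets]
    best = min(range(len(scores)), key=lambda i: scores[i])
    return candidate_sets[best], scores[best]
```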

6. Theoretical and Practical Unification; Limitations

IC-KD reconciles several strands of analysis:

  • Unification: Viewing in-context learning as inference-time KD recovers both gradient-based parameter adaptation and distributional shift (prompt-target match) frameworks, with generalization and bias rigorously controlled via Rademacher-complexity and kernel MMD bounds (Li et al., 13 Jun 2025).
  • Limitations: IC-KD requires teacher outputs on in-context batches, incurring computational overhead; memory-bank construction in vision settings can be nontrivial. Studies to date focus on models under 10B parameters and on reasoning or classification tasks; extensions to generative and multilingual modeling remain open. Only self-distillation and offline teacher paradigms have been systematically evaluated, and statistical significance of gains is not universally reported (Wang et al., 2024, Duan et al., 2024, Zhu et al., 13 Jan 2025).
  • Extensibility: The principles underlying IC-KD—compression of context dependence, robust smoothing via retrieved neighborhoods, and prompt-aligned loss—are applicable across architectures, modalities, and tasks. Methods such as combining intermediate representation matching with context-aware objectives, curricula over $k$, and dynamic demonstration retrieval are active research directions (Huang et al., 2022, Li et al., 13 Jun 2025).

7. Impact and Future Directions

IC-KD introduces a modality-agnostic paradigm for encoding and transferring task-specific, context-based knowledge from large, context-sensitive models to smaller, efficient, or otherwise constrained models. For language, IC-KD bridges the empirical gap between explicit prompt-based adaptation and parameter fine-tuning, producing models that are both few-shot competent and robust to distribution shift. In vision, IC-KD generalizes KD by label smoothing with teacher-driven neighborhoods in feature space, outperforming strong pointwise and contrastive methods.

A plausible implication is that, as fast retrieval, adaptive prompting, and resource-efficient inference continue to gain importance, IC-KD frameworks that unify prompt-based and parametric adaptation will become a standard tool for model compression, on-device learning, and robust few-shot inference.

Key references include (Wang et al., 2024, Li et al., 13 Jun 2025, Huang et al., 2022, Duan et al., 2024, Zhu et al., 13 Jan 2025).
