
CanKD: Cross-Attention Non-local KD

Updated 29 November 2025
  • The paper introduces the CanKD framework that leverages pixel-wise cross-attention between teacher and student features for non-local information transfer.
  • It integrates a specialized Can block with a normalized feature matching loss to enhance object detection and semantic segmentation accuracy.
  • Empirical results demonstrate competitive performance gains across various architectures and tasks, confirming the framework's effectiveness.

Cross-Attention-based Non-local Knowledge Distillation (CanKD) is a feature-based knowledge distillation framework for vision tasks that introduces a non-local cross-attention mechanism between teacher and student feature representations. In contrast to traditional self-attention-based distillation frameworks, CanKD explicitly relates every pixel in the student network’s intermediate feature map to all pixels in the teacher’s feature map. This approach enables non-local information transfer and leverages richer pixel-wise dependencies, enhancing the student’s representation quality in object detection and semantic segmentation scenarios. The core contribution of CanKD is the integration of a Cross-Attention Non-local block (“Can” block) and a normalized feature-matching objective, providing competitive performance gains with minimal architectural disruption (Sun et al., 26 Nov 2025).

1. Theoretical Foundations and Motivation

Knowledge Distillation (KD) aims to transfer the representation or predictive capabilities of a large, high-performance teacher network to a compact student model, typically by aligning outputs (logits), intermediate features, or attention maps. Traditional feature-based KD methods often utilize self-attention or local feature alignment, where student and teacher feature maps are compared or regularized independently at corresponding spatial locations. However, these approaches may ignore non-local contextual cues and long-range pixel relationships present in the teacher's representation.

CanKD builds upon the non-local neural network operations introduced by Wang et al. (2018), extending them from intra-model (self) attention to inter-model (cross) attention. Each student feature position (pixel) attends to all teacher positions, allowing the capture of long-range pixel dependencies and facilitating a more holistic knowledge transfer. Unlike methods that use static or teacher-driven feature selection (e.g., ACAM-KD (Lan et al., 8 Mar 2025)), CanKD's cross-attention enables dynamic, pixel-wise interactions between the teacher and student feature maps.

2. Architectural Principles and Can Block Mechanism

The CanKD framework is inserted between the backbone and head of the student network (e.g., between FPN and detection head or encoder-decoder and segmentation head). Key components include:

  • Feature Sequence Conversion: For a chosen intermediate layer, the student feature map $F_S \in \mathbb{R}^{H \times W \times C}$ and the teacher feature map $F_T \in \mathbb{R}^{H \times W \times C}$ are reshaped into sequences $X = [x_i]_{i=1 \dots N}$ (student) and $Y = [y_j]_{j=1 \dots N}$ (teacher), with $N = H \cdot W$.
  • Cross-Attention Non-local Operation (sketched in code after this list):
    • Queries: $\theta(x_i) = W_\theta x_i$; Keys: $\phi(y_j) = W_\phi y_j$; Values: $g(y_j) = W_g y_j$, with $W_\theta, W_\phi, W_g \in \mathbb{R}^{C' \times C}$ and $C' \leq C$ (typically $C' = C/2$).
    • Affinity: $\alpha_{i,j} = \theta(x_i)^\top \phi(y_j)$ (dot product, no softmax).
    • Aggregation: $z_i = \frac{1}{N} \sum_{j=1}^N \alpha_{i,j}\, g(y_j)$, giving $Z = [z_i]_{i=1 \dots N} \in \mathbb{R}^{N \times C'}$.
    • Output: $F_S^* = \mathrm{reshape}(W_Z Z) + F_S$, with $W_Z \in \mathbb{R}^{C \times C'}$ and a residual connection back onto the student features.
  • Projection and Down-sampling: Optional spatial down-sampling (e.g., $2\times$ max-pooling) of the teacher keys and values shrinks the affinity matrix from $N \times N$ to $N \times N/4$, reducing the cost of the attention computation accordingly.
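The following is a minimal PyTorch sketch of a Can block consistent with the description above. The class name `CanBlock`, the argument names, and the default 2× max-pooling of the teacher keys/values are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch of the Cross-Attention Non-local (Can) block, assuming the
# formulation above: student pixels act as queries, teacher pixels as keys/values,
# dot-product affinity without softmax, averaging, and a residual connection.
import torch
import torch.nn as nn


class CanBlock(nn.Module):
    def __init__(self, channels: int, reduction: int = 2, pool_teacher: bool = True):
        super().__init__()
        inner = channels // reduction                            # C' = C / 2 by default
        self.theta = nn.Conv2d(channels, inner, kernel_size=1)  # queries from student
        self.phi = nn.Conv2d(channels, inner, kernel_size=1)    # keys from teacher
        self.g = nn.Conv2d(channels, inner, kernel_size=1)      # values from teacher
        self.out = nn.Conv2d(inner, channels, kernel_size=1)    # W_Z projection
        self.pool = nn.MaxPool2d(2) if pool_teacher else nn.Identity()

    def forward(self, f_s: torch.Tensor, f_t: torch.Tensor) -> torch.Tensor:
        b, c, h, w = f_s.shape
        q = self.theta(f_s).flatten(2).transpose(1, 2)          # (B, N, C')
        k = self.phi(self.pool(f_t)).flatten(2)                 # (B, C', M), M <= N
        v = self.g(self.pool(f_t)).flatten(2).transpose(1, 2)   # (B, M, C')

        # Dot-product affinity (no softmax), averaged over teacher positions.
        affinity = torch.bmm(q, k)                              # (B, N, M)
        z = torch.bmm(affinity, v) / affinity.shape[-1]         # (B, N, C')

        z = z.transpose(1, 2).reshape(b, -1, h, w)              # back to (B, C', H, W)
        return f_s + self.out(z)                                # residual: F_S* = W_Z Z + F_S


if __name__ == "__main__":
    block = CanBlock(channels=256)
    student_feat = torch.randn(2, 256, 32, 32)
    teacher_feat = torch.randn(2, 256, 32, 32)
    print(block(student_feat, teacher_feat).shape)              # torch.Size([2, 256, 32, 32])
```

Because the block ends in a residual addition, it can be dropped between the student's backbone/FPN and its head without altering feature shapes.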

3. Loss Formulation and Optimization

CanKD introduces an additional feature-based distillation loss atop the standard task loss:

  • Instance-normalized Feature Matching: For each sample, instance normalization $\Omega(\cdot)$ is applied to the features before matching:

$$L_\text{feat} = \|\Omega(F_T) - \Omega(F_S^*)\|_2^2$$

  • Joint Objective: Student parameters are optimized via

$$L_\text{student} = L_\text{task} + \mu L_\text{feat}$$

where $L_\text{task}$ is the canonical task loss (e.g., the detection or segmentation loss) and $\mu$ is a balancing hyperparameter ($\mu = 5$ for detection, $\mu = 10$ for segmentation). A minimal sketch of this objective follows the list.

  • Optimization Protocol: Training employs SGD (momentum 0.9, weight decay $1 \times 10^{-4}$) with the following schedules (an illustrative config sketch also appears below):
    • Detection (COCO): lr = 0.005, 2× MMDetection schedule (≈ 24 epochs).
    • Segmentation (Cityscapes): lr = 0.01, 80k iterations in MMSegmentation.
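The joint objective can be sketched in PyTorch as follows. The function names are hypothetical, and whether the squared differences are summed or averaged is an implementation detail not specified above (only the mu values are stated).

```python
# Sketch of the CanKD objective: instance-normalized L2 feature matching between
# teacher features F_T and Can-enhanced student features F_S*, plus the task loss.
import torch
import torch.nn.functional as F


def cankd_feature_loss(f_t: torch.Tensor, f_s_star: torch.Tensor) -> torch.Tensor:
    """L_feat = || Omega(F_T) - Omega(F_S*) ||_2^2 with per-sample instance norm."""
    f_t_norm = F.instance_norm(f_t)           # Omega(F_T)
    f_s_norm = F.instance_norm(f_s_star)      # Omega(F_S*)
    # Summed squared error; a mean is equivalent up to rescaling mu.
    return (f_t_norm - f_s_norm).pow(2).sum()


def cankd_total_loss(task_loss: torch.Tensor,
                     f_t: torch.Tensor,
                     f_s_star: torch.Tensor,
                     mu: float = 5.0) -> torch.Tensor:
    """L_student = L_task + mu * L_feat (mu = 5 for detection, 10 for segmentation)."""
    return task_loss + mu * cankd_feature_loss(f_t, f_s_star)
```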
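For the detection setting, the stated protocol could be expressed in an MMDetection 2.x-style config roughly as below; the exact layout and step epochs are assumptions for illustration, not the paper's released settings.

```python
# Illustrative MMDetection-style optimizer/schedule config matching the text:
# SGD with momentum 0.9, weight decay 1e-4, lr 0.005, 2x (~24-epoch) schedule.
optimizer = dict(type='SGD', lr=0.005, momentum=0.9, weight_decay=1e-4)
optimizer_config = dict(grad_clip=None)
lr_config = dict(policy='step', step=[16, 22])       # typical "2x" step decay (assumed)
runner = dict(type='EpochBasedRunner', max_epochs=24)
```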

4. Empirical Performance and Comparative Benchmarks

CanKD’s effectiveness has been demonstrated across various tasks, architectures, and datasets:

Object Detection (COCO val2017, RetinaNet-R50 Student)

Method                  Distillation signal   AP    ΔAP
FGD [Yang et al.]       feature               39.6  +2.2
DiffKD [Huang et al.]   feature + logit       39.7  +2.3
CanKD (ours)            feature               39.7  +2.3
CanKD† (better init)    feature               39.8  +2.4

Similar results are observed for Faster R-CNN, FCOS, and RepPoints. With a more powerful teacher (ResNeXt-101), CanKD reaches 41.0–42.4 AP, matching or surpassing prior hybrid and feature-based KD baselines.

Heterogeneous Knowledge Transfer

Student: RetinaNet-R50; Teacher: FCOS-X101

Method   AP    ΔAP
PKD      40.3  +2.9
DetKDS   40.4  +3.0
CanKD    40.5  +3.1

Semantic Segmentation (Cityscapes, PSPNet-R18 Backbone)

Method   mIoU (%)
CWD      75.54
MGD      75.84
CanKD    76.24

Consistent gains (+0.4 to +1.8 mIoU) are obtained for DeepLabV3-R18 and MobileNetV2 backbones.

Foundation Models and Large-scale Datasets

  • Dino-R50 (teacher Dino-R101): +0.7 AP over baseline
  • GroundingDino-R50: +0.9 AP
  • Object365v2 RetinaNet-R50: 18.4 AP (+1.7 over baseline)

5. Ablation and Qualitative Analysis

Ablation studies elucidate key aspects:

  • Can Block and Instance-norm: Joint usage yields maximal AP improvements (baseline: 38.6, L2 loss only: 41.4, +instance-norm: 41.7, full CanKD: 42.4).
  • Affinity Function: Dot-product affinity (no softmax, as in CanKD) outperforms Gaussian and embedded Gaussian.
  • Loss Weight: $\mu = 5$ yields the best detection results (AP = 42.4); segmentation prefers $\mu = 10$.
  • Qualitative Visualizations: Student activation heatmaps post-CanKD more closely match teacher peak responses. Segmentation boundaries are crisper, and misclassifications are reduced.

6. Relation to Other Non-local and Attention-guided KD Methods

CanKD distinguishes itself from concurrent approaches such as ACAM-KD (Lan et al., 8 Mar 2025), which also employs cross-attention between student and teacher but combines it with adaptive spatial-channel masking and teacher-driven importance selection. ACAM-KD fuses features ($F_T$, $F_S$) via cross-attention for mask generation, whereas CanKD injects a cross-attention block directly into the student backbone, producing enhanced features and enforcing an instance-normalized $\ell_2$ loss.

A plausible implication is that CanKD's direct pixel-wise cross-attention is less dependent on static, teacher-driven selection biases and, by leveraging information from all teacher pixels, captures richer non-local dependencies.

7. Limitations and Prospects

CanKD’s computational footprint is moderate, with extra FLOPs only from lightweight 1×1 convolution layers and possible spatial pooling to reduce quadratic complexity. The approach generalizes across detector and segmenter backbones, and is compatible with foundation models. However, as with non-local blocks generally, scalability to ultra-high-resolution features may require further engineering or sparsity.

The framework provides a new paradigm for feature-based knowledge distillation, setting a precedent for future attention-guided KD work seeking non-local, cross-model pixel relationships. Extensions could explore multi-head attention generalizations, integration with adaptive masking, and application to domains beyond vision.
