Graph-on-Logits Distillation

Updated 16 April 2026

Graph-on-Logits Distillation is a method that transfers structured knowledge by aligning the relational structure encoded in model logits using graph constructions.
It employs techniques like Proxy Relational Graph Distillation and Gromov–Wasserstein alignment to improve teacher-student consistency across vision, language, and graph tasks.
Practical applications include zero-shot classification, language model fusion, and node classification, yielding notable accuracy improvements over classical distillation.

Graph-on-Logits Distillation (GLD) refers to a collection of methods that seek to transfer structured knowledge between models by aligning or propagating the relational structure encoded in their output prediction logits, typically via explicit graph constructions. These approaches contrast with classical knowledge distillation, which often treats logits as flat vectors and matches them via standard losses (e.g., Kullback–Leibler divergence), thereby discarding or neglecting inter-output dependencies that may reflect semantic, relational, or structural information. GLD operates across multiple domains, including vision, language, and knowledge graphs, and is instantiated by diverse mechanisms such as sample–class relational graphs, co-activation graphs, propagation-augmented logit smoothing, and selective co-distillation. The central objective is to enhance the fidelity of the student’s internal representation of relationships, leading to improved performance in challenging distillation and fusion regimes.

1. Core Principles and Problem Motivation

Graph-on-Logits Distillation arises from the need to transfer higher-order semantic structure, not just pointwise output probabilities, between a teacher and a student. When distilling from large models, such as large foundation models (LFMs), graph neural networks (GNNs), or pre-trained language models (PLMs), naive matching of output logits can propagate task-irrelevant noise or fail to capture the complex dependencies among output classes or tokens. For example, in zero-shot classification with CLIP, the feature space is dense and contains substantial domain-irrelevant variation; in LLM fusion, token-wise output matching neglects cross-token co-activation; in semi-supervised node classification, traditional distillation ignores graph topology.

GLD frameworks thus build explicit graphs over logits or feature-logit composites—encoding samples, output targets, or vocabulary tokens as nodes, and representing meaningful structural or relational information as edges. Distillation then targets graph-level or relation-preserving alignment, generally aiming to filter out nuisance information and enforce task-relevant structural transfer (Xu et al., 2024, Wang et al., 20 May 2025, Shin et al., 2023, Liu et al., 2022).

2. Major Methodologies

2.1 Proxy Relational Graph Distillation

In “PRG: Prompt-Based Distillation Without Annotation via Proxy Relational Graph” (Xu et al., 2024), the Proxy Relational Graph (PRG) method addresses two limitations of relying solely on frozen LFM embeddings: the leakage of task-irrelevant knowledge and the high feature density that hinders discriminative capacity. PRG proceeds as follows:

Teacher Logit Extraction: For $c$ classes and $p$ prompt templates per class, PRG computes weighted average logits from CLIP zero-shot prompt embeddings. Given image embedding $I_f \in \mathbb{R}^d$ and text prompt encodings $T \in \mathbb{R}^{c \times p \times d}$, the final teacher logits are:

$W = \sum_{i=1}^{p} w_i W_i ,\quad \text{with}\quad w_i = \frac{\max(W_i)}{\sum_{j=1}^p \max(W_j)}$

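where $W_i = T_i \cdot I_f$ are the class logits obtained with the $i$-th prompt template, and the weights $w_i$ reflect prompt confidence: a template whose strongest class logit is larger contributes more to the average.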
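The weighting rule above can be sketched in a few lines of plain Python. This is an illustrative sketch, not the PRG implementation: the function name `prompt_weighted_logits` and the `[template][class][dim]` nesting of `text_emb` are assumptions made here for readability, and the code assumes the per-template maxima sum to a positive value.

```python
def prompt_weighted_logits(text_emb, image_emb):
    """Sketch of confidence-weighted prompt averaging.

    text_emb[i][k] is the (hypothetical) embedding of prompt template i for
    class k; image_emb is a single image embedding of the same dimension.
    Returns one weighted logit per class.
    """
    # W_i = T_i . I_f : one logit per class for every prompt template i.
    per_prompt = [
        [sum(t * x for t, x in zip(class_vec, image_emb)) for class_vec in template]
        for template in text_emb
    ]
    # w_i = max(W_i) / sum_j max(W_j) : confidence of each template.
    peaks = [max(logits) for logits in per_prompt]
    total = sum(peaks)  # assumed positive in this sketch
    weights = [peak / total for peak in peaks]
    # W = sum_i w_i W_i
    n_classes = len(per_prompt[0])
    return [
        sum(w * logits[k] for w, logits in zip(weights, per_prompt))
        for k in range(n_classes)
    ]
```

With two templates and two classes, a template whose best class logit is twice as large receives twice the weight in the final average.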

Graph Construction: Both teacher and student form bipartite graphs in each minibatch $\mathcal{G} = (N,E)$ with sample nodes and class proxy nodes. For each sample, PRG concatenates image features with logits: $f^t_i = [I_{f,i}^\top, W_i^\top ]^\top$ for the teacher; for the student, backbone outputs are MLP-mapped and concatenated with student logits.
Class Proxy Nodes: There are $c$ learnable proxy vectors, one per class. At each iteration, every proxy is updated from the batch samples predicted as its class, using a small step size.

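Edge Computation: Edges connect samples to proxies, with each edge weighted by the Pearson correlation between the sample's node vector and the class proxy. For two vectors $u, v \in \mathbb{R}^m$ with means $\bar{u}$ and $\bar{v}$:

$\mathrm{Corr}(u, v) = \frac{\sum_{k=1}^{m} (u_k - \bar{u})(v_k - \bar{v})}{\sqrt{\sum_{k=1}^{m} (u_k - \bar{u})^2}\,\sqrt{\sum_{k=1}^{m} (v_k - \bar{v})^2}}$

Collecting these weights over the minibatch gives one sample-by-proxy adjacency matrix for the teacher and one for the student.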
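As a concrete illustration of the edge computation, the sketch below builds a sample-by-proxy adjacency matrix from Pearson correlations. The names `pearson` and `sample_proxy_adjacency` are hypothetical, and returning 0.0 for a constant vector (where the correlation is undefined) is a choice made here, not something the source specifies.

```python
import math

def pearson(u, v):
    """Pearson correlation of two equal-length vectors."""
    mean_u = sum(u) / len(u)
    mean_v = sum(v) / len(v)
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    norm_u = math.sqrt(sum((a - mean_u) ** 2 for a in u))
    norm_v = math.sqrt(sum((b - mean_v) ** 2 for b in v))
    denom = norm_u * norm_v
    # Assumption: treat the undefined constant-vector case as "no correlation".
    return cov / denom if denom > 0 else 0.0

def sample_proxy_adjacency(sample_feats, proxies):
    """Rows index batch samples, columns index class proxies."""
    return [[pearson(f, q) for q in proxies] for f in sample_feats]
```

The same function would be applied once to teacher node vectors and once to student node vectors, giving the two matrices that the edge alignment term compares.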

Alignment Objectives: Two losses guide learning:
- Node alignment (sample-node embeddings): a loss computed from the cross-correlation matrix between teacher and student batch nodes.
- Edge alignment (sample–proxy relationships): an L2 penalty on the difference between the teacher and student adjacency matrices.

The combined loss brings the node and edge terms together with empirically selected weights.

This approach explicitly aligns the structural distribution of class relationships between teacher and student, and restricts distillation to class-centric cues, suppressing domain-irrelevant correlations and mitigating the curse of high-dimensional feature spaces (Xu et al., 2024).
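The edge alignment term can be illustrated with a short sketch. The summary table later in this article describes it as an L2 comparison of adjacency matrices; the mean-squared form and the name `edge_alignment_loss` below are assumptions, since the exact reduction (sum or mean, squared or not) is not stated here.

```python
def edge_alignment_loss(adj_teacher, adj_student):
    """Mean squared difference between two sample-by-proxy adjacency matrices.

    Assumption: both matrices have the same shape (batch size x classes).
    """
    total = 0.0
    count = 0
    for row_t, row_s in zip(adj_teacher, adj_student):
        for a, b in zip(row_t, row_s):
            total += (a - b) ** 2
            count += 1
    return total / count
```

The loss is zero exactly when the student reproduces every sample-to-proxy correlation of the teacher, regardless of how the underlying feature spaces differ.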

2.2 Co-activation Graph Distillation via Gromov–Wasserstein Alignment

"InfiGFusion: Graph-on-Logits Distillation via Efficient Gromov-Wasserstein for Model Fusion" (Wang et al., 20 May 2025) generalizes GLD to large-scale model fusion in language modeling. The method constructs global co-activation graphs from the output logits:

Graph Construction: For each example, keep the top-$k$ logits at every sequence position and aggregate their outer products into a global co-activation matrix $G$:

$G = \sum_{t} z_t z_t^\top$

where $z_t$ is the sparse vector of top-$k$ logits for sequence position $t$.
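The co-activation construction described above can be sketched as follows. This is a minimal illustration under stated assumptions: the outer products are summed without normalization, the vocabulary is small enough for a dense matrix, and the names `top_k_sparse` and `co_activation_matrix` are hypothetical.

```python
def top_k_sparse(logits, k):
    """Keep the k largest logits of one position and zero out the rest."""
    keep = sorted(range(len(logits)), key=lambda j: logits[j], reverse=True)[:k]
    sparse = [0.0] * len(logits)
    for j in keep:
        sparse[j] = float(logits[j])
    return sparse

def co_activation_matrix(seq_logits, k):
    """Sum the outer products z_t z_t^T over all sequence positions t."""
    vocab = len(seq_logits[0])
    G = [[0.0] * vocab for _ in range(vocab)]
    for logits in seq_logits:
        z = top_k_sparse(logits, k)
        # Only the retained entries contribute, so iterate over those.
        active = [j for j, value in enumerate(z) if value != 0.0]
        for a in active:
            for b in active:
                G[a][b] += z[a] * z[b]
    return G
```

Because each position contributes at most $k^2$ nonzero entries, the cost per position depends on $k$ rather than on the vocabulary size.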

Graph Alignment via Gromov–Wasserstein: The core objective aligns graphs from student and teacher by minimizing a Gromov–Wasserstein distance between their co-activation matrices. An efficient $O(n \log n)$ sorting-based deterministic approximation is used:
- Compress each graph to node-level degree features, i.e., the averaged edge weights of each node.
- Sort the student and teacher degree feature vectors and match them one-to-one by rank, penalizing the difference between matched entries.
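The two steps above (degree compression, then sorted one-to-one matching) can be sketched as below. The squared difference averaged over nodes is an assumption made for illustration, as is the requirement that both graphs have the same number of nodes; `degree_features` and `sorted_degree_loss` are hypothetical names.

```python
def degree_features(G):
    """Average edge weight of each node (row mean of the co-activation matrix)."""
    n = len(G)
    return [sum(row) / n for row in G]

def sorted_degree_loss(G_student, G_teacher):
    """Match degree features by rank and average the squared differences.

    Assumption: both graphs have the same node count. Sorting dominates the
    cost of the matching step, which is O(n log n) in the number of nodes.
    """
    d_student = sorted(degree_features(G_student))
    d_teacher = sorted(degree_features(G_teacher))
    return sum((a - b) ** 2 for a, b in zip(d_student, d_teacher)) / len(d_student)
```

Since the degrees are sorted before matching, relabelling the nodes of either graph leaves the loss unchanged, which is the permutation invariance expected of a graph-level comparison.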

This structure-aware fusion loss yields improved transfer of relational dependencies (e.g., cross-token interactions), outperforming baseline logit-based and basic fusion methods across a range of reasoning, mathematics, and coding benchmarks (Wang et al., 20 May 2025).

2.3 Propagation-Embracing Logit Distillation

"Propagate & Distill" (Shin et al., 2023) investigates semi-supervised node classification, proposing to distill both feature transformation and the graph propagation phases of a GNN into a student MLP:

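Propagation Smoothing: Rather than directly inverting the propagation operator, the method recursively propagates the teacher’s logits via a $K$-step personalized PageRank smoother, resulting in a graph-structure-aware teacher target. A personalized PageRank recursion of this kind takes the form

$Z^{(k+1)} = (1 - \alpha)\,\hat{A} Z^{(k)} + \alpha Z^{(0)}, \quad k = 0, \dots, K-1$

where $Z^{(0)}$ holds the teacher's logits, $\hat{A}$ is the normalized adjacency matrix, and $\alpha$ is the restart weight.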

The student is then trained via KL divergence to these smoothed, structure-injected teacher probabilities.
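A minimal sketch of this pipeline is given below, assuming the standard personalized PageRank recursion (mix the neighbourhood average with the original teacher output at each step) followed by a softmax and a KL term. The restart weight `alpha`, the default values, and all function names are assumptions made for illustration and are not taken from the paper.

```python
import math

def propagate(adj_norm, teacher_out, alpha=0.1, steps=10):
    """K-step smoothing: Z <- (1 - alpha) * A_hat @ Z + alpha * Z_0."""
    n, c = len(teacher_out), len(teacher_out[0])
    current = [row[:] for row in teacher_out]
    for _ in range(steps):
        nxt = []
        for i in range(n):
            row = []
            for k in range(c):
                agg = sum(adj_norm[i][j] * current[j][k] for j in range(n))
                row.append((1 - alpha) * agg + alpha * teacher_out[i][k])
            nxt.append(row)
        current = nxt
    return current

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def kl_to_teacher(teacher_probs, student_logits):
    """KL(teacher || student) for one node, with a stable log-softmax."""
    m = max(student_logits)
    log_norm = m + math.log(sum(math.exp(v - m) for v in student_logits))
    return sum(
        p * (math.log(p) - (s - log_norm))
        for p, s in zip(teacher_probs, student_logits)
        if p > 0
    )
```

The student only sees node features and the smoothed targets, so the graph is needed when the targets are built but not at inference time.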

This approach affords computational efficiency and enables MLP students to mimic graph-based relational consistency, without direct access to graph structure at inference.

2.4 Co-distillation with Partial Logit Transfer in Knowledge Graph Embedding

"I Know What You Do Not Know" (Liu et al., 2022) introduces CoLE, which trains a graph embedding model (N-Former) and a text-based prompt-tuned PLM (N-BERT) jointly, distilling only the top-confidence regions of each model’s logits—a selective, graph-on-logits approach:

For each triple, both models extract the top-50% of entity logits to form teacher subsets for bidirectional KD losses.
This selectivity avoids transferring uncertain or noisy predictions across modalities and enhances link-prediction, especially in sparse or ambiguous contexts.
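The selective transfer can be sketched as a KL term restricted to the teacher's top half of entity logits, applied in both directions. This is a sketch under assumptions: renormalizing both distributions over the selected subset and summing the two directions with equal weight are choices made here, and `selective_kd` and `co_distill` are hypothetical names.

```python
import math

def log_softmax(values):
    m = max(values)
    log_norm = m + math.log(sum(math.exp(v - m) for v in values))
    return [v - log_norm for v in values]

def selective_kd(teacher_logits, student_logits, keep_ratio=0.5):
    """KL(teacher || student) over the teacher's highest-scoring entities only."""
    n_keep = max(1, int(len(teacher_logits) * keep_ratio))
    top = sorted(
        range(len(teacher_logits)),
        key=lambda j: teacher_logits[j],
        reverse=True,
    )[:n_keep]
    # Assumption: both distributions are renormalized over the kept subset.
    log_p = log_softmax([teacher_logits[j] for j in top])
    log_q = log_softmax([student_logits[j] for j in top])
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))

def co_distill(logits_a, logits_b):
    """Bidirectional loss: each model acts as teacher on its own confident region."""
    return selective_kd(logits_a, logits_b) + selective_kd(logits_b, logits_a)
```

Entities outside the teacher's top half do not enter the loss at all, so a model's low-confidence scores cannot pull the other model's predictions toward them.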

3. Algorithmic Workflow and Pseudocode Overview

GLD methods typically involve the following workflow:

For each sample or minibatch, extract teacher logits—often filtered, weighted, or associated with side information (e.g., prompt context, propagation smoothed).
Construct explicit graphs:
- Nodes: examples, classes, proxies, tokens, or output channels.
- Edges: semantic, correlation, or co-activation relations.
Compute graph-level features: degree profiles, adjacency matrices, or cross-node correlations.
Align graphs using targeted objectives: node alignment, edge correlation matching, Gromov–Wasserstein, or partial KL divergence.
Combine GLD with standard KD or supervised losses.
Optimize over student (and sometimes proxy) parameters, freezing or partially updating teacher/proxy representations.

The following table summarizes representative graph-on-logits instantiations:

Method	Graph Nodes	Edges / Graph Summary	Alignment Objective
PRG (Xu et al., 2024)	Samples, class proxies	Corr(sample, proxy)	Node: embedding corr.; Edge: adj. matrix L2
InfiGFusion (Wang et al., 20 May 2025)	Vocabulary tokens (top-k logits)	Co-activation (outer product)	Gromov–Wasserstein (sorted deg. diff)
P&D (Shin et al., 2023)	Nodes in input graph	Topology via propagation	KL to propagated teacher
CoLE (Liu et al., 2022)	Entities	Top-confidence logits (partial)	KL over subset-marginals

4. Experimental Results and Empirical Findings

Reported results for Graph-on-Logits Distillation show gains across tasks and architectures:

In vision classification, PRG achieves top-1 accuracy of 76.23% (teacher: 77.9%) on CIFAR-100 and 72.44% (teacher: 75.3%) on ImageNet-1K, outperforming Hinton-style KD and closing the gap to zero-shot CLIP. On fine-grained recognition (StanfordCars), PRG (77.32%) even surpasses the teacher (77.3%), attributed to effective filtering of task-irrelevant cues (Xu et al., 2024).
In LLM fusion, InfiGFusion (GLD+ULD+SFT) improves over the pivot model by +2.49 points in average accuracy, with the largest gains on complex reasoning: +37.6 points on Multistep Arithmetic and +34.2 on Causal Judgement (Wang et al., 20 May 2025).
For transductive node classification, propagation-enhanced logit smoothing enables student MLPs to match or nearly match the accuracy of GNN teachers, with P&D yielding 82.2–82.3% (Cora), 74.9% (CiteSeer), and 78.1% (Pubmed) (Shin et al., 2023).
In knowledge graph embedding, CoLE’s co-distillation delivers state-of-the-art Hits@1 and MRR on filtered FB15K-237 and WN18RR, specifically by targeting high-confidence regions (Liu et al., 2022).

5. Comparative Analysis and Design Choices

Distinct GLD variants emphasize different facets of graph-structured alignment:

Sample–proxy graphs (PRG) restrict structural transfer to class relations, excising distractors and background.
Co-activation graphs (InfiGFusion) directly encode token–token co-occurrence, capturing higher-order syntactic and compositional dependencies missed by per-token KL-based fusion.
Propagation-based smoothing (P&D) leverages the relational prior of the underlying graph, enabling non-graph-aware students to mimic propagation-based consistency.
Selective co-distillation (CoLE) recognizes the harm of wholesale logit transfer in heterogeneous model ensembles and enforces competitive, region-wise graph alignment.

A plausible implication is that the effectiveness of GLD hinges critically on both the definition of structural edges and the strategy for filtering/reweighting logits, which together determine the degree to which only task-relevant relations are transferred.

6. Applications, Limitations, and Outlook

GLD methods have been applied to:

Annotation-free knowledge transfer: PRG enables efficient distillation without labels, leveraging zero-shot teacher predictions (Xu et al., 2024).
Model fusion across diverse pre-trained sources: InfiGFusion demonstrates robust fusion of natively heterogeneous LLMs, enhancing zero-shot and multi-step reasoning (Wang et al., 20 May 2025).
Graph representation learning: Propagate & Distill makes MLPs competitive with GNNs for graph benchmarks while maintaining inference efficiency (Shin et al., 2023).
Hybrid embedding and relational modeling: CoLE outperforms prior KG embedding methods by combining structural and textual modalities, especially benefiting long-tail and low-resource entities (Liu et al., 2022).

Reported limitations include the risk of oversmoothing in propagation-based techniques, the need for tuning structural hyperparameters (e.g., propagation depth, top-k selection), and degradation in non-homophilic graphs. Current research explores learnable propagation rules, multi-scale structural alignments, and heterophily-aware extensions. The progression of GLD points to increasing sophistication in structured distillation, with cross-domain relevance spanning vision, natural language processing, and relational learning.