Graph Distillation Units in Multimodal Learning
- Graph Distillation Units (GD-Units) are dynamic, directed graph modules that enable adaptive, peer-to-peer knowledge transfer across modalities.
- They operate on modality representations decoupled into homogeneous and heterogeneous subspaces, using dynamic graphs to assign teacher roles via softmax-normalized edge scores.
- GD-Units improve performance in multimodal tasks like emotion recognition by facilitating fine-grained, context-aware fusion of language, vision, and acoustic features.
Graph Distillation Units (GD-Units) are dynamic, directed graph-based modules for adaptive crossmodal knowledge transfer in multimodal learning frameworks. Within the Decoupled Multimodal Distilling (DMD) approach for emotion recognition, each GD-Unit constructs a learnable graph over modality representations—specifically language, vision, and acoustic features—facilitating flexible, data-driven assignment of peer-to-peer "teacher" roles among modalities. The distillation process leverages both modality-invariant (homogeneous) and modality-exclusive (heterogeneous) feature decompositions, enabling new advances in fine-grained, context-aware fusion and knowledge transfer across heterogeneous data sources (Li et al., 2023).
1. Integration of GD-Units within DMD Architecture
The DMD framework addresses multimodal heterogeneity and varying informativeness by explicitly decoupling each modality's representations into two subspaces:
- Modality-irrelevant ("homogeneous") subspaces, aligning shared representations across modalities.
- Modality-exclusive ("heterogeneous") subspaces, retaining unique modality-specific information.
Two parallel GD-Units operate on these subspaces for each sample:
- HomoGD acts directly on the homogeneous representations.
- HeteroGD first applies pairwise cross-modal attention (via the MulT architecture) to obtain reinforced features, and then distills across them.
Each GD-Unit outputs a scalar distillation loss (one for HomoGD, one for HeteroGD), which is weighted by a trade-off hyperparameter and incorporated into the overall training objective.
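As a minimal sketch of how the two GD-Unit losses enter the objective (the names `alpha` and `beta` are illustrative weighting hyperparameters, not the paper's exact notation):

```python
def total_loss(task_loss, loss_homo, loss_hetero, alpha=1.0, beta=1.0):
    """Combine the main task loss with the HomoGD and HeteroGD
    distillation losses.  alpha/beta are illustrative trade-off
    hyperparameters; DMD's exact weighting scheme may differ."""
    return task_loss + alpha * loss_homo + beta * loss_hetero
```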
2. Dynamic Graph Construction and Representation
Each GD-Unit constructs a directed graph over all M modalities (M = 3 in DMD's typical setting: language, vision, acoustic):
- Vertices: Each node holds a modality-specific feature (and later its predicted logits), forming the basis for distillation.
  - HomoGD uses the homogeneous (modality-invariant) features.
  - HeteroGD uses the MulT-reinforced heterogeneous features.
- Directed Edges: For every ordered pair (i, j) with i ≠ j:
  - A learnable weight W[i, j] (distillation strength from modality i to modality j).
  - A pairwise distillation error ε[i, j] quantifying the output divergence between modalities i and j.
This dynamic graph structure supports non-symmetric, adaptive "teacher" assignments in multimodal distillation.
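A toy sketch of this structure, with an M×M edge-weight matrix whose values are illustrative (not learned): the matrix need not be symmetric, the diagonal is zero (no self-distillation), and each modality's dominant teacher is the largest incoming edge.

```python
import numpy as np

# Toy 3-modality edge-weight matrix (rows = teacher i, cols = student j).
# Values are illustrative placeholders, not learned weights.
modalities = ["language", "vision", "audio"]
W = np.array([
    [0.0, 0.7, 0.6],   # language teaches vision/audio strongly
    [0.6, 0.0, 0.4],
    [0.4, 0.3, 0.0],
])

# Non-symmetric: W[i, j] need not equal W[j, i] (teacher roles are directed)
assert not np.allclose(W, W.T)

# Dominant teacher of each student modality = argmax over incoming edges
teachers = {modalities[j]: modalities[int(W[:, j].argmax())]
            for j in range(3)}
```

Reading off `teachers` recovers the directed "who teaches whom" assignment that the softmax-normalized edges encode.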
3. Mathematical Formalization
The core computations in each GD-Unit comprise:
- Raw Edge Scores: For features x[i] and corresponding logits ℓ[i] = f(x[i]),
  - Concatenate ℓ[i] and x[i] into u[i].
  - Apply an edge MLP g: s[i, j] = g(concat(u[i], u[j])).
- Softmax Edge Normalization: For "incoming" edges to node j,
  W[i, j] = softmax over i of s[i, j],
  ensuring W[i, j] ≥ 0 and Σ_i W[i, j] = 1.
- Distillation Errors: For each ordered pair (i, j), i ≠ j,
  ε[i, j] = ||ℓ[i] − ℓ[j]||₁
  (alternatively, an ℓ₂-norm or KL divergence).
- GD-Unit Loss:
  L_dtl = Σ_{i ≠ j} W[i, j] · ε[i, j].
No additional regularizer is required, as the softmax normalization implicitly constrains the edge weights.
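A tiny numeric check of the normalization step, assuming toy raw scores for two incoming edges to a single node:

```python
import numpy as np

# Toy raw scores for edges into one node from two candidate teachers
s = np.array([2.0, 0.5])
w = np.exp(s) / np.exp(s).sum()   # softmax over incoming edges

# Softmax guarantees nonnegative weights summing to 1 — the implicit
# regularization referred to above; the higher-scoring edge dominates.
```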
4. Learning, Optimization, and Backpropagation
All parameters of the logit regressors (f) and the edge MLP (g) are optimized within the full training objective. Gradients flow from L_dtl through both the edge weights W (via g) and the distillation errors ε (via f). Automatic differentiation is directly applicable; no alternating or self-regressive optimization is required in the GD-Units themselves. The DMD framework's only "self-regression" occurs during the initial feature decoupling, not in the graph distillation step.
5. GD-Unit Forward and Backward Pass: Pseudocode
The following outlines a full pass, where N is the batch size and M is the modality count:
```
# Logits for each modality
for i in range(1, M+1):
    ℓ[i] = f(x[i])                        # N×C

# Raw edge scores and softmax normalization over incoming edges
for j in range(1, M+1):
    for i in range(1, M+1):
        if i != j:
            u_i = concat(ℓ[i], x[i])      # N×(C+d)
            u_j = concat(ℓ[j], x[j])
            s[i, j] = g(concat(u_i, u_j)) # N×1
    W[:, j] = softmax(s[:, j])            # across i ≠ j

# Pairwise distillation errors and weighted loss
for all (i, j), i != j:
    ε[i, j] = ||ℓ[i] - ℓ[j]||₁           # N×1
L_dtl = sum_{i != j} W[i, j] * ε[i, j]
```
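The pseudocode above can be made executable as a NumPy sketch. Here f and g are stood in for by random linear maps purely for illustration (in DMD they are learned networks), and the diagonal of the score tensor is masked with −∞ so self-edges receive zero weight after the softmax:

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, d, C = 4, 3, 8, 6    # batch, modalities, feature dim, classes (toy sizes)

# Stand-ins for the logit regressor f and edge MLP g: random linear maps.
Wf = rng.standard_normal((d, C))
Wg = rng.standard_normal((2 * (C + d), 1))

x = rng.standard_normal((M, N, d))         # modality features
logits = x @ Wf                            # ℓ[i] = f(x[i]), shape M×N×C

u = np.concatenate([logits, x], axis=-1)   # u[i] = concat(ℓ[i], x[i])

# Raw edge scores s[i, j] for ordered pairs i != j; -inf masks self-edges
s = np.full((M, M, N), -np.inf)
for j in range(M):
    for i in range(M):
        if i != j:
            pair = np.concatenate([u[i], u[j]], axis=-1)  # N×2(C+d)
            s[i, j] = (pair @ Wg).squeeze(-1)

# Softmax over incoming edges i for each target j (columns sum to 1)
e = np.exp(s - s.max(axis=0, keepdims=True))
W = e / e.sum(axis=0, keepdims=True)

# L1 distillation errors ε[i, j] and the weighted GD-Unit loss
eps = np.abs(logits[:, None] - logits[None, :]).sum(-1)   # M×M×N
L_dtl = sum(W[i, j] @ eps[i, j]
            for i in range(M) for j in range(M) if i != j) / N
```

The −∞ masking plus softmax reproduces the normalization property from Section 3: each column of W is a distribution over candidate teachers, with no self-teaching.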
6. Empirical Patterns and Interpretations in Edge Weights
Visualization of learned edge weights on datasets such as CMU-MOSEI highlights emergent, data-driven teaching hierarchies among modalities:
- HomoGD (homogeneous space): The language→vision and language→audio edges are large, indicating text (language) often acts as the principal teacher modality to vision and audio. Vision↔audio connections remain weak, reflecting text's informativeness in the shared feature space.
- HeteroGD (heterogeneous space with MulT-reinforced features): The language→vision and language→audio edges remain high, but vision→audio also grows, as reinforced visual features (via cross-modal attention) become stronger teachers for audio.
These learned patterns demonstrate the GD-Unit's capacity for adaptive crossmodal knowledge transfer, dynamically discovering "who should teach whom" in each subspace rather than imposing static or a priori directionalities.
| GD-Unit Type | Dominant Teacher Roles | Notable Secondary Patterns |
|---|---|---|
| HomoGD | Language → Vision/Audio | Weak Vision ↔ Audio |
| HeteroGD | Language → Vision/Audio | Vision → Audio increases |
7. Context, Significance, and Research Outlook
GD-Units introduce a principled, end-to-end learnable mechanism for fine-grained, peer-to-peer distillation among modalities, moving beyond static or symmetric crossmodal distillation schemas. By leveraging dynamic directed graphs whose edge weights are jointly learned with main task objectives, the DMD framework achieves superior performance in multimodal emotion recognition, substantiating the utility of flexible crossmodal knowledge transfer (Li et al., 2023).
This suggests that graph-based, adaptive distillation architectures may generalize to other multimodal or multi-view tasks, especially where relative modality informativeness is context-dependent. A plausible implication is the broader applicability of dynamic distillation graphs for robust information fusion in self-supervised, semi-supervised, or domain-adaptive multimodal systems.