Hierarchical Gated Fusion Decoder (HiGate)

Updated 24 December 2025
  • Hierarchical Gated Fusion Decoder (HiGate) is a multimodal mechanism that progressively injects gated information between modalities for improved active speaker detection.
  • It employs layer-wise fusion at designated Transformer depths, enabling fine-grained, context-aware integration of audio and visual features.
  • Empirical results show significant performance gains on ASD benchmarks with minimal computational overhead, validating its efficiency and effectiveness.

The Hierarchical Gated Fusion Decoder (HiGate) is an architectural component for multimodal learning that performs progressive, layer-wise gated information injection between modalities—specifically designed to address limitations of late fusion in tasks such as Active Speaker Detection (ASD). HiGate was introduced within the GateFusion model, which achieved state-of-the-art results on several ASD benchmarks by enabling fine-grained cross-modal interactions between strong pretrained audio and visual encoders (Wang et al., 17 Dec 2025).

1. Architectural Role and Integration

HiGate sits at the interface between two strong pretrained unimodal encoders (AV-HuBERT for video, Whisper for audio) and a lightweight classifier. Each encoder consists of a deep stack of Transformer or hybrid ResNet+Transformer layers, with hidden states $h_m^0, h_m^1, \ldots, h_m^L$ for modality $m \in \{v, a\}$. Hidden states from the final encoder layer are projected to a shared feature width $F$ via a linear transformation:

$$f_m = \phi(h_m^L) \in \mathbb{R}^{B \times T_m \times F}$$

where $B$ is the batch size and $T_m$ is the sequence length of modality $m$.

HiGate does not simply fuse $f_v$ and $f_a$ at the end (“late fusion”); instead, it refines the representation of a “primary” modality by repeatedly injecting gated information from several depths of the “context” modality’s encoder. This process is conducted symmetrically in both directions, producing enriched representations $\tilde f_v$ and $\tilde f_a$, which are temporally aligned and summed before classification.
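As a concrete illustration, the following is a minimal PyTorch sketch of the projection $\phi$ and one plausible choice (linear interpolation) for the temporal alignment operator $\mathcal{I}$ introduced in the next section. The module and function names are illustrative assumptions, not identifiers from the paper.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedProjection(nn.Module):
    """phi: project encoder hidden states to the shared feature width F."""
    def __init__(self, hidden_dim: int, shared_dim: int = 1280):
        super().__init__()
        self.linear = nn.Linear(hidden_dim, shared_dim)

    def forward(self, h):                 # h: (B, T_m, hidden_dim)
        return self.linear(h)             # (B, T_m, F)

def align_time(x, target_len):
    """I(., T_p): resample the time axis of x to target_len frames (one possible alignment)."""
    x = x.transpose(1, 2)                 # (B, T_c, F) -> (B, F, T_c)
    x = F.interpolate(x, size=target_len, mode="linear", align_corners=False)
    return x.transpose(1, 2)              # (B, target_len, F)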

2. Gated Fusion Mechanism

For a given fusion step at context layer $l$, HiGate operates as follows:

  • The hidden state $h_c^l$ from the context modality is projected to width $F$ and temporally aligned to the primary modality’s length, producing $\tilde h_c^l = \mathcal{I}(\phi(h_c^l), T_p)$, where $\mathcal{I}$ denotes pooling or interpolation.
  • The primary feature $\tilde f_p$ and the aligned context feature $\tilde h_c^l$ are concatenated and passed through a linear layer and sigmoid activation to generate a bimodally-conditioned gate:

$$g^l = \sigma\left(W_g^l [\tilde f_p; \tilde h_c^l] + b_g^l\right), \quad W_g^l \in \mathbb{R}^{F \times 2F},\ b_g^l \in \mathbb{R}^{F}$$

  • The context feature, modulated by the gate, is added to the primary feature, and the result is normalized with LayerNorm:

$$\tilde f_p \leftarrow \mathrm{LN}(\tilde f_p + g^l \odot \tilde h_c^l)$$

This process allows each fusion step to adaptively blend multimodal information based on both modalities’ current representations, promoting fine-grained, context-aware enrichment at each selected layer.
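The sketch below shows one such gated injection step in PyTorch; it follows the equations above, but the class name and default width are illustrative assumptions rather than details taken from the paper.

import torch
import torch.nn as nn

class GatedInjection(nn.Module):
    """One HiGate fusion step: g = sigma(W_g [f_p; h_c] + b_g); f_p <- LN(f_p + g * h_c)."""
    def __init__(self, shared_dim: int = 1280):
        super().__init__()
        self.gate = nn.Linear(2 * shared_dim, shared_dim)   # W_g^l and b_g^l
        self.norm = nn.LayerNorm(shared_dim)

    def forward(self, f_p, h_c):
        # f_p: primary features (B, T_p, F); h_c: projected, aligned context features (B, T_p, F)
        g = torch.sigmoid(self.gate(torch.cat([f_p, h_c], dim=-1)))   # bimodally-conditioned gate
        return self.norm(f_p + g * h_c)                               # gated injection + LayerNorm

Because the gate is computed from both modalities’ current features, it can attenuate unreliable context (e.g., noisy audio) on a per-frame, per-channel basis.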

3. Progressive Multi-Depth Fusion Strategy

GateFusion applies HiGate at four designated encoder depths ($N = 4$): layers $\{1, 4, 7, 10\}$ of the first 12 Transformer layers. This “coarse-to-fine” schedule progressively incorporates low-, mid-, and high-level features from the context modality. The full fusion procedure, with video as the primary modality and audio as the context modality, is as follows:

Input: f_v, {h_a^l}_{l=1..L}, L_fuse = {1, 4, 7, 10}
Initialize: f̃_v ← f_v

for l in L_fuse do
    h̃_a ← I(φ(h_a^l), T_v)                # project & temporally align
    g^l ← σ( W_g^l [f̃_v; h̃_a] + b_g^l )   # bimodally-conditioned gate
    f̃_v ← LN( f̃_v + g^l ⊙ h̃_a )           # gated injection + LayerNorm
end for

Output: enriched f̃_v

A symmetric process produces $\tilde f_a$ by fusing video hidden states into the audio branch. The final multimodal representation is the sum of the temporally aligned $\tilde f_v$ and $\tilde f_a$, which is classified by a two-layer MLP.
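Putting the pieces together, the following PyTorch sketch outlines a bidirectional HiGate decoder operating on per-layer hidden-state lists from both encoders. The alignment choice (linear interpolation), module organization, and classifier shape are assumptions for illustration, not the paper’s reference implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class HiGateDecoder(nn.Module):
    """Bidirectional progressive gated fusion over selected context-encoder depths (sketch)."""
    def __init__(self, dim_v, dim_a, shared_dim=1280, fuse_layers=(1, 4, 7, 10), num_classes=2):
        super().__init__()
        self.fuse_layers = fuse_layers
        # phi: per-layer projections of context hidden states to the shared width F
        self.proj_ctx_a = nn.ModuleDict({str(l): nn.Linear(dim_a, shared_dim) for l in fuse_layers})
        self.proj_ctx_v = nn.ModuleDict({str(l): nn.Linear(dim_v, shared_dim) for l in fuse_layers})
        self.proj_v = nn.Linear(dim_v, shared_dim)   # final-layer projections
        self.proj_a = nn.Linear(dim_a, shared_dim)
        # one gate and LayerNorm per fusion depth and direction
        self.gates_a2v = nn.ModuleList([nn.Linear(2 * shared_dim, shared_dim) for _ in fuse_layers])
        self.gates_v2a = nn.ModuleList([nn.Linear(2 * shared_dim, shared_dim) for _ in fuse_layers])
        self.norms_a2v = nn.ModuleList([nn.LayerNorm(shared_dim) for _ in fuse_layers])
        self.norms_v2a = nn.ModuleList([nn.LayerNorm(shared_dim) for _ in fuse_layers])
        self.classifier = nn.Sequential(
            nn.Linear(shared_dim, shared_dim), nn.GELU(), nn.Linear(shared_dim, num_classes))

    @staticmethod
    def _align(x, target_len):
        # I(., T): resample the time axis to target_len frames
        return F.interpolate(x.transpose(1, 2), size=target_len,
                             mode="linear", align_corners=False).transpose(1, 2)

    def _fuse(self, f_p, ctx_states, projs, gates, norms, T_p):
        for i, l in enumerate(self.fuse_layers):
            h_c = self._align(projs[str(l)](ctx_states[l]), T_p)          # project & align
            g = torch.sigmoid(gates[i](torch.cat([f_p, h_c], dim=-1)))    # gate
            f_p = norms[i](f_p + g * h_c)                                 # inject + LayerNorm
        return f_p

    def forward(self, hv_states, ha_states):
        # hv_states / ha_states: lists of per-layer hidden states, indexed 0..L
        f_v, f_a = self.proj_v(hv_states[-1]), self.proj_a(ha_states[-1])
        T_v, T_a = f_v.size(1), f_a.size(1)
        f_v = self._fuse(f_v, ha_states, self.proj_ctx_a, self.gates_a2v, self.norms_a2v, T_v)
        f_a = self._fuse(f_a, hv_states, self.proj_ctx_v, self.gates_v2a, self.norms_v2a, T_a)
        fused = f_v + self._align(f_a, T_v)       # temporally align, then sum the two branches
        return self.classifier(fused)             # per-frame logits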

4. Auxiliary Objectives and Loss Interaction

To improve the robustness and consistency of unimodal branches, HiGate incorporates two auxiliary objectives in addition to the main multimodal cross-entropy loss.

  • Masked Alignment Loss (MAL): For each positive frame (speaker present), MAL computes the KL divergence between each unimodal output and the joint multimodal prediction, with masking so that only active-speech intervals contribute:

$$\mathcal L_{\mathrm{MAL}} = \frac{1}{2|S|} \sum_{i \in S} \left(\mathrm{KL}_a^{(i)} + \mathrm{KL}_v^{(i)}\right)$$

where $S$ indexes positive frames and $\mathrm{KL}_a^{(i)}$ (resp. $\mathrm{KL}_v^{(i)}$) measures the divergence between the joint output and the audio (resp. video) prediction.

  • Over-Positive Penalty (OPP): To suppress spurious video-only activations when no speech is present,

$$\mathcal L_{\mathrm{OPP}} = \frac{1}{M} \sum_{i=1}^{M} p_v^{(i)}[1]\,(1 - y^{(i)})$$

where $p_v^{(i)}[1]$ is the video branch’s positive-class probability on frame $i$ and $y^{(i)}$ is the ground-truth label.

  • Total loss: The combined objective is:

$$\mathcal L = \mathcal L_{\mathrm{CLS}} + \lambda_{\mathrm{MAL}} \mathcal L_{\mathrm{MAL}} + \lambda_{\mathrm{OPP}} \mathcal L_{\mathrm{OPP}}$$

with $\lambda_{\mathrm{MAL}} = 0.01$ and $\lambda_{\mathrm{OPP}} = 0.1$.

These auxiliary terms encourage unimodal and multimodal congruence during speaking intervals and penalize silent-frame false positives from visual signals.
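A minimal sketch of the combined objective in PyTorch is given below, assuming frame-level logits from the joint, audio, and video branches. The direction of the KL terms and the use of softmax probabilities are plausible readings of the formulas above, not confirmed implementation details.

import torch
import torch.nn.functional as F

def gatefusion_loss(logits_joint, logits_a, logits_v, labels,
                    lambda_mal=0.01, lambda_opp=0.1):
    """CE + MAL + OPP (sketch). logits_*: (N, 2) frame-level logits; labels: (N,), 1 = speaking."""
    # Main multimodal cross-entropy loss
    loss_cls = F.cross_entropy(logits_joint, labels)

    # Masked Alignment Loss: KL between joint and unimodal predictions on positive frames only
    pos = labels == 1
    if pos.any():
        p_joint = F.softmax(logits_joint[pos], dim=-1)
        kl_a = F.kl_div(F.log_softmax(logits_a[pos], dim=-1), p_joint, reduction="batchmean")
        kl_v = F.kl_div(F.log_softmax(logits_v[pos], dim=-1), p_joint, reduction="batchmean")
        loss_mal = 0.5 * (kl_a + kl_v)
    else:
        loss_mal = logits_joint.new_zeros(())

    # Over-Positive Penalty: video positive-class probability on non-speaking frames
    p_v_pos = F.softmax(logits_v, dim=-1)[:, 1]
    loss_opp = (p_v_pos * (1 - labels.float())).mean()

    return loss_cls + lambda_mal * loss_mal + lambda_opp * loss_opp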

5. Model Capacity, Training, and Computational Footprint

HiGate introduces additional parameters primarily via the $N = 4$ gating modules at width $F = 1280$: each gate contributes a weight matrix $W_g^l \in \mathbb{R}^{F \times 2F}$ plus a bias, i.e. roughly $4 \times (2F^2 + F) \approx 13$ million parameters over the baseline. The computational overhead is moderate:

  • VRAM usage increases by approximately $8\%$ compared to a simple late-fusion scheme.
  • Inference speed drops by $10$–$15\%$ relative to late fusion, but remains $75\%$ faster than attention-heavy decoders such as LoCoNet.

Training uses the AdamW optimizer with an encoder learning rate of $5 \times 10^{-5}$ and a decoder learning rate of $1 \times 10^{-4}$, a batch size of $1,500$ frames, and a truncated $L = 12$ Transformer-layer regime. Learning rates are decayed by a factor of $0.95$ every $3,000$ steps, for a total of $30,000$ steps.
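For reference, the following PyTorch sketch mirrors this optimization schedule; the encoder and decoder modules are placeholders standing in for the truncated pretrained encoders and the HiGate decoder plus classifier.

import torch
import torch.nn as nn

# Placeholder modules standing in for the truncated encoders and the HiGate decoder + classifier.
encoder = nn.Linear(1280, 1280)
decoder = nn.Linear(1280, 2)

optimizer = torch.optim.AdamW([
    {"params": encoder.parameters(), "lr": 5e-5},   # encoder learning rate
    {"params": decoder.parameters(), "lr": 1e-4},   # decoder learning rate
])
# Decay both learning rates by a factor of 0.95 every 3,000 steps, for 30,000 steps total.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=3000, gamma=0.95)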

6. Empirical Performance and Context

HiGate demonstrated substantial empirical improvements for ASD, including gains of $+9.4\%$ mAP (to $77.8\%$) on Ego4D-ASD, $+2.9\%$ (to $86.1\%$) on UniTalk, and $+0.5\%$ (to $96.1\%$) on WASD. Relative to late fusion, it preserved fine-grained information and enabled multi-level context-aware enrichment, as evidenced by notable gains (e.g., $+5.3$ mAP on Ego4D-ASD). The model also showed robust generalization across domains.

A plausible implication is that HiGate’s progressive, layerwise cross-modal gating paradigm can be adapted for other multimodal tasks where late fusion bottlenecks fine-grained interaction and temporal alignment. The auxiliary objective integration further suggests avenues for principled unimodal-multimodal consistency regularization in related applications (Wang et al., 17 Dec 2025).
