Hierarchical Gated Fusion Decoder (HiGate)

Updated 24 December 2025
  • Hierarchical Gated Fusion Decoder (HiGate) is a multimodal mechanism that progressively injects gated information between modalities for improved active speaker detection.
  • It employs layer-wise fusion at designated Transformer depths, enabling fine-grained, context-aware integration of audio and visual features.
  • Empirical results show significant performance gains on ASD benchmarks with minimal computational overhead, validating its efficiency and effectiveness.

The Hierarchical Gated Fusion Decoder (HiGate) is an architectural component for multimodal learning that performs progressive, layer-wise gated information injection between modalities—specifically designed to address limitations of late fusion in tasks such as Active Speaker Detection (ASD). HiGate was introduced within the GateFusion model, which achieved state-of-the-art results on several ASD benchmarks by enabling fine-grained cross-modal interactions between strong pretrained audio and visual encoders (Wang et al., 17 Dec 2025).

1. Architectural Role and Integration

HiGate sits at the interface between two strong pretrained unimodal encoders (AV-HuBERT for video, Whisper for audio) and a lightweight classifier. Each encoder consists of a deep stack of Transformer or hybrid ResNet+Transformer layers, with hidden states $h_m^0, h_m^1, \ldots, h_m^L$ for modality $m \in \{v, a\}$. Hidden states from the final encoder layer are projected to a shared feature width $F$ via a linear transformation:

$$f_m = \phi(h_m^L) \in \mathbb{R}^{B \times T_m \times F}$$

where $B$ is the batch size and $T_m$ is the sequence length of modality $m$.

HiGate does not simply fuse $f_v$ and $f_a$ at the end (“late fusion”); instead, it refines the representation of a “primary” modality by repeatedly injecting gated information from several depths of the “context” modality’s encoder. This process is conducted symmetrically in both directions, producing enriched representations $\tilde f_v$ and $\tilde f_a$, which are temporally aligned and summed before classification.
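As a concrete illustration, the following is a minimal PyTorch sketch of the projection $\phi$ and one plausible choice (linear interpolation) for the temporal alignment operator $\mathcal{I}$ introduced in the next section. The module and function names are illustrative assumptions, not identifiers from the paper.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedProjection(nn.Module):
    """phi: project encoder hidden states to the shared feature width F."""
    def __init__(self, hidden_dim: int, shared_dim: int = 1280):
        super().__init__()
        self.linear = nn.Linear(hidden_dim, shared_dim)

    def forward(self, h):                 # h: (B, T_m, hidden_dim)
        return self.linear(h)             # (B, T_m, F)

def align_time(x, target_len):
    """I(., T_p): resample the time axis of x to target_len frames (one possible alignment)."""
    x = x.transpose(1, 2)                 # (B, T_c, F) -> (B, F, T_c)
    x = F.interpolate(x, size=target_len, mode="linear", align_corners=False)
    return x.transpose(1, 2)              # (B, target_len, F)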

2. Gated Fusion Mechanism

For a given fusion step at context layer $l$, HiGate operates as follows:

  • The hidden state $h_c^l$ from the context modality is projected to width $F$ and temporally aligned to the primary modality’s length, producing $\tilde h_c^l = \mathcal{I}(\phi(h_c^l), T_p)$, where $\mathcal{I}$ denotes pooling or interpolation.
  • The primary feature $\tilde f_p$ and the aligned context feature $\tilde h_c^l$ are concatenated and passed through a linear layer and sigmoid activation to generate a bimodally-conditioned gate:

$$g^l = \sigma\left(W_g^l [\tilde f_p; \tilde h_c^l] + b_g^l\right), \quad W_g^l \in \mathbb{R}^{F \times 2F},\ b_g^l \in \mathbb{R}^{F}$$

  • The context feature, modulated by the gate, is added to the primary feature, and the result is normalized with LayerNorm:

$$\tilde f_p \leftarrow \mathrm{LN}(\tilde f_p + g^l \odot \tilde h_c^l)$$

This process allows each fusion step to adaptively blend multimodal information based on both modalities’ current representations, promoting fine-grained, context-aware enrichment at each selected layer.
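The sketch below shows one such gated injection step in PyTorch; it follows the equations above, but the class name and default width are illustrative assumptions rather than details taken from the paper.

import torch
import torch.nn as nn

class GatedInjection(nn.Module):
    """One HiGate fusion step: g = sigma(W_g [f_p; h_c] + b_g); f_p <- LN(f_p + g * h_c)."""
    def __init__(self, shared_dim: int = 1280):
        super().__init__()
        self.gate = nn.Linear(2 * shared_dim, shared_dim)   # W_g^l and b_g^l
        self.norm = nn.LayerNorm(shared_dim)

    def forward(self, f_p, h_c):
        # f_p: primary features (B, T_p, F); h_c: projected, aligned context features (B, T_p, F)
        g = torch.sigmoid(self.gate(torch.cat([f_p, h_c], dim=-1)))   # bimodally-conditioned gate
        return self.norm(f_p + g * h_c)                               # gated injection + LayerNorm

Because the gate is computed from both modalities’ current features, it can attenuate unreliable context (e.g., noisy audio) on a per-frame, per-channel basis.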

3. Progressive Multi-Depth Fusion Strategy

GateFusion applies HiGate at four designated encoder depths ($N = 4$): layers $\{1, 4, 7, 10\}$ of the first 12 Transformer layers. This “coarse-to-fine” schedule progressively incorporates low-, mid-, and high-level features from the context modality. The full fusion procedure, with video as the primary modality and audio as the context modality, is as follows:

Input: f_v, {h_a^l}_{l=1..L}, L_fuse = {1, 4, 7, 10}
Initialize: f̃_v ← f_v

for l in L_fuse do
    h̃_a ← I(φ(h_a^l), T_v)                # project & temporally align
    g^l ← σ( W_g^l [f̃_v; h̃_a] + b_g^l )   # bimodally-conditioned gate
    f̃_v ← LN( f̃_v + g^l ⊙ h̃_a )           # gated injection + LayerNorm
end for

Output: enriched f̃_v

A symmetric process produces $\tilde f_a$ by fusing video hidden states into the audio branch. The final multimodal representation is the sum of the temporally aligned $\tilde f_v$ and $\tilde f_a$, which is classified by a two-layer MLP.
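Putting the pieces together, the following PyTorch sketch outlines a bidirectional HiGate decoder operating on per-layer hidden-state lists from both encoders. The alignment choice (linear interpolation), module organization, and classifier shape are assumptions for illustration, not the paper’s reference implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class HiGateDecoder(nn.Module):
    """Bidirectional progressive gated fusion over selected context-encoder depths (sketch)."""
    def __init__(self, dim_v, dim_a, shared_dim=1280, fuse_layers=(1, 4, 7, 10), num_classes=2):
        super().__init__()
        self.fuse_layers = fuse_layers
        # phi: per-layer projections of context hidden states to the shared width F
        self.proj_ctx_a = nn.ModuleDict({str(l): nn.Linear(dim_a, shared_dim) for l in fuse_layers})
        self.proj_ctx_v = nn.ModuleDict({str(l): nn.Linear(dim_v, shared_dim) for l in fuse_layers})
        self.proj_v = nn.Linear(dim_v, shared_dim)   # final-layer projections
        self.proj_a = nn.Linear(dim_a, shared_dim)
        # one gate and LayerNorm per fusion depth and direction
        self.gates_a2v = nn.ModuleList([nn.Linear(2 * shared_dim, shared_dim) for _ in fuse_layers])
        self.gates_v2a = nn.ModuleList([nn.Linear(2 * shared_dim, shared_dim) for _ in fuse_layers])
        self.norms_a2v = nn.ModuleList([nn.LayerNorm(shared_dim) for _ in fuse_layers])
        self.norms_v2a = nn.ModuleList([nn.LayerNorm(shared_dim) for _ in fuse_layers])
        self.classifier = nn.Sequential(
            nn.Linear(shared_dim, shared_dim), nn.GELU(), nn.Linear(shared_dim, num_classes))

    @staticmethod
    def _align(x, target_len):
        # I(., T): resample the time axis to target_len frames
        return F.interpolate(x.transpose(1, 2), size=target_len,
                             mode="linear", align_corners=False).transpose(1, 2)

    def _fuse(self, f_p, ctx_states, projs, gates, norms, T_p):
        for i, l in enumerate(self.fuse_layers):
            h_c = self._align(projs[str(l)](ctx_states[l]), T_p)          # project & align
            g = torch.sigmoid(gates[i](torch.cat([f_p, h_c], dim=-1)))    # gate
            f_p = norms[i](f_p + g * h_c)                                 # inject + LayerNorm
        return f_p

    def forward(self, hv_states, ha_states):
        # hv_states / ha_states: lists of per-layer hidden states, indexed 0..L
        f_v, f_a = self.proj_v(hv_states[-1]), self.proj_a(ha_states[-1])
        T_v, T_a = f_v.size(1), f_a.size(1)
        f_v = self._fuse(f_v, ha_states, self.proj_ctx_a, self.gates_a2v, self.norms_a2v, T_v)
        f_a = self._fuse(f_a, hv_states, self.proj_ctx_v, self.gates_v2a, self.norms_v2a, T_a)
        fused = f_v + self._align(f_a, T_v)       # temporally align, then sum the two branches
        return self.classifier(fused)             # per-frame logits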

4. Auxiliary Objectives and Loss Interaction

To improve the robustness and consistency of unimodal branches, HiGate incorporates two auxiliary objectives in addition to the main multimodal cross-entropy loss.

  • Masked Alignment Loss (MAL): For each positive frame (speaker present), MAL computes the KL divergence between each unimodal output and the joint multimodal prediction, with masking so that only active-speech intervals contribute:

$$\mathcal L_{\mathrm{MAL}} = \frac{1}{2|S|} \sum_{i \in S} \left(\mathrm{KL}_a^{(i)} + \mathrm{KL}_v^{(i)}\right)$$

where $S$ indexes positive frames and $\mathrm{KL}_a^{(i)}$ (resp. $\mathrm{KL}_v^{(i)}$) measures the divergence between the joint output and the audio (resp. video) prediction.

  • Over-Positive Penalty (OPP): To suppress spurious video-only activations when no speech is present,

$$\mathcal L_{\mathrm{OPP}} = \frac{1}{M} \sum_{i=1}^{M} p_v^{(i)}[1]\,(1 - y^{(i)})$$

where $p_v^{(i)}[1]$ is the video branch’s positive-class probability on frame $i$ and $y^{(i)}$ is the ground-truth label.

  • Total loss: The combined objective is:

$$\mathcal L = \mathcal L_{\mathrm{CLS}} + \lambda_{\mathrm{MAL}} \mathcal L_{\mathrm{MAL}} + \lambda_{\mathrm{OPP}} \mathcal L_{\mathrm{OPP}}$$

with $\lambda_{\mathrm{MAL}} = 0.01$ and $\lambda_{\mathrm{OPP}} = 0.1$.

These auxiliary terms encourage unimodal and multimodal congruence during speaking intervals and penalize silent-frame false positives from visual signals.
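A minimal sketch of the combined objective in PyTorch is given below, assuming frame-level logits from the joint, audio, and video branches. The direction of the KL terms and the use of softmax probabilities are plausible readings of the formulas above, not confirmed implementation details.

import torch
import torch.nn.functional as F

def gatefusion_loss(logits_joint, logits_a, logits_v, labels,
                    lambda_mal=0.01, lambda_opp=0.1):
    """CE + MAL + OPP (sketch). logits_*: (N, 2) frame-level logits; labels: (N,), 1 = speaking."""
    # Main multimodal cross-entropy loss
    loss_cls = F.cross_entropy(logits_joint, labels)

    # Masked Alignment Loss: KL between joint and unimodal predictions on positive frames only
    pos = labels == 1
    if pos.any():
        p_joint = F.softmax(logits_joint[pos], dim=-1)
        kl_a = F.kl_div(F.log_softmax(logits_a[pos], dim=-1), p_joint, reduction="batchmean")
        kl_v = F.kl_div(F.log_softmax(logits_v[pos], dim=-1), p_joint, reduction="batchmean")
        loss_mal = 0.5 * (kl_a + kl_v)
    else:
        loss_mal = logits_joint.new_zeros(())

    # Over-Positive Penalty: video positive-class probability on non-speaking frames
    p_v_pos = F.softmax(logits_v, dim=-1)[:, 1]
    loss_opp = (p_v_pos * (1 - labels.float())).mean()

    return loss_cls + lambda_mal * loss_mal + lambda_opp * loss_opp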

5. Model Capacity, Training, and Computational Footprint

HiGate introduces additional parameters primarily via the $N = 4$ gating modules at width $F = 1280$: each gate contributes a weight matrix $W_g^l \in \mathbb{R}^{F \times 2F}$ plus a bias, i.e. roughly $4 \times (2F^2 + F) \approx 13$ million parameters over the baseline. The computational overhead is moderate:

  • VRAM usage increases by approximately $8\%$ compared to a simple late-fusion scheme.
  • Inference speed drops by $10$–$15\%$ relative to late fusion, but remains $75\%$ faster than attention-heavy decoders such as LoCoNet.

Training uses the AdamW optimizer with an encoder learning rate of $5 \times 10^{-5}$ and a decoder learning rate of $1 \times 10^{-4}$, a batch size of $1,500$ frames, and a truncated $L = 12$ Transformer-layer regime. Learning rates are decayed by a factor of $0.95$ every $3,000$ steps, for a total of $30,000$ steps.
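For reference, the following PyTorch sketch mirrors this optimization schedule; the encoder and decoder modules are placeholders standing in for the truncated pretrained encoders and the HiGate decoder plus classifier.

import torch
import torch.nn as nn

# Placeholder modules standing in for the truncated encoders and the HiGate decoder + classifier.
encoder = nn.Linear(1280, 1280)
decoder = nn.Linear(1280, 2)

optimizer = torch.optim.AdamW([
    {"params": encoder.parameters(), "lr": 5e-5},   # encoder learning rate
    {"params": decoder.parameters(), "lr": 1e-4},   # decoder learning rate
])
# Decay both learning rates by a factor of 0.95 every 3,000 steps, for 30,000 steps total.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=3000, gamma=0.95)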

6. Empirical Performance and Context

HiGate demonstrated substantial empirical improvements for ASD, including gains of $+9.4\%$ mAP (to $77.8\%$) on Ego4D-ASD, $+2.9\%$ (to $86.1\%$) on UniTalk, and $+0.5\%$ (to $96.1\%$) on WASD. Relative to late fusion, it preserved fine-grained information and enabled multi-level context-aware enrichment, as evidenced by notable gains (e.g., $+5.3$ mAP on Ego4D-ASD). The model also showed robust generalization across domains.

A plausible implication is that HiGate’s progressive, layerwise cross-modal gating paradigm can be adapted for other multimodal tasks where late fusion bottlenecks fine-grained interaction and temporal alignment. The auxiliary objective integration further suggests avenues for principled unimodal-multimodal consistency regularization in related applications (Wang et al., 17 Dec 2025).
