Fork-Merge Decoding for AV-LLMs
- FMD is an inference-time algorithm that addresses modality bias by separating early unimodal reasoning from later joint fusion.
- It employs a fork phase for isolated audio and visual processing, followed by a merge phase that combines these modalities without retraining.
- Empirical results show that FMD improves cross-modal accuracy and reduces hallucinations, with attention-guided fusion applied in both token-wise and channel-wise settings.
Fork-Merge Decoding (FMD) is an inference-time algorithm designed to address modality bias in audio-visual LLMs (AV-LLMs). Modality bias arises when joint processing of audio and visual inputs in existing AV-LLMs causes the model to over-rely on the most informative or easily learned modality (often visual), leading to suboptimal utilization of complementary cues and, in some cases, hallucinations. FMD introduces modality-specific reasoning by initially processing audio-only and video-only inputs through the early decoder layers—a stage termed the "fork phase"—and subsequently merging the resulting hidden states for joint reasoning in the later layers, known as the "merge phase." This approach requires no additional training or architectural modification and is compatible with a variety of decoding paradigms, improving balanced multimodal understanding in AV-LLMs such as VideoLLaMA2 and video-SALMONN without decreasing inference efficiency (Jung et al., 27 May 2025).
1. Motivation and Theoretical Foundations
Modern AV-LLMs typically concatenate audio and video features alongside text tokens, jointly encoding these modalities from the decoder's input layer onward. This joint fusion, while efficient for unified multimodal reasoning, often induces a bias: the model prefers the "dominant" or most discriminative modality, thus compromising the fidelity of contributions from the other modality. The consequence is an imbalanced feature attribution and potential hallucinations in tasks requiring nuanced audio-visual integration.
Fork-Merge Decoding addresses this challenge by enforcing unimodal reasoning prior to fusion. By isolating modalities during the initial decoder layers, FMD forces the model to extract salient information from both audio and video streams independently. The subsequent merge enables the model to leverage cross-modal complementarities in higher layers, resulting in more robust and balanced decision making. This functional separation requires no retraining or architecture changes and operates entirely at inference via input manipulation and strategically timed merging of states (Jung et al., 27 May 2025).
2. Algorithmic Details
Inputs and Parameters
- Decoder: Pretrained AV-LLM decoder $\phi$ with $L$ layers.
- Inputs: Video-frame embeddings $X_v$ (length $N_v$), audio embeddings $X_a$ (length $N_a$), and text embeddings $X_t$ (length $N_t$).
- Fork depth: $k$, the number of early layers assigned to unimodal reasoning (e.g., $k = 5$ for VideoLLaMA2, $k = 8$ for video-SALMONN).
- Fusion weight: $\alpha$ (e.g., $\alpha = 0.8$; optionally attention-guided); see the configuration sketch below.
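To make these parameters concrete, here is a minimal configuration sketch; the class, field names, and defaults are illustrative assumptions rather than the authors' released code.

```python
from dataclasses import dataclass
from typing import Literal

@dataclass
class FMDConfig:
    """Hypothetical container for Fork-Merge Decoding hyperparameters."""
    fork_depth: int = 5                  # k: early layers run per modality (5 for VideoLLaMA2, 8 for video-SALMONN)
    fusion_weight: float = 0.8           # alpha: weight on the video-preserving branch at the merge
    attention_guided: bool = True        # recompute alpha from fork-layer attention instead of the fixed value
    fusion_type: Literal["token", "channel"] = "token"  # token-wise (VideoLLaMA2) vs channel-wise (video-SALMONN)
```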
Procedural Steps (Token-wise Fusion Case)
- Input Construction: Concatenate the embeddings: $X = [X_v; X_a; X_t]$.
- Masked Inputs: Create two variants:
  - Video-preserving: $X^{(v)} = [X_v; \mathbf{0}; X_t]$ (audio embeddings zero-masked)
  - Audio-preserving: $X^{(a)} = [\mathbf{0}; X_a; X_t]$ (video embeddings zero-masked)
- Fork Phase: Independently process both masked sequences through the first $k$ decoder layers:
  - $H^{(v)} = \phi_{1:k}(X^{(v)})$
  - $H^{(a)} = \phi_{1:k}(X^{(a)})$
- Fusion Weight Calculation (Optional): Attention-guided $\alpha$ derived from the attention distributions at the last fork layer:
  - $a_v$: attention mass assigned to the video tokens in the video-preserving branch
  - $a_a$: attention mass assigned to the audio tokens in the audio-preserving branch
  - $\alpha = a_v / (a_v + a_a)$
- Merge Phase: Form the merged states $\tilde{H}$ by combining the two branches at corresponding token positions (video, audio, and text alike in the token-wise case):
  - $\tilde{H}_i = \alpha\, H^{(v)}_i + (1 - \alpha)\, H^{(a)}_i$
- Joint Decoding: Continue decoding from layer $k+1$ to layer $L$:
  - $\phi_{k+1:L}(\tilde{H})$
  - Output logits for next-token prediction.
For channel-wise fusion (video-SALMONN), where audio and visual features occupy separate channels of the same token positions, the same weighted combination $\tilde{H}_i = \alpha\, H^{(v)}_i + (1 - \alpha)\, H^{(a)}_i$ is applied at each audio-visual token position, and text-token states are merged with the same weight. A minimal code sketch of the token-wise procedure follows.
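The following PyTorch-style function is a minimal sketch of the token-wise fork-merge pass described above, using a fixed fusion weight. The decoder interface (a list of layer callables plus an `lm_head`) and all names are illustrative assumptions, not the released implementation.

```python
import torch

def fork_merge_step(layers, lm_head, video_emb, audio_emb, text_emb,
                    fork_depth=5, alpha=0.8):
    """One Fork-Merge Decoding forward pass (token-wise fusion), as a sketch.

    `layers` is assumed to be a list of decoder blocks mapping hidden states to
    hidden states; `lm_head` maps the final hidden state to vocabulary logits.
    Real AV-LLM decoders (e.g., VideoLLaMA2) expose a different interface.
    """
    # Fork: build two inputs, each zero-masking the complementary modality.
    x_video_only = torch.cat([video_emb, torch.zeros_like(audio_emb), text_emb], dim=1)
    x_audio_only = torch.cat([torch.zeros_like(video_emb), audio_emb, text_emb], dim=1)

    # Early (unimodal) reasoning through the first `fork_depth` layers.
    h_video, h_audio = x_video_only, x_audio_only
    for layer in layers[:fork_depth]:
        h_video = layer(h_video)   # H^(v)
        h_audio = layer(h_audio)   # H^(a)

    # Merge: weighted combination of the two branches at every token position.
    h = alpha * h_video + (1.0 - alpha) * h_audio

    # Joint (multimodal) reasoning through the remaining layers.
    for layer in layers[fork_depth:]:
        h = layer(h)

    # Next-token logits from the last position.
    return lm_head(h[:, -1])
```

In practice the merge is repeated within each generation step, and an attention-guided $\alpha$ (Section 3) can replace the fixed value.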
3. Mathematical Formalization
Let $X = [X_v; X_a; X_t]$ denote the concatenated input. The masked versions are $X^{(v)} = [X_v; \mathbf{0}; X_t]$ and $X^{(a)} = [\mathbf{0}; X_a; X_t]$.
Early-layer outputs:
- $H^{(v)} = \phi_{1:k}(X^{(v)})$
- $H^{(a)} = \phi_{1:k}(X^{(a)})$
Attention-guided fusion (Eqns. 5–7 in the original paper):
- $a_v$: aggregate attention assigned to the video tokens at fork layer $k$ of the video-preserving branch
- $a_a$: aggregate attention assigned to the audio tokens at fork layer $k$ of the audio-preserving branch
- $\alpha = \frac{a_v}{a_v + a_a}$
Merge equation (applied at every token position $i$, covering video, audio, and text tokens in the token-wise case):
- $\tilde{H}_i = \alpha\, H^{(v)}_i + (1 - \alpha)\, H^{(a)}_i$
Final output:
- $\phi_{k+1:L}(\tilde{H})$ yields the final-layer logits for next-token prediction.
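As a companion to the formalization above, the snippet below sketches the attention-guided weight computation under the reconstruction given here: the fusion weight is the video branch's attention mass on its preserved modality, normalized against the audio branch's. The tensor layout (head-averaged attention, query × key) and the pooling over text-query rows are assumptions for illustration.

```python
import torch

def attention_guided_alpha(attn_video_branch, attn_audio_branch,
                           video_idx, audio_idx, text_idx):
    """Estimate the fusion weight alpha from fork-layer attention maps (sketch).

    attn_*_branch: [seq_len, seq_len] attention at layer k, averaged over heads.
    *_idx: 1-D LongTensors holding the token positions of each modality / the text query.
    """
    # Attention mass the text (query) tokens place on the preserved modality of each branch.
    a_v = attn_video_branch[text_idx][:, video_idx].sum()
    a_a = attn_audio_branch[text_idx][:, audio_idx].sum()
    # Normalize so the two branch weights sum to one; alpha weights the video branch.
    return (a_v / (a_v + a_a)).item()
```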
4. Hyperparameterization and Ablation Findings
FMD depends primarily on two hyperparameters: the fork depth $k$ and the fusion weight $\alpha$. Their empirical tuning determines the balance between unimodal reasoning and cross-modal integration.
- Fork depth ($k$): For VideoLLaMA2, $k = 5$ was optimal; for video-SALMONN, $k = 8$ yielded the best results. Merging deeper (larger $k$) increases attention to audio but can reduce video-task performance, while merging too early leaves too little isolated unimodal reasoning before fusion.
- Fusion weight ($\alpha$): Set to 0.8, derived as a rounded mean of the attention-guided weight over 100 AVHBench samples. Ablations indicate that exclusion strategies ($\alpha = 1$ or $\alpha = 0$, i.e., keeping only one branch) provide modest gains or none and uniform averaging ($\alpha = 0.5$) decreases performance, while attention-guided fusion consistently yields improvements (see the sketch following the table below).
- Fusion type: Both token-wise (VideoLLaMA2) and channel-wise (video-SALMONN) fusion are supported. Optimal strategy depends on the model architecture.
Table: Summary of Key Hyperparameter Choices and Effects
| Hyperparameter | Best Value (VideoLLaMA2) | Best Value (video-SALMONN) |
|---|---|---|
| Fork Depth ($k$) | 5 | 8 |
| Fusion Weight ($\alpha$) | 0.8 (attention-guided) | 0.8 (attention-guided) |
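For concreteness, here is a small helper mapping the ablated fusion strategies above to the merge weight $\alpha$ used in the combination rule; the strategy names and the helper itself are illustrative assumptions for exposition, not from the paper.

```python
from typing import Optional

def fusion_weight(strategy: str, attention_alpha: Optional[float] = None) -> float:
    """Return the merge weight alpha for each ablated fusion strategy (sketch)."""
    if strategy == "video_branch_only":   # exclusion: keep only the video-preserving branch
        return 1.0
    if strategy == "audio_branch_only":   # exclusion: keep only the audio-preserving branch
        return 0.0
    if strategy == "uniform_average":     # uniform averaging of the two branches
        return 0.5
    if strategy == "attention_guided":    # alpha estimated from fork-layer attention (~0.8 on AVHBench)
        return attention_alpha if attention_alpha is not None else 0.8
    raise ValueError(f"unknown strategy: {strategy}")
```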
5. Quantitative Performance and Comparative Results
FMD consistently improves accuracy and balance across multiple task categories. The following summarizes the reported improvements on benchmark datasets using VideoLLaMA2 (token-wise fusion) and video-SALMONN (channel-wise fusion):
| Task | VideoLLaMA2 Vanilla | VideoLLaMA2 FMD | video-SALMONN Vanilla | video-SALMONN FMD |
|---|---|---|---|---|
| AVQA | 82.46 | 82.74 (+0.28) | 36.36 | 36.89 (+0.53) |
| MUSIC-AVQA | 81.30 | 81.57 (+0.27) | 44.48 | 44.78 (+0.30) |
| A→V | 80.02 | 80.34 (+0.32) | 68.69 | 71.05 (+2.36) |
| V→A | 77.03 | 77.70 (+0.67) | 62.39 | 65.31 (+2.92) |
| AV-match | 57.75 | 58.89 (+1.14) | 49.46 | 51.49 (+2.03) |
| Captioning | 2.84 | 2.94 (+0.10) | 1.83 | 1.89 (+0.06) |
Relative to other decoding schemes (such as DoLa, VCD, SID), FMD with zero-out masking achieves the strongest overall gains, particularly on cross-modal transfer tasks (e.g., A→V, V→A). An input-perturbation ablation shows that zero-masking (rather than additive noise) better isolates the modalities and yields superior performance (Jung et al., 27 May 2025). Inference efficiency is maintained, with FMD's generation speed comparable to or faster than many baseline schemes.
6. Layerwise Attention, Fusion Strategies, and Empirical Insights
Analysis of layerwise attention reveals that the point of merging (i.e., the fork depth $k$) modulates the model's modality attribution: deeper merges bias attention towards audio, while early merges favor video. FMD narrows the audio-video attention gap, promoting balanced feature exploitation. Attention-guided fusion outperforms both hard exclusion and uniform averaging approaches, as demonstrated empirically.
Ablation across merge layers indicates that $k = 5$ in VideoLLaMA2 achieves the best trade-off between unimodal and multimodal performance. FMD offers robust plug-and-play capability, yielding consistent accuracy improvements across varying model, task, and input settings without adverse effects on generation speed or architectural complexity (Jung et al., 27 May 2025).
7. Applications and Implications
FMD is applicable wherever balanced utilization of audio and visual cues is critical, such as audio-visual question answering, cross-modal reasoning, and AV-matching tasks in video-language understanding. Its inference-only, training-free nature makes it especially attractive for deployment settings where retraining or architectural modification is impractical. The approach generalizes across AV-LLMs with both token-wise and channel-wise fusion pipelines and offers empirically validated improvements on standard benchmarks. A plausible implication is that similar fork-merge paradigms could extend to other multimodal domains, mitigating bias and enhancing the interpretability of large-scale fusion models (Jung et al., 27 May 2025).