RL-CoMM: Reinforcement Learning for Multi-Modal Reasoning
- The paper demonstrates that RL-CoMM leverages step-wise reasoning rewards to effectively resolve audio-muted and audio-modified confusions in multimodal QA tasks.
- Its two-stage pipeline integrates a Step-RR phase for aligning audio and visual reasoning with an Ans-CO phase to optimize answer confidence.
- Benchmark results reveal significant accuracy improvements across AVQA and related datasets, highlighting its potential to mitigate visual dominance in MLLMs.
RL-CoMM (Reinforcement Learning–based Collaborative Multi-MLLM) is a reinforcement learning framework engineered to address modality conflict and reasoning bias in multimodal LLMs (MLLMs), particularly for scenarios where audio and visual information provide discordant cues. RL-CoMM was introduced to overcome the intrinsic visual dominance that conventional MLLMs exhibit when processing audio-visual question answering and hallucination tasks, such as identifying muted or tampered audio objects in multimodal video (Ye et al., 13 Nov 2025). The core innovation fuses reinforcement learning, external audio “reference” reasoning, and structured reward shaping to improve audio-visual correlation and answer confidence.
1. Motivation: Challenges in Audio-Visual Multimodal Reasoning
MLLMs like Qwen2.5-Omni and Gemini show a marked failure in “audio-visual confusion” settings, e.g., when a visually present object (such as a cello) is muted in the audio track. These models’ responses reflect a systematic bias toward the visual modality, often asserting the presence of a sound based on its visual depiction regardless of audio input. The two principal error types are:
- Audio-muted confusion: The model fails to recognize that a visually depicted object produces no sound if its audio channel is suppressed.
- Audio-modified confusion: The model neglects audio alterations (e.g., substituted sounds), again defaulting to visual priors.
This demonstrates the need for an architecture that explicitly calibrates reasoning across modalities and penalizes incoherence.
2. Framework Overview and Two-Stage Pipeline
RL-CoMM is implemented as a two-stage training and inference process, operating atop a strong MLLM base (Qwen2.5-Omni-3B), and guided by a “reference” audio-only reasoning agent—the Large Audio LLM (LALM, e.g., Qwen2-Audio).
Stage 1: Step-wise Reasoning Reward (Step-RR)
The policy model π_θ (MLLM) and reference π_ref (LALM) are used in tandem on each training instance:
- Reference audio thinking: π_ref processes the video/audio input and generates an audio-only reasoning segment <a-think>…</a-think>.
- Policy reasoning trace: π_θ performs multi-step generation covering both audio and visual thinking: <a-think>…</a-think><v-think>…</v-think><answer>…</answer>.
- Three reward signals are computed per rollout:
  - $R_{\mathrm{format}}$: checks the output's compliance with the required markup format.
  - $R_{\mathrm{ARR}}$ (Audio Reasoning Rationality): uses a semantic similarity function to align the MLLM's audio thinking with the reference audio reasoning.
  - $R_{\mathrm{AVC}}$ (Audio-Visual Correlation): assesses coherence between the model's audio and visual reasoning via an embedding-based scorer, with a bonus for correct answers.
Normalized advantages drive policy gradient updates using a GRPO-style on-policy update without KL penalty.
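A minimal sketch of how a structured rollout could be parsed and the three reward signals computed is shown below, under a simplified, answer-based reading of $R_{\mathrm{AVC}}$. The helper names, the generic `embed` callable (e.g., a wrapper around Qwen3-Embedding), the threshold `tau`, and the `partial` credit value are illustrative assumptions, not the paper's code.

```python
# Hypothetical sketch of the Step-RR reward signals; names and default values
# are illustrative assumptions, not the paper's implementation.
import re
import numpy as np

TAG_PATTERN = re.compile(
    r"<a-think>(.*?)</a-think>\s*<v-think>(.*?)</v-think>\s*<answer>(.*?)</answer>",
    re.DOTALL,
)

def parse_rollout(text):
    """Return the a-think / v-think / answer spans if the markup is well formed, else None."""
    m = TAG_PATTERN.search(text)
    if m is None:
        return None
    return {"a_think": m.group(1), "v_think": m.group(2), "answer": m.group(3)}

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-8))

def step_rr_rewards(rollout_text, ref_audio_think, gold_answer, embed, tau, partial=0.5):
    """Compute (R_format, R_ARR, R_AVC) for one rollout.

    `embed` maps a string to a vector (e.g., wrapping Qwen3-Embedding);
    `tau` is the semantic-similarity threshold; `partial` is an assumed
    credit for a non-null but incorrect answer.
    """
    parsed = parse_rollout(rollout_text)
    r_format = 1.0 if parsed is not None else 0.0
    if parsed is None:
        return r_format, 0.0, 0.0

    correct = parsed["answer"].strip().lower() == gold_answer.strip().lower()

    # Audio Reasoning Rationality: align the policy's audio thinking with the LALM reference.
    sim_ref = cosine(embed(parsed["a_think"]), embed(ref_audio_think))
    r_arr = 1.0 if (sim_ref >= tau and correct) else 0.0

    # Audio-Visual Correlation (simplified): full credit for a correct answer,
    # assumed partial credit for a non-null but incorrect answer.
    if correct:
        r_avc = 1.0
    elif parsed["answer"].strip():
        r_avc = partial
    else:
        r_avc = 0.0

    return r_format, r_arr, r_avc
```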
Stage 2: Answer-centered Confidence Optimization (Ans-CO)
The model’s generated answers are further optimized to reduce uncertainty:
- Confidence is defined as $p_\theta(a \mid x)$, the probability the policy assigns to the answer span $a$ given the multimodal input $x$.
- The loss jointly minimizes the negative log-likelihood of the answer tokens and an entropy penalty (with weight $\lambda$) on those tokens, discouraging diffuse probability mass and encouraging confident predictions.
This stage alleviates residual uncertainty caused by modality transfer discrepancies between the audio reference and the MLLM.
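As an illustration, the sketch below combines answer-token negative log-likelihood with an entropy penalty over the same tokens. The function name `ans_co_loss`, the tensor layout, and the default weight `lam` are assumptions rather than the paper's implementation.

```python
# Illustrative sketch of the Ans-CO objective: answer-token NLL plus an
# entropy penalty weighted by `lam` (assumed default; the paper's value is not reproduced).
import torch
import torch.nn.functional as F

def ans_co_loss(logits, answer_token_ids, answer_mask, lam=0.1):
    """logits: [T, V] token logits; answer_token_ids: [T] target ids;
    answer_mask: [T] bool, True on positions belonging to the answer span."""
    log_probs = F.log_softmax(logits, dim=-1)                                # [T, V]
    nll = -log_probs.gather(-1, answer_token_ids.unsqueeze(-1)).squeeze(-1)  # [T]

    probs = log_probs.exp()
    entropy = -(probs * log_probs).sum(dim=-1)                               # [T] per-token entropy

    mask = answer_mask.float()
    n = mask.sum().clamp(min=1.0)
    # Maximize answer likelihood while penalizing diffuse probability mass.
    return ((nll + lam * entropy) * mask).sum() / n
```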
3. Step-wise Reasoning Reward Design
For a group of $G$ trajectories $\{o_1,\dots,o_G\}$ produced by the policy model π_θ on query $q$, the per-rollout reward aggregates the three signals,

$$R_i = R_{\mathrm{format}}(o_i) + R_{\mathrm{ARR}}(o_i) + R_{\mathrm{AVC}}(o_i),$$

where
- $R_{\mathrm{format}}(o_i) = 1$ if the generated output matches the <a-think>, <v-think>, and <answer> tag structure; otherwise $0$.
- $R_{\mathrm{ARR}}(o_i) = 1$ if the semantic similarity between the model's audio thinking and the reference exceeds the threshold $\tau$ and the predicted answer is correct; otherwise $0$.
- $R_{\mathrm{AVC}}(o_i) = 1$ if the answer is correct, an intermediate reward if the answer is non-null but incorrect, and $0$ otherwise.

Advantages are standardized within each group,

$$\hat{A}_i = \frac{R_i - \operatorname{mean}\bigl(\{R_j\}_{j=1}^{G}\bigr)}{\operatorname{std}\bigl(\{R_j\}_{j=1}^{G}\bigr)},$$

and used for standard policy gradient updates. This configuration encourages stepwise agreement with both the LALM guidance and internal A/V consistency.
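The group-standardized advantage and a KL-free policy-gradient surrogate can be sketched as follows. This is an assumed REINFORCE-style reading of the "GRPO-style on-policy update" described above, with hypothetical function names.

```python
# Sketch of group-standardized advantages and a KL-free policy-gradient update,
# assuming per-rollout total rewards and summed log-probs are available.
import torch

def group_advantages(rewards, eps=1e-8):
    """rewards: [G] total reward per rollout in the group."""
    return (rewards - rewards.mean()) / (rewards.std(unbiased=False) + eps)

def policy_gradient_loss(rollout_logprobs, rewards):
    """rollout_logprobs: [G] sum of token log-probs of each rollout under pi_theta."""
    adv = group_advantages(rewards).detach()
    # On-policy surrogate; no KL penalty, per the description above.
    return -(adv * rollout_logprobs).mean()
```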
4. Training Protocol, Data, and Infrastructure
The training pipeline is staged:
- Warm-up SFT (Supervised Fine-Tuning): Fine-tune Qwen2.5-Omni-3B on ~100 high-quality multimodal QA pairs using LoRA.
- Step-RR RL:
  - RL on ~1,000 few-shot multimodal QA samples (from Music-AVQA, AVQA).
  - Rollout group size $G$ per query.
  - AdamW optimizer; RL-specific learning rate; semantic-similarity reward threshold $\tau$.
  - Iterations: until convergence.
- Ans-CO: Further optimize for answer confidence on the same data; entropy weight $\lambda$, $1,000$ optimization steps.
Infrastructure: 8×NVIDIA A800 GPUs, batch size 1 per GPU. Qwen3-Embedding is used for all semantic similarity and coherence scoring.
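For reference, the staged protocol can be collected into a single configuration sketch; values the text does not state (learning rates, group size $G$, threshold $\tau$, entropy weight $\lambda$) are left as `None` placeholders rather than guessed.

```python
# Consolidated configuration sketch of the staged protocol. `None` marks values
# not reproduced in the text above; set them before use.
RL_COMM_CONFIG = {
    "base_model": "Qwen2.5-Omni-3B",
    "reference_lalm": "Qwen2-Audio",
    "scorer": "Qwen3-Embedding",
    "sft": {"num_pairs": 100, "method": "LoRA", "lr": None},
    "step_rr": {
        "num_samples": 1000,        # few-shot QA from Music-AVQA / AVQA
        "group_size": None,         # rollout group size G
        "optimizer": "AdamW",
        "lr": None,
        "sim_threshold": None,      # tau in the Step-RR rewards
    },
    "ans_co": {"entropy_weight": None, "lr": None, "steps": 1000},
    "hardware": {"gpus": "8x NVIDIA A800", "per_gpu_batch_size": 1},
}
```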
5. Evaluation Benchmarks and Quantitative Results
RL-CoMM was benchmarked on multiple datasets:
- AV-ConfuseBench: Focused on audio-muted and audio-modified conditions (173 QA pairs total).
- Music-AVQA, AVQA, Music-AVQA-R: Large-scale audio-visual QA with >200,000 QA pairs.
- AVHBench: Audio-visual hallucination detection, including audio/video-driven hallucination and matching.
Key accuracy improvements (all with limited training data):
| Benchmark | Baseline (Qwen2.5-Omni-3B) | RL-CoMM | Gain |
|---|---|---|---|
| AV-ConfuseBench (audio-muted) | 8.22% | 27.40% | +19.18 points |
| Music-AVQA (avg. accuracy) | 70.41% (SFT) | 79.46% | +9.05 points |
| AVQA (avg. accuracy) | 83.78% | 95.87% | +12.09 points |
| Music-AVQA-R (head accuracy) | 61.42% | 85.98% | +24.56 points |
| AVHBench (audio-driven halluc.) | 65.85% | 78.96% | +13.11 points |
Qualitatively, RL-CoMM systematically corrects classic failure cases by using the audio-only “a-think” step to override conflicting visual priors. For example, in muted-instrument videos, baseline responses assert hearing the instrument, while RL-CoMM acknowledges audio absence at the reasoning stage and produces a correct answer.
6. Algorithmic Implementation and Applicability
The full RL-CoMM algorithm—prompt templates, reference-model orchestration, step-wise multi-reward RL update, and answer-token entropy regularization—can be directly adapted to any MLLM architecture capable of multi-modal input and sequential output generation. Implementation requires:
- Multimodal input encoding (video, audio, question).
- Stepwise structured output: separate <a-think>, <v-think>, and <answer> blocks.
- Access to a Large Audio LLM for reference reasoning (offline or in-loop).
- Reward computation via pretrained embedding models for semantic and coherence scoring.
- A two-phase optimization loop as described above.
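Putting these requirements together, a high-level orchestration of the two phases might look like the following. Every callable here (`lalm_reason`, `policy_rollout`, `reward_fn`, `rl_update`, `ans_co_step`) is a placeholder for the components listed above, not an actual API.

```python
# High-level sketch of the two-phase RL-CoMM loop described in this section.
def train_rl_comm(dataset, lalm_reason, policy_rollout, reward_fn,
                  rl_update, ans_co_step, group_size, ans_co_steps=1000):
    # Phase 1: Step-wise Reasoning Reward (Step-RR).
    for sample in dataset:
        ref_a_think = lalm_reason(sample["audio"])             # reference audio-only reasoning
        rollouts = [policy_rollout(sample) for _ in range(group_size)]
        rewards = [reward_fn(r, ref_a_think, sample["answer"]) for r in rollouts]
        rl_update(rollouts, rewards)                           # group-normalized policy gradient

    # Phase 2: Answer-centered Confidence Optimization (Ans-CO).
    for step in range(ans_co_steps):
        ans_co_step(dataset[step % len(dataset)])              # NLL + entropy penalty on answer tokens
```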
Primary computational costs arise from multi-rollout RL (a group of $G$ rollouts launched per sample), LALM inference, and embedding-based scorer evaluations. Real-world application is feasible with GPU clusters (as demonstrated with 8×A800 GPUs).
7. Significance and Implications
RL-CoMM provides an explicit mechanism to calibrate multimodal reasoning under modality clash, quantitatively reducing visual dominance and improving answer reliability. Its reward structure flexibly generalizes to other conflict-resolution scenarios across modalities. The use of external, unimodal experts as “reference modalities” is an extensible architectural principle for minimizing hallucination and bias in multimodal agents. A plausible implication is that similar RL-based collaborative schemes could be adapted for other forms of modality mismatch or sensor fusion beyond audio-visual QA.