
SoundMind-RL: Audio Logical Reasoning

Updated 17 November 2025
  • SoundMind is a methodology that augments large audio-language models using rule-based reinforcement learning to improve logical reasoning over audio and mixed audio-text inputs.
  • It models reasoning trace generation as a finite Markov Decision Process and employs a composite reward function to ensure answer correctness, proper formatting, and detailed reasoning.
  • Empirical evaluations demonstrate that RL fine-tuning atop supervised fine-tuning yields consistent 3–4 percentage point accuracy improvements across modalities, though word error rates for audio outputs rise relative to SFT alone.

SoundMind is a methodology for augmenting large audio-language models (LALMs) with robust logical reasoning capabilities over audio and mixed audio-text inputs. The approach is instantiated as SoundMind-RL, a rule-based reinforcement learning (RL) algorithm that uses a high-quality, reasoning-centric benchmark dataset, also named SoundMind, to fine-tune the Qwen2.5-Omni-7B model. Through its design and experimental validation, SoundMind addresses the underexplored domain of audio logical reasoning (ALR) and establishes reproducible procedures for incentivizing both the accuracy and the explanatory depth of generated audio-language outputs.

1. Markov Decision Process Formulation

At the core of SoundMind-RL is the modeling of reasoning trace generation as a finite Markov Decision Process (MDP). Each state $s_t$ at time $t$ is a tuple $(x,\, o_{<t})$, where $x$ is the fixed audio input (as learned embeddings) and $o_{<t}$ is the partial output sequence of tokens. The action space $A$ is the model’s output vocabulary, which varies between text tokens (for audio-to-text tasks) and audio tokens (for pure audio tasks).

The transition function $T(s_t, a_t) = s_{t+1}$ extends $o_{<t}$ by appending $a_t$. An episode ends when a special $\langle \text{EOS} \rangle$ token is produced or when a length constraint is met. The reward $R(s_{1:T}, a_{1:T})$ is deferred until trajectory termination and is determined by a sum of rule-based scoring functions:

$$R(x, y) = \sum_k \lambda_k S_k(x, y), \qquad y = o_{1:T}$$

The RL process thus optimizes the model to output sequences aligned with explicit, hand-specified reasoning desiderata.
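
For concreteness, the episode structure can be sketched as a sampler that builds $y = o_{1:T}$ token by token and defers the reward until termination. The sketch below is a minimal illustration in Python; the policy callable is a generic stand-in that returns a token-to-probability mapping, not the actual model interface.

import random
EOS = "<EOS>"
def rollout(policy, audio_embedding, max_len=512):
    """Sample one reasoning trace; the rule-based reward is applied only at the end."""
    output_tokens = []                                    # o_{<t}, grows one action per step
    for _ in range(max_len):                              # length constraint on the episode
        state = (audio_embedding, tuple(output_tokens))   # s_t = (x, o_{<t})
        probs = policy(state)                             # distribution over the vocabulary A
        tokens, weights = zip(*probs.items())
        action = random.choices(tokens, weights=weights)[0]
        output_tokens.append(action)                      # T(s_t, a_t) appends a_t
        if action == EOS:                                 # terminal condition
            break
    return output_tokens                                  # y = o_{1:T}; R(x, y) is computed afterwards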

2. Rule-Based Reward Structure

SoundMind-RL employs a composite reward function, defined as the weighted sum of five sub-scores relating to answer format, answer correctness, and reasoning length for both text and audio outputs. The reward components are:

  • Text answer-format correctness: $S_{\rm format}^{(1)} = \lambda_1$ if “Answer:” appears in the last five text tokens, else 0.
  • Audio answer-format correctness: $S_{\rm format}^{(2)} = \lambda_2$ if “Answer:” is verbalized near the end of the audio tokens, else 0.
  • Final-answer correctness: $S_{\rm answer} = \lambda_3$ if the model’s answer matches the ground truth, else 0.
  • Reasoning length (text): $S_{\rm len}^{(1)} = \lambda_4 \min(1,\, L_{\rm model} / L_{\rm anno})$, where $L_{\rm model}$ and $L_{\rm anno}$ are the lengths of the generated and reference reasoning, respectively.
  • Reasoning length (audio): $S_{\rm len}^{(2)} = \lambda_5 \min(1,\, T_{\rm model} / T_{\rm anno})$, comparing durations in time steps.

The weights are fixed: $\lambda_1 = 1.0$, $\lambda_2 = 0.5$, $\lambda_3 = 2.0$, $\lambda_4 = 1.0$, $\lambda_5 = 0.75$. This explicit weighting encourages not only correct answers but also the production of interpretable and appropriately detailed reasoning chains in both modalities.
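
For illustration, the composite reward can be expressed as a small scoring function (a minimal sketch, assuming the audio-format check has already been reduced to a boolean and that answers are compared by exact match; the paper’s precise matching rules may differ).

WEIGHTS = dict(text_format=1.0, audio_format=0.5, answer=2.0, text_len=1.0, audio_len=0.75)
def composite_reward(text_tokens, audio_answer_tag_ok, pred_answer, gold_answer,
                     text_len, text_len_ref, audio_steps, audio_steps_ref):
    """Weighted sum of the five rule-based sub-scores (lambda_1 ... lambda_5)."""
    reward = 0.0
    if "Answer:" in text_tokens[-5:]:        # S_format^(1): tag among the last five text tokens
        reward += WEIGHTS["text_format"]
    if audio_answer_tag_ok:                  # S_format^(2): audio-side format check (assumed boolean)
        reward += WEIGHTS["audio_format"]
    if pred_answer == gold_answer:           # S_answer: matches the ground truth
        reward += WEIGHTS["answer"]
    reward += WEIGHTS["text_len"] * min(1.0, text_len / max(text_len_ref, 1))         # S_len^(1)
    reward += WEIGHTS["audio_len"] * min(1.0, audio_steps / max(audio_steps_ref, 1))  # S_len^(2)
    return reward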

3. Fine-Tuning Pipeline

Model fine-tuning proceeds in two distinct phases:

1. Supervised Fine-Tuning (SFT):

Qwen2.5-Omni-7B is trained on the SoundMind dataset via cross-entropy minimization:

$$L_{\rm SFT}(\theta) = -\,\mathbb{E}_{(x,y)\sim\mathcal{D}} \sum_{t=1}^{T} \log \pi_\theta(y_t \mid y_{<t}, x)$$
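
In a typical implementation this is ordinary teacher-forced cross-entropy over the reference reasoning and answer tokens. The following PyTorch sketch illustrates that assumption; it is not the released training code.

import torch
import torch.nn.functional as F
def sft_loss(logits, target_ids, pad_id=0):
    """Teacher-forced cross-entropy: mean of -log pi_theta(y_t | y_<t, x) over non-pad positions.
    logits:     [batch, seq_len, vocab] next-token scores.
    target_ids: [batch, seq_len] reference tokens y_1..y_T (pad_id where unused).
    """
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),   # flatten all token positions
        target_ids.reshape(-1),                # matching target tokens
        ignore_index=pad_id,                   # padding does not contribute to the loss
    )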

2. Reinforcement Learning via REINFORCE++:

A frozen SFT model acts as a “reference policy,” and the trainable model is updated with a clipped policy-gradient objective augmented by token-wise KL regularization with respect to that reference policy. The per-token importance weight is $r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\rm old}}(a_t \mid s_t)}$, and the advantage is:

$$A_t = R(x, y) - \beta \sum_{i=t}^{T} \mathrm{KL}_i$$

with

$$\mathrm{KL}_i = \log \frac{\pi_\theta(a_i \mid s_i)}{\pi_{\rm SFT}(a_i \mid s_i)}$$

Normalized advantages $\hat{A}_t$ are used in the clipped surrogate objective:

$$\mathcal{J}_{\rm RL}(\theta) = \mathbb{E}\left[ \min\left( r_t(\theta)\, \hat{A}_t,\ \mathrm{clip}\big(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\, \hat{A}_t \right) \right]$$

The loss $L_{\rm RL} = -\mathcal{J}_{\rm RL}$ is optimized, occasionally with an additional entropy bonus. Training follows the standard hyperparameter settings of “Logic-RL.”
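
The objective can be restated compactly in code. The PyTorch sketch below handles a single sampled trajectory, with advantage normalization over that trajectory and illustrative values for $\beta$ and $\epsilon$; it is a minimal rendering of the formulas above, not the authors’ implementation.

import torch
def reinforce_pp_loss(logp_new, logp_old, logp_ref, reward, beta=0.01, eps=0.2):
    """Clipped surrogate with token-wise KL-adjusted advantages.
    logp_new, logp_old, logp_ref: [T] log-probs of the sampled tokens under the current
    policy, the rollout policy, and the frozen SFT reference policy. reward: scalar R(x, y).
    """
    kl = logp_new - logp_ref                                           # KL_i per token
    kl_suffix = torch.flip(torch.cumsum(torch.flip(kl, [0]), 0), [0])  # sum_{i>=t} KL_i
    adv = (reward - beta * kl_suffix).detach()                         # A_t, treated as a constant
    adv_hat = (adv - adv.mean()) / (adv.std() + 1e-8)                  # normalized advantages
    ratio = torch.exp(logp_new - logp_old)                             # r_t(theta)
    surrogate = torch.minimum(ratio * adv_hat,
                              torch.clamp(ratio, 1 - eps, 1 + eps) * adv_hat)
    return -surrogate.mean()                                           # L_RL = -J_RL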

The following pseudocode summarizes the full procedure:

1. Initialize θ ← pretrained Qwen2.5-Omni-7B.
2. Supervised fine-tune on SoundMind: for N₁ steps, minimize L_SFT.
3. Copy θ_old ← θ and store π_SFT = π_{θ_old}.
4. For N₂ RL steps:
   - Sample a minibatch {x_k}.
   - For each x_k, generate y_k ∼ π_{θ_old}.
   - Compute R(x_k, y_k).
   - Compute per-token KL and advantages A_t^k.
   - Compute L_RL.
   - θ ← θ − η ∇_θ L_RL.
   - Periodically set θ_old ← θ.

4. Prompting and Model Architecture Considerations

The core Qwen2.5-Omni-7B transformer architecture is unmodified. Audio encoding and generation leverage its native speech-token interfaces. During both supervised and RL training, a fixed system prompt is prepended to all inputs, guiding the model to:

  • Produce chain-of-thought reasoning in natural language or speech;
  • Conclude with “Answer: <entailed/not-entailed>”;
  • Avoid LaTeX or markdown formatting.

Additional brief prompt fragments are injected before the premises and after the conclusion, reinforcing the two-choice format for entailment decisions. This explicit prompt engineering has empirically improved model adherence to the desired output format and reasoning style.
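
As an illustration only, a prompt assembly consistent with these constraints might look as follows; the exact system-prompt wording and fragment text are not reproduced here, so the strings below are hypothetical placeholders.

SYSTEM_PROMPT = (
    "Reason step by step about whether the conclusion follows from the premises. "
    "Do not use LaTeX or markdown formatting. "
    "Finish with 'Answer: entailed' or 'Answer: not-entailed'."
)
def build_messages(premises, conclusion):
    """Wrap the premises and conclusion with brief framing fragments (placeholder wording)."""
    user_text = (
        "The premises follow.\n" + premises +        # fragment injected before the premises
        "\nConclusion: " + conclusion +
        "\nDecide: entailed or not-entailed."        # fragment injected after the conclusion
    )
    return [{"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": user_text}]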

5. Experimental Outcomes

Evaluation on the SoundMind benchmark yields the following test-set accuracies (and Word Error Rates, WERs, for audio outputs):

Modality                 SFT Only               SoundMind-RL
Audio→Text reasoning     77.59%                 81.40%
Text→Audio reasoning     80.79% (WER 2.18%)     83.84% (WER 6.99%)
Audio→Audio reasoning    77.59% (WER 2.23%)     81.40% (WER 8.95%)

Ablation studies indicate that the removal of any single reward subcomponent results in substantial accuracy degradation (e.g., accuracy falls to 70.82% without audio rewards, 48.84% without text rewards, and 60.24% without the answer reward in audio→text tasks). Full reward integration achieves 81.40% accuracy. This suggests each term is independently critical to performance, and that RL fine-tuning atop SFT yields a consistent approximate 3–4 percentage point improvement across modalities.

6. Dataset and Resource Availability

The SoundMind dataset comprises 6,446 audio-text annotated samples engineered for high-complexity reasoning tasks. The dataset, codebase, and model checkpoints are hosted at https://github.com/xid32/SoundMind. The dataset serves both to advance ALR research and to facilitate fair benchmarking of future models and algorithms in audio-language reasoning tasks. The design explicitly fills a prior gap in reasoning-oriented multimodal datasets for the audio domain, enabling direct evaluation of complex inference abilities.

7. Significance and Research Context

SoundMind demonstrates that high-quality, reasoning-centric datasets combined with explicit, rule-based RL can significantly improve the logic reasoning capabilities of LALMs without modifying their underlying architectures. The algorithmic contributions and empirical results indicate that RL-incentivized logic can close performance gaps left by purely supervised fine-tuning. This framework broadens the operational scope of contemporary LALMs and supplies a reproducible blueprint for future research on auditory intelligence in language modeling, particularly concerning modalities and tasks underrepresented in standard training corpora.

A plausible implication is that explicit reward-shaping, as operationalized in SoundMind-RL, may generalize to other multimodal reasoning applications where format, correctness, and reasoning structure are critical to end-task performance.
