SoundMind: RL-Incentivized Logic Reasoning for Audio-Language Models (2506.12935v1)

Published 15 Jun 2025 in cs.CL, cs.MM, cs.SD, and eess.AS

Abstract: While LLMs have shown reasoning capabilities, their application to the audio modality, particularly in large audio-language models (ALMs), remains significantly underdeveloped. Addressing this gap requires a systematic approach, involving a capable base model, high-quality reasoning-oriented audio data, and effective training algorithms. In this study, we present a comprehensive solution: we introduce the Audio Logical Reasoning (ALR) dataset, consisting of 6,446 text-audio annotated samples specifically designed for complex reasoning tasks. Building on this resource, we propose SoundMind, a rule-based reinforcement learning (RL) algorithm tailored to endow ALMs with deep bimodal reasoning abilities. By training Qwen2.5-Omni-7B on the ALR dataset using SoundMind, our approach achieves state-of-the-art performance in audio logical reasoning. This work highlights the impact of combining high-quality, reasoning-focused datasets with specialized RL techniques, advancing the frontier of auditory intelligence in LLMs. Our code and the proposed dataset are available at https://github.com/xid32/SoundMind.

Summary

  • The paper introduces a reinforcement learning framework and novel dual-modality dataset to boost logical reasoning in audio-language models.
  • It leverages the ALR dataset with 6,446 samples and over 1,000 hours of audio to generate coherent, step-by-step reasoning in both text and speech.
  • The SoundMind algorithm uses rule-based rewards with REINFORCE++ optimization to achieve a +3.81% accuracy improvement in audio-to-text reasoning tasks.

Large audio-language models (ALMs) currently lag behind LLMs and vision-language models (VLMs) in their reasoning capabilities, particularly in generating coherent step-by-step reasoning sequences for audio inputs. This gap is attributed to the limited availability of high-quality audio datasets designed for complex reasoning tasks and to the technical challenge of maintaining reasoning consistency during audio generation. The paper "SoundMind: RL-Incentivized Logic Reasoning for Audio-Language Models" (2506.12935) addresses this by introducing a new dataset and a tailored reinforcement learning (RL) algorithm.

The proposed solution consists of two key components:

  1. The Audio Logical Reasoning (ALR) Dataset: A novel, dual-modality dataset designed specifically for training and evaluating ALMs on logical reasoning tasks. It contains 6,446 samples derived from the LogiQA 2.0-NLI dataset (Grubišić-Čabo et al., 2023), enriched with both textual and corresponding audio annotations. Each sample includes:
    • User content: A natural language prompt containing a logical triplet (major premise, minor premise, conclusion), available in both text and synthesized speech.
    • Chain-of-Thought (CoT) reasoning: Step-by-step explanations generated by an LLM (DeepSeek-R1 (DeepSeek-AI et al., 22 Jan 2025)), provided in both text and synthesized speech.
    • Final answer: The logical conclusion ("entailed" or "not-entailed"), also in text and speech.
    This dual-modality annotation is critical for training ALMs that can understand audio inputs and generate reasoned responses in either text or audio format. The dataset includes approximately 1,074 hours of audio, with average reasoning-output durations of around 10 minutes, posing a significant challenge for long-form audio generation. The data generation pipeline involves colloquializing logical triplets, generating CoT reasoning and answers with an LLM, and synthesizing audio with a text-to-speech (TTS) model such as MegaTTS 3 (Jiang et al., 26 Feb 2025).
  2. The SoundMind Algorithm: A rule-based reinforcement learning framework tailored to improve the logical reasoning capabilities of ALMs using the ALR dataset. It builds upon principles from logic-focused RL methods (like Logic-RL (Xie et al., 20 Feb 2025)) and utilizes the REINFORCE++ algorithm (Hu et al., 4 Jan 2025) for optimization. The core idea is to define structured rewards that incentivize the model to produce logically correct, well-formatted, and appropriately detailed reasoning outputs across modalities. The reward function components include:
    • Answer Format Correctness: Rewards for correctly placing the "Answer:" token and the final answer within the last few tokens/audio segments of the response. Separate rewards ($S_{\text{format}}^{(1)}, S_{\text{format}}^{(2)}$) are defined for text and audio outputs.
    • Answer Correctness: Rewards for matching the ground truth logical conclusion ($S_{\text{answer}}$).
    • Reasoning Length Evaluation: Rewards based on the ratio of the generated text/audio length to the reference annotation length ($S_{\text{len}}^{(1)}, S_{\text{len}}^{(2)}$). This encourages the model to produce reasoning of comparable depth to the reference (LLM-generated) CoT.
    SoundMind fine-tunes a base ALM (Qwen2.5-Omni-7B (Xu et al., 26 Mar 2025) in this work) by optimizing its policy with REINFORCE++, a critic-free policy gradient method with PPO-style clipping and a token-level KL penalty that prevents divergence from a supervised fine-tuned (SFT) reference model. The total reward used as the training signal is a weighted sum of the components above.
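Written out, and assuming the weighting coefficients are assigned to the components in the order listed above (the exact assignment is an illustrative assumption here), the composite reward takes the form:

$$S_{\text{total}} = \lambda_1 S_{\text{format}}^{(1)} + \lambda_2 S_{\text{format}}^{(2)} + \lambda_3 S_{\text{answer}} + \lambda_4 S_{\text{len}}^{(1)} + \lambda_5 S_{\text{len}}^{(2)}$$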

Practical Implementation and Applications:

Implementing SoundMind involves several practical steps:

  1. Data Preparation:
    • Acquire or generate the ALR dataset following the described pipeline. This requires access to a source of logical reasoning problems (like LogiQA 2.0-NLI), a powerful LLM for CoT generation, and a high-quality TTS model for synthesizing audio.
    • Ensure proper alignment between text and audio segments in the dataset.
  2. Base Model Selection and Setup:
    • Choose a capable base ALM that supports processing audio inputs and generating either text or audio outputs (e.g., Qwen2.5-Omni-7B).
    • Optionally, perform supervised fine-tuning (SFT) on the text-only version of the ALR dataset or a similar dataset to obtain a warm start and a reference model for the KL penalty in RL.
  3. Reward Function Implementation:
    • Translate the rule-based reward components ($S_{\text{format}}$, $S_{\text{answer}}$, $S_{\text{len}}$) into code; a minimal sketch is given after this list. This involves parsing the model's generated output (both text and audio/transcripts) to check the format, extract the answer, and measure length.
    • Define the weighting coefficients ($\lambda_1, \dots, \lambda_5$) for the reward components. These may require tuning based on the specific task and desired behavior.
  4. RL Training Setup:
    • Implement the REINFORCE++ optimization loop (a simplified sketch of a single update step is also given after this list). This involves:
      • Sampling sequences from the current policy ($\pi_\theta$).
      • Calculating the composite reward for each generated sequence.
      • Computing the token-level KL divergence against the SFT reference model ($\pi_{\mathrm{SFT}}$).
      • Calculating advantages using the reward and KL penalty.
      • Normalizing advantages.
      • Computing the clipped surrogate loss.
      • Performing policy updates using gradients.
    • Set other RL hyperparameters like the KL penalty coefficient ($\beta$) and the clipping parameter ($\epsilon$).
    • Parallelize training across multiple GPUs (as done in the paper with H800s) due to the computational intensity of training large ALMs and generating long sequences.
  5. Evaluation:
    • Evaluate the trained model on the ALR test set across the three modality settings (A2T, T2A, A2A).
    • Metrics include accuracy for logical correctness and Word Error Rate (WER) for audio output quality (in T2A and A2A).
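For the reward function implementation (step 3), the following is a minimal, text-modality-only sketch of how the rule-based components might be scored. The reward magnitudes, the tail-window size, the length shaping, and the function names are illustrative assumptions rather than the paper's exact rules, and the audio-side rewards (scored analogously, e.g. on transcripts and durations) are omitted.

```python
# Illustrative rule-based reward components (text side only); values and
# thresholds are assumptions for illustration, not SoundMind's exact rules.

def format_reward(output_text: str, tail_tokens: int = 20) -> float:
    """S_format: reward if 'Answer:' and the final label appear near the end."""
    tail = " ".join(output_text.split()[-tail_tokens:]).lower()
    has_marker = "answer:" in tail
    has_label = ("not-entailed" in tail) or ("entailed" in tail)
    return 1.0 if (has_marker and has_label) else 0.0

def answer_reward(output_text: str, gold: str) -> float:
    """S_answer: reward if the predicted label matches the ground-truth conclusion."""
    text = output_text.lower()
    if "not-entailed" in text:
        pred = "not-entailed"
    elif "entailed" in text:
        pred = "entailed"
    else:
        pred = None
    return 1.0 if pred == gold.lower() else 0.0

def length_reward(output_text: str, reference_text: str) -> float:
    """S_len: ratio of generated length to the reference CoT length, capped at 1."""
    ratio = len(output_text.split()) / max(1, len(reference_text.split()))
    return min(ratio, 1.0)

def total_reward(output_text, gold, reference_text, lambdas=(1.0, 1.0, 1.0)):
    """Weighted sum of the text-side components (the full method uses five
    coefficients spanning both text and audio outputs)."""
    l_fmt, l_ans, l_len = lambdas
    return (l_fmt * format_reward(output_text)
            + l_ans * answer_reward(output_text, gold)
            + l_len * length_reward(output_text, reference_text))
```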
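For the RL training setup (step 4), this is a simplified sketch of what a single REINFORCE++-style update step might look like in PyTorch, assuming per-token log-probabilities of the sampled sequences have already been gathered under the current policy, the behavior policy, and the frozen SFT reference. The tensor shapes, the sequence-level handling of the KL penalty, and the hyperparameter values are illustrative assumptions, not the paper's exact implementation.

```python
import torch

def reinforce_pp_step(policy_logprobs, old_logprobs, sft_logprobs, rewards,
                      beta=0.01, eps=0.2):
    """One simplified REINFORCE++-style update on a batch of sampled sequences.

    policy_logprobs, old_logprobs, sft_logprobs: [batch, seq_len] log-probs of the
    sampled tokens under the current policy, the behavior policy, and the frozen
    SFT reference. rewards: [batch] composite rule-based rewards.
    beta and eps are placeholder values, not the paper's settings.
    """
    # Token-level KL penalty against the SFT reference, accumulated over the sequence.
    kl = (policy_logprobs.detach() - sft_logprobs).sum(dim=-1)   # [batch]
    # Critic-free return: rule-based reward minus the KL penalty.
    returns = rewards - beta * kl                                # [batch]
    # Normalize advantages across the batch (no value network).
    adv = (returns - returns.mean()) / (returns.std() + 1e-8)    # [batch]
    # PPO-style clipped surrogate loss with per-token importance ratios.
    ratio = torch.exp(policy_logprobs - old_logprobs)            # [batch, seq_len]
    adv_tok = adv.unsqueeze(-1)                                  # broadcast to tokens
    loss = -torch.min(ratio * adv_tok,
                      torch.clamp(ratio, 1 - eps, 1 + eps) * adv_tok).mean()
    return loss
```

In practice the loss would be backpropagated through the policy's log-probabilities and the policy updated with a standard optimizer, with gradient steps parallelized across GPUs as described above.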

The SoundMind approach demonstrates significant performance improvements (e.g., +3.81% accuracy on A2T) over the SFT baseline and other ALMs across audio-to-text, text-to-audio, and audio-to-audio logical reasoning tasks. This indicates that the RL framework effectively trains the ALM to internalize and apply logical rules even when the input and output are entirely in the audio domain.

Real-World Applications:

  • Intelligent Voice Assistants: Enable assistants to understand complex spoken requests requiring logical inference and provide detailed, reasoned spoken responses.
  • Audio Content Analysis: Develop systems that can listen to and analyze audio content (e.g., debates, lectures, interviews), extract logical arguments, and potentially summarize or explain the reasoning steps involved, either textually or through synthesized speech.
  • Educational Tools: Create interactive tools that can help users practice logical reasoning via spoken dialogue, providing step-by-step audio feedback on their thinking process.
  • Accessibility: Improve accessibility tools for individuals with visual impairments by enabling deeper auditory understanding and generation of complex information that requires logical processing.

Implementation Considerations and Trade-offs:

  • Computational Cost: Training ALMs with RL on large datasets like ALR is computationally expensive, requiring high-end GPUs and significant training time.
  • Data Quality: The performance heavily relies on the quality of the ALR dataset, including the accuracy of the CoT annotations and the naturalness of the synthesized audio. Errors in the training data can propagate.
  • Hyperparameter Tuning: Finding the right balance for reward weights (λ\lambda values) and RL hyperparameters (β\beta, ϵ\epsilon) is crucial for stable training and optimal performance.
  • WER Trade-off: While accuracy improves, the paper notes a moderate increase in WER for audio outputs. This suggests a potential trade-off between maximizing logical correctness and maintaining perfect speech fluency. Future work might explore ways to balance these objectives.
  • Generalization: While tested on logical reasoning, adapting SoundMind to other audio reasoning tasks would require developing task-specific datasets and potentially custom reward functions.

The release of the ALR dataset and the SoundMind code base provides valuable resources for researchers and practitioners aiming to build ALMs with advanced reasoning capabilities, paving the way for more intelligent and interactive audio-based AI systems.
