
Data-Balanced Curriculum Learning for Audio Question Answering (2507.06815v1)

Published 9 Jul 2025 in cs.SD and eess.AS

Abstract: Audio question answering (AQA) requires models to understand acoustic content and perform complex reasoning. Current models struggle with dataset imbalances and unstable training dynamics. This work combines curriculum learning with statistical data balancing to address these challenges. The method labels question difficulty using LLMs, then trains progressively from easy to hard examples. Statistical filtering removes overrepresented audio categories, and guided decoding constrains outputs to valid multiple-choice formats. Experiments on the DCASE 2025 training set and five additional public datasets show that data curation improves accuracy by 11.7% over baseline models, achieving 64.2% on the DCASE 2025 benchmark.

Summary

  • The paper introduces a multi-component framework combining curriculum learning, statistical data balancing, and guided decoding to boost Audio QA accuracy by 11.7 percentage points over baselines.
  • The methodology addresses dataset imbalance by filtering overrepresented audio categories and stabilizes training with curriculum-guided reinforcement learning.
  • The results highlight that robust data curation techniques can surpass complex algorithmic tweaks in enhancing model generalization for audio question answering.

Data-Balanced Curriculum Learning for Audio Question Answering

This paper addresses the persistent challenges in Audio Question Answering (AQA), specifically the issues of dataset imbalance and unstable training dynamics in large-scale audio-LLMs. The authors propose a method that integrates curriculum learning, statistical data balancing, and guided decoding within a hybrid supervised and reinforcement learning framework. The approach is evaluated on the DCASE 2025 Task 5 benchmark and five additional public datasets, demonstrating a substantial improvement in accuracy over baseline models.

Methodological Contributions

The core methodological innovations are as follows:

  1. Curriculum-Guided Reinforcement Learning: The model is trained progressively, starting with easy examples and gradually incorporating more difficult ones. Question difficulty is automatically labeled using a small LLM (Phi-4-mini-instruct), enabling a principled curriculum schedule. This approach is designed to stabilize the reward signals during reinforcement learning, particularly in the early training phases.
  2. Statistical Data Balancing: The authors introduce a statistical filtering mechanism to address severe class imbalances in audio datasets. By computing the mean and standard deviation of category counts, overrepresented categories (e.g., human sounds, mixed environments) are filtered out using a tunable threshold (θ). This yields a more uniform distribution of audio categories, which is shown to be critical for generalization, especially to rare acoustic events (a minimal sketch combining this filter with the curriculum ordering of item 1 appears after this list).
  3. Hybrid SFT-GRPO Training Pipeline: The training pipeline combines supervised fine-tuning (SFT) with Low-Rank Adaptation (LoRA) for efficient parameter updates, followed by Group Relative Policy Optimization (GRPO) for reinforcement learning. SFT provides a stable initialization, while GRPO optimizes for a composite reward function that includes both answer accuracy and output format validation.
  4. Guided Decoding: During GRPO, output generation is constrained using regular expressions compiled into finite state machines. This ensures that the model produces valid multiple-choice answers (A, B, C, D) and eliminates the need for post-processing.
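
To make the data-curation stage concrete, here is a minimal Python sketch of one plausible implementation. The record keys (category, difficulty), the helper name balance_and_order, and the choice to downsample capped categories rather than drop them outright are illustrative assumptions; the paper specifies only that categories whose counts exceed a mean-plus-θ·std threshold are filtered and that training proceeds from easy to hard examples.

```python
from collections import Counter
import statistics

def balance_and_order(examples, theta=0.7):
    """Sketch of the data-curation stage: cap overrepresented audio
    categories, then order the remaining examples easy-to-hard.

    `examples` is assumed to be a list of dicts with hypothetical keys
    'category' (audio class) and 'difficulty' (0 = easy .. 2 = hard,
    as labeled by a small LLM such as Phi-4-mini-instruct).
    """
    counts = Counter(ex["category"] for ex in examples)
    mean = statistics.mean(counts.values())
    std = statistics.stdev(counts.values())
    cap = mean + theta * std  # tunable frequency threshold

    kept, seen = [], Counter()
    for ex in examples:
        if seen[ex["category"]] < cap:  # keep until the category hits the cap
            kept.append(ex)
            seen[ex["category"]] += 1

    # Curriculum ordering: present easy questions first, hard ones later.
    kept.sort(key=lambda ex: ex["difficulty"])
    return kept
```

Sorting strictly by difficulty label is the simplest possible curriculum schedule; the paper's actual schedule introduces harder examples gradually over the course of training rather than as a one-shot sort.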

Experimental Results

The method is evaluated on the DCASE 2025 Task 5 benchmark, which includes three sub-tasks: bioacoustics QA, temporal soundscapes QA, and complex QA. The training set is augmented with five additional datasets (AVQA, ClothoAQA, CompA-Order, TACOS, AudSem), each contributing different types of audio-question pairs.

Key results include:

  • Accuracy Improvement: The proposed method achieves 64.2% accuracy on the DCASE 2025 benchmark, an 11.7-percentage-point improvement over the best baseline (Gemini-2.0-Flash at 52.5%).
  • Component Analysis: Ablation studies reveal that statistical diversity balancing (θ = 0.7) yields the largest single-model performance gain. Curriculum learning on easy samples provides modest improvements, while reinforcement learning (GRPO) alone does not outperform supervised fine-tuning unless combined with data curation.
  • Task Difficulty: Temporal and counting tasks (Part 2) remain the most challenging, with accuracies between 38% and 42%, indicating that temporal reasoning in audio-LLMs is still an open problem.
  • Reasoning Phase: Incorporating explicit reasoning steps (e.g., chain-of-thought) does not improve performance unless the training data is comprehensively annotated for reasoning, highlighting a limitation in current dataset resources.

Implications and Discussion

The findings underscore several important implications for the development and deployment of audio-LLMs:

  • Data Quality Over Algorithmic Complexity: The most significant performance gains are attributed to data curation—specifically, balancing category distributions—rather than to the choice of advanced training algorithms. This suggests that future progress in AQA may depend more on dataset engineering than on model architecture.
  • Generalization and Robustness: By mitigating overfitting to dominant categories, statistical balancing enhances the model's ability to generalize to rare and diverse acoustic events, which is essential for real-world deployment in heterogeneous environments.
  • Reinforcement Learning Limitations: The limited gains from GRPO, unless paired with curriculum and data balancing, highlight the need for more robust reward formulations and denser feedback mechanisms in audio-language tasks.
  • Guided Decoding as a Practical Tool: The use of regular-expression-based guided decoding is shown to be effective for enforcing output constraints in multiple-choice settings, with minimal computational overhead (an illustrative sketch follows this list).
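
As an illustration of how a single answer pattern can serve both roles, the sketch below checks format validity with a regular expression and combines it with answer accuracy into a composite reward, mirroring the GRPO reward described under Methodological Contributions. The pattern, the 0.2/0.8 weighting, and the composite_reward helper are assumptions for illustration only; the paper additionally compiles such expressions into finite state machines to constrain generation itself, which this sketch does not show.

```python
import re

# Hypothetical pattern: a valid output is exactly one choice letter.
VALID_ANSWER = re.compile(r"^\s*([ABCD])\s*$")

def composite_reward(generated: str, gold: str) -> float:
    """Composite GRPO-style reward: format validity plus answer accuracy.
    The 0.2/0.8 split between the two terms is an illustrative assumption."""
    match = VALID_ANSWER.match(generated)
    if match is None:
        return 0.0                 # malformed output earns no reward
    format_reward = 0.2            # output parses as a single choice letter
    accuracy = 0.8 if match.group(1) == gold else 0.0
    return format_reward + accuracy
```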

Future Directions

The results motivate several avenues for future research:

  • Learned Data Curation: Moving beyond predefined categories, future work could explore learned representations for data balancing, potentially leveraging unsupervised or self-supervised clustering of audio features.
  • Reward Engineering: Developing richer and more informative reward signals for reinforcement learning in AQA, possibly incorporating intermediate reasoning steps or human-in-the-loop feedback.
  • Dataset Expansion: Creating and annotating datasets with explicit reasoning steps to enable more effective training of models capable of chain-of-thought reasoning in the audio domain.
  • Transfer to Other Modalities: The principles of curriculum learning and statistical balancing may be transferable to other multimodal reasoning tasks, such as video question answering or cross-modal retrieval.

Conclusion

This work demonstrates that careful data curation—specifically, statistical balancing of audio categories—can yield substantial improvements in audio question answering, surpassing gains from more complex training algorithms. The integration of curriculum learning and guided decoding further stabilizes training and ensures output validity. These findings suggest that future advances in AQA will likely be driven by innovations in dataset construction and curation, with implications for the broader field of multimodal AI.
