- The paper introduces R1-Omni, employing a dual RL framework with RLVR and GRPO for explainable omni-multimodal emotion recognition.
- Experimental results on MAFW, DFEW, and RAVDESS datasets show R1-Omni outperforms baselines, achieving enhanced accuracy and interpretability.
- Key contributions include the first application of RLVR (with GRPO) to an omni-multimodal LLM for emotion recognition, a cold-start fine-tuning stage, and reward decomposition, yielding a model that is more robust and interpretable for applications such as HCI.
Methodologies
R1-Omni leverages a dual reinforcement learning framework to optimize an Omni-multimodal LLM for emotion recognition, integrating both video and audio streams. Central to this approach is Reinforcement Learning with Verifiable Rewards (RLVR), which replaces conventional preference-based reward models (as in RLHF) with a binary verifiable reward function that assigns a score of 1 or 0 depending on whether the predicted emotion matches the ground truth. The methodology further incorporates KL-divergence regularization to keep the learned policy close to the pre-trained HumanOmni-0.5B model. In tandem with RLVR, the framework employs Group Relative Policy Optimization (GRPO), a critic-free reinforcement learning scheme that samples multiple candidate responses per input and normalizes their reward signals within the group, sharpening the model's ability to discriminate response quality.
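To make the group-relative formulation concrete, the sketch below (not the authors' code) normalizes the verifiable rewards of a group of sampled responses into advantages and combines them with a KL penalty toward the frozen reference model. It is a simplified illustration: the published GRPO objective additionally uses PPO-style ratio clipping and a different KL estimator, and the tensor shapes and `kl_coef` value here are assumptions.

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Group-relative advantages: normalize the verifiable rewards of the
    G candidate responses sampled for the same input (no learned critic)."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def simplified_rlvr_loss(logp_policy: torch.Tensor,
                         logp_ref: torch.Tensor,
                         rewards: torch.Tensor,
                         kl_coef: float = 0.04) -> torch.Tensor:
    """REINFORCE-style surrogate with a per-token KL penalty that keeps the
    policy close to the frozen reference (assumed here to be HumanOmni-0.5B).

    logp_policy, logp_ref: log-probs of the sampled tokens, shape (G, T).
    rewards: binary verifiable rewards, shape (G,).
    """
    adv = grpo_advantages(rewards).unsqueeze(-1)     # (G, 1), broadcast over tokens
    kl = logp_policy - logp_ref                      # crude per-token KL estimate
    objective = adv * logp_policy - kl_coef * kl     # quantity to maximize
    return -objective.mean()                         # minimize the negative
```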
Another key aspect is the deployment of a Cold Start Strategy. This entails a preliminary fine-tuning phase using a modest dataset—232 samples from the Explainable Multimodal Emotion Reasoning (EMER) dataset and 348 manually annotated samples—to imbue the model with an initial set of reasoning capabilities, thus laying the foundation for subsequent reinforcement learning optimization.
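A minimal sketch of this cold-start stage is shown below: standard supervised fine-tuning on the small combined set (580 samples in total) so the model learns the reasoning-style output before RL begins. The batch size, optimizer, learning rate, and pre-tokenized dataset format are assumptions; the actual R1-Omni recipe, including how the multimodal inputs are fed, may differ.

```python
import torch
from torch.utils.data import DataLoader

def cold_start_sft(model, dataset, epochs: int = 2, lr: float = 2e-5, device: str = "cuda"):
    """Supervised fine-tuning sketch for the cold-start phase.

    Assumes each dataset item is a pre-tokenized dict with "input_ids" and
    "labels", and that the model follows the HuggingFace convention of
    returning an output object with a .loss when labels are provided.
    """
    loader = DataLoader(dataset, batch_size=1, shuffle=True)
    optim = torch.optim.AdamW(model.parameters(), lr=lr)
    model.train().to(device)
    for _ in range(epochs):
        for batch in loader:
            input_ids = batch["input_ids"].to(device)
            labels = batch["labels"].to(device)
            loss = model(input_ids=input_ids, labels=labels).loss
            loss.backward()
            optim.step()
            optim.zero_grad()
    return model
```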
Additionally, the reward function is decomposed into two components: an accuracy reward (R_acc) that scores whether the predicted emotion is correct, and a format reward (R_format) that enforces consistency with a specified structured output format (e.g., HTML-like <think> and <answer> tags). This decomposition ensures that both the correctness of the prediction and the interpretability of the output are optimized simultaneously.
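A minimal sketch of this decomposition is given below, assuming the <think>/<answer> output structure described above. The exact matching rules (case handling, whitespace, synonym treatment) used by the authors are not specified here and may differ.

```python
import re

def format_reward(output: str) -> float:
    """R_format: 1.0 if the response follows <think>...</think><answer>...</answer>, else 0.0."""
    pattern = r"\s*<think>.*?</think>\s*<answer>.*?</answer>\s*"
    return 1.0 if re.fullmatch(pattern, output, flags=re.DOTALL) else 0.0

def accuracy_reward(output: str, gold_label: str) -> float:
    """R_acc: 1.0 if the emotion inside <answer> matches the ground-truth label, else 0.0."""
    match = re.search(r"<answer>(.*?)</answer>", output, flags=re.DOTALL)
    if match is None:
        return 0.0
    return 1.0 if match.group(1).strip().lower() == gold_label.strip().lower() else 0.0

def total_reward(output: str, gold_label: str) -> float:
    """Verifiable reward used by RLVR: R = R_acc + R_format."""
    return accuracy_reward(output, gold_label) + format_reward(output)
```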
Experimental Evaluation
The empirical evaluation is conducted on several benchmark datasets for dynamic facial expression recognition and audio-visual emotion recognition. The principal datasets are:
- MAFW (Multi-modal Affective Database for Dynamic Facial Expression Recognition)
- DFEW (Dynamic Facial Expression in the Wild)
- RAVDESS (The Ryerson Audio-Visual Database of Emotional Speech and Song), used as an out-of-distribution test set.
In the experimental protocol, the model is evaluated under the Open-Vocabulary Emotion Testing (OV-emotion) protocol, with performance measured by Unweighted Average Recall (UAR) and Weighted Average Recall (WAR). Numerical results show that R1-Omni outperforms both the HumanOmni-0.5B baseline and supervised fine-tuned models (EMER-SFT and MAFW-DFEW-SFT), achieving higher UAR and WAR scores; its robustness on the RAVDESS dataset is particularly notable and indicates superior generalization to out-of-distribution data.
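For reference, the two metrics can be computed as in the generic sketch below (not tied to the authors' evaluation code): UAR is the mean of per-class recalls, while WAR weights recall by class frequency and is therefore equivalent to overall accuracy.

```python
from collections import Counter, defaultdict

def uar_war(y_true, y_pred):
    """Return (UAR, WAR) for lists of ground-truth and predicted labels."""
    support = Counter(y_true)              # samples per class
    correct = defaultdict(int)             # correct predictions per class
    for t, p in zip(y_true, y_pred):
        if t == p:
            correct[t] += 1
    recalls = {c: correct[c] / n for c, n in support.items()}
    uar = sum(recalls.values()) / len(recalls)          # unweighted mean of recalls
    war = sum(correct[c] for c in support) / len(y_true)  # frequency-weighted recall
    return uar, war

# Example: uar_war(["happy", "sad", "sad"], ["happy", "sad", "happy"]) -> (0.75, 0.667)
```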
The combination of RLVR with GRPO proves effective in enhancing the model’s reasoning capabilities, as evidenced by the model's ability to output detailed, interpretable justifications for its emotion predictions. This capability aligns with a marked increase in both prediction accuracy and system interpretability, which is critical for applications in sensitive domains such as healthcare and human-computer interaction.
Key Contributions
The work introduces several technical innovations:
- RLVR for Multimodal Emotion Recognition: This marks the first application of RLVR to a multimodal LLM in the context of emotion recognition, thereby sidestepping the need for human preference-based reward tuning. The binary reward mechanism simplifies the training objective and provides direct supervision over the output.
- Integration of GRPO: By leveraging GRPO, the framework generates multiple candidate responses and normalizes rewards over the group, enabling effective ranking and selection without the bias introduced by a conventional critic model. This makes the model more sensitive to quality differences among candidate outputs.
- Cold Start Fine-Tuning: The initial fine-tuning using a small, targeted dataset improves the initial reasoning capabilities of the model, enabling a more stable reinforcement learning phase.
- Reward Function Decomposition: Splitting the reward into R_acc and R_format components ensures that the model not only predicts the correct emotion but does so in a structured and interpretable format. This step is critical for maintainability and downstream applications that require a standardized output format.
These contributions collectively result in a model that is not only accurate in emotion recognition but also more robust and interpretable in its reasoning, particularly when handling heterogeneous data inputs.
Limitations and Future Directions
Despite its strengths, R1-Omni exhibits certain limitations that provide avenues for future work. Notable concerns include:
- Inaccurate Subtitle Recognition: The model sometimes misinterprets subtitle information, which can adversely affect the multi-modal fusion process and thus the final emotion prediction.
- Hallucination in Reasoning: While the reasoning pathway is considerably improved, there remain instances of hallucinated justifications that could mislead end users in critical applications where explanation fidelity is paramount.
- Underutilization of Audio Cues: The current framework does not fully exploit the available audio information, signaling potential for further enhancements in the fusion strategy to balance the contributions of visual and audio modalities.
These limitations suggest future research directions aimed at refining the multi-modal integration process and enhancing the consistency and factuality of the model’s justifications.
Practical Implications and Real-World Deployment
For practical implementation, practitioners should note that training involves substantial computational resources, particularly due to the reinforcement learning components and the need to sample and score multiple candidate responses per input for group-wise reward normalization. Deploying such a model in real-world settings demands rigorous evaluation across varying data distributions to ensure robustness.
Deployments in sectors like human-computer interaction, healthcare diagnostics, and entertainment multimedia analysis stand to benefit from R1-Omni's enhanced accuracy and interpretability. However, attention must be paid to the potential for hallucinations and the limited audio integration, which may require bespoke tuning or additional modules targeting these specific challenges.
In summary, R1-Omni provides a technically robust and empirically validated framework for omni-multimodal emotion recognition, combining RLVR and GRPO to enhance both reasoning and predictive performance. The model establishes new performance baselines on challenging datasets and addresses several practical challenges inherent to multi-modal data integration. Continued research and careful tuning will further solidify its applicability in real-time, heterogeneous data environments.