
ARES: Alternating Reinforcement Learning and Supervised Fine-Tuning for Enhanced Multi-Modal Chain-of-Thought Reasoning Through Diverse AI Feedback (2407.00087v2)

Published 25 Jun 2024 in cs.AI, cs.CL, and cs.LG

Abstract: Large Multimodal Models (LMMs) excel at comprehending human instructions and demonstrate remarkable results across a broad spectrum of tasks. Reinforcement Learning from Human Feedback (RLHF) and AI Feedback (RLAIF) further refine LLMs by aligning them with specific preferences. These methods primarily use ranking-based feedback for entire generations. With advanced AI models (Teacher), such as GPT-4 and Claude 3 Opus, we can request various types of detailed feedback that are expensive for humans to provide. We propose a two-stage algorithm ARES that Alternates REinforcement Learning (RL) and Supervised Fine-Tuning (SFT). First, we request the Teacher to score how much each sentence contributes to solving the problem in a Chain-of-Thought (CoT). This sentence-level feedback allows us to consider individual valuable segments, providing more granular rewards for the RL procedure. Second, we ask the Teacher to correct the wrong reasoning after the RL stage. The RL procedure requires massive efforts for hyperparameter tuning and often generates errors like repetitive words and incomplete sentences. With the correction feedback, we stabilize the RL fine-tuned model through SFT. We conduct experiments on the multi-modal datasets ScienceQA and A-OKVQA to demonstrate the effectiveness of our proposal. ARES rationale reasoning achieves around 70% win rate against baseline models judged by GPT-4o. Additionally, we observe that the improved rationale reasoning leads to a 2.5% increase in inference answer accuracy on average for the multi-modal datasets.

Summary

  • The paper introduces ARES, which alternates reinforcement learning and supervised fine-tuning to enhance multi-modal chain-of-thought reasoning.
  • It leverages detailed sentence-level AI feedback to stabilize and correct reasoning outputs, addressing common RL instabilities.
  • ARES demonstrated a 70% win rate and a 2.5% accuracy increase on ScienceQA and A-OKVQA, marking significant performance improvements.

Overview of ARES: Enhanced Multi-Modal Chain-of-Thought Reasoning

The paper introduces ARES, an approach aimed at improving the reasoning abilities of Large Multimodal Models (LMMs) through a two-stage process that alternates Reinforcement Learning (RL) and Supervised Fine-Tuning (SFT). The work builds on the observation that LMMs already follow instructions well and perform strongly across a broad range of tasks. ARES seeks to refine these models further by leveraging detailed feedback from advanced AI (Teacher) models, with a particular focus on multi-modal Chain-of-Thought (CoT) reasoning.

Methodology

ARES operates through a two-stage algorithm that alternates between RL and SFT to optimize model performance. The first stage applies RL guided by newly introduced sentence-level feedback: advanced AI systems, such as GPT-4, are asked to score how much each sentence contributes to the overall problem-solving process, so the RL stage receives granular, sentence-specific rewards rather than a single ranking over whole generations. This detailed feedback helps the model distinguish which components of its reasoning chains are most valuable, aligning its outputs more closely with desired reasoning trajectories.
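
The sketch below illustrates one way such sentence-level scores could be turned into token-level rewards for a policy-gradient update. It is a simplified illustration, not the paper's implementation: `score_sentences` is a hypothetical stand-in for the Teacher API call, the sentence splitter is naive, and the token offsets and log-probabilities are toy values standing in for outputs of the policy LMM.

```python
# Minimal sketch of sentence-level reward shaping for the RL stage (assumptions noted above).
import re
from typing import List

import torch


def score_sentences(sentences: List[str]) -> List[float]:
    """Hypothetical Teacher call: rate how much each sentence helps solve the problem."""
    # Placeholder scores; in practice these would come from prompting the Teacher model.
    return [0.8 for _ in sentences]


def sentence_rewards_to_tokens(cot: str, token_offsets: List[int]) -> torch.Tensor:
    """Spread each sentence's score over the tokens that fall inside that sentence."""
    # Split the chain-of-thought into sentences (naive splitter, for illustration only).
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", cot.strip()) if s]
    scores = score_sentences(sentences)

    # Record the character span of each sentence within the full rationale.
    boundaries, start = [], 0
    for sent in sentences:
        start = cot.find(sent, start)
        boundaries.append((start, start + len(sent)))
        start += len(sent)

    # Assign each token the score of the sentence its character offset falls into.
    rewards = []
    for off in token_offsets:
        score = 0.0
        for (lo, hi), s in zip(boundaries, scores):
            if lo <= off < hi:
                score = s
                break
        rewards.append(score)
    return torch.tensor(rewards)


# Toy usage: a REINFORCE-style loss where each token is weighted by its sentence's score.
cot = "The animal has feathers. Feathers indicate a bird. So the answer is bird."
token_offsets = [0, 4, 11, 25, 34, 44, 58, 66]                        # assumed character offsets
token_logprobs = torch.randn(len(token_offsets), requires_grad=True)  # stand-in for policy log-probs

rewards = sentence_rewards_to_tokens(cot, token_offsets)
loss = -(rewards * token_logprobs).mean()  # granular, sentence-shaped policy gradient
loss.backward()
```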

Because the RL procedure is prone to instabilities such as repetitive outputs and high sensitivity to hyperparameter settings, the subsequent SFT stage addresses these issues. In this second stage, ARES asks the advanced AI model to correct wrong or incomplete reasoning directly and then fine-tunes on the corrected data. This stabilizes the outputs of the RL stage while still allowing the LMM to depart meaningfully from its pretraining distribution, improving the generated reasoning chains.
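
A minimal sketch of this correction-then-SFT loop is shown below. It assumes a text-only causal LM (GPT-2 via Hugging Face `transformers`) as a stand-in for the multimodal model actually fine-tuned in ARES, and `request_correction` is a hypothetical placeholder for the Teacher call that rewrites a flawed RL-generated rationale.

```python
# Sketch of the correction-feedback SFT stage, under the assumptions stated above.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; ARES itself fine-tunes a large multimodal model
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)


def request_correction(question: str, flawed_rationale: str) -> str:
    """Hypothetical Teacher call: return a corrected version of the rationale."""
    return flawed_rationale.replace("reptile", "bird")  # stand-in for a real API response


# RL-stage generations whose reasoning the Teacher flags as wrong or incomplete.
rl_outputs = [
    {"question": "What animal has feathers?",
     "rationale": "Feathers suggest a reptile, so the answer is reptile."},
]

model.train()
for ex in rl_outputs:
    corrected = request_correction(ex["question"], ex["rationale"])
    text = f"Question: {ex['question']}\nRationale: {corrected}"
    batch = tok(text, return_tensors="pt")
    # Standard supervised cross-entropy loss on the corrected rationale.
    loss = model(**batch, labels=batch["input_ids"]).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```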

Experimental Findings

ARES was evaluated on two prominent datasets, ScienceQA and A-OKVQA, which cover a variety of tasks requiring high-level multi-modal understanding. These datasets were chosen because they provide detailed rationale chains and multi-modal contexts that test an LMM's ability to perform complex reasoning.

Notably, ARES achieved roughly a 70% win rate against baseline models on the quality of its generated rationales, as judged by GPT-4o, along with an average 2.5% increase in inference answer accuracy across the two datasets. These empirical results underscore the efficacy of ARES in producing more coherent and contextually accurate reasoning chains than existing baselines.
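
The snippet below sketches how a pairwise win-rate evaluation of this kind might be run with GPT-4o as the judge. The prompt wording, answer parsing, and tie handling are assumptions rather than the paper's exact protocol; it requires the `openai` package and an API key.

```python
# Illustrative pairwise win-rate judging loop (assumed prompt and protocol, see above).
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = (
    "You are comparing two rationales for the same question.\n"
    "Question: {question}\n\nRationale A:\n{a}\n\nRationale B:\n{b}\n\n"
    "Which rationale reasons toward the correct answer more clearly? Reply with 'A' or 'B'."
)


def judge(question: str, ares_rationale: str, baseline_rationale: str) -> bool:
    """Return True if the judge prefers the ARES rationale (shown here as option A)."""
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            question=question, a=ares_rationale, b=baseline_rationale)}],
    )
    return resp.choices[0].message.content.strip().upper().startswith("A")


def win_rate(examples) -> float:
    """examples: iterable of (question, ares_rationale, baseline_rationale) tuples."""
    wins = sum(judge(q, a, b) for q, a, b in examples)
    return wins / max(len(examples), 1)
```

In practice one would also swap which system appears as option A on half of the comparisons to control for position bias in the judge.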

Implications and Future Directions

ARES sets a precedent for leveraging AI feedback for fine-tuning complex model architectures, reducing reliance on expensive human annotations, and refining intricate reasoning capabilities in LMMs. The adaptive methodology it presents may serve as a blueprint for future work aiming to bridge the gap between desired and achieved AI behavior, particularly in multi-modal and complex logical tasks.

Future research could extend ARES by integrating external domain knowledge to handle more abstract questions, for instance by dynamically retrieving knowledge beyond what the pretrained model already encodes. Further optimization of the RL and SFT stages in large-scale LMM architectures could also improve not only reasoning quality but operational efficiency and deployment scalability.

In conclusion, ARES provides a compelling model-improvement method that combines RL with detailed AI feedback, bringing LMM outputs closer to human-like reasoning. As AI systems continue to evolve, ARES stands as an important step toward more sophisticated agents capable of nuanced, multi-modal reasoning.
