Fine-Tuning Large Vision-Language Models as Decision-Making Agents via Reinforcement Learning (2405.10292v2)

Published 16 May 2024 in cs.AI, cs.CL, cs.CV, and cs.LG

Abstract: Large vision-language models (VLMs) fine-tuned on specialized visual instruction-following data have exhibited impressive language reasoning capabilities across various scenarios. However, this fine-tuning paradigm may not be able to efficiently learn optimal decision-making agents in multi-step goal-directed tasks from interactive environments. To address this challenge, we propose an algorithmic framework that fine-tunes VLMs with reinforcement learning (RL). Specifically, our framework provides a task description and then prompts the VLM to generate chain-of-thought (CoT) reasoning, enabling the VLM to efficiently explore intermediate reasoning steps that lead to the final text-based action. Next, the open-ended text output is parsed into an executable action to interact with the environment to obtain goal-directed task rewards. Finally, our framework uses these task rewards to fine-tune the entire VLM with RL. Empirically, we demonstrate that our proposed framework enhances the decision-making capabilities of VLM agents across various tasks, enabling 7b models to outperform commercial models such as GPT-4V or Gemini. Furthermore, we find that CoT reasoning is a crucial component for performance improvement, as removing the CoT reasoning results in a significant decrease in the overall performance of our method.

Training Vision-Language Models with Reinforcement Learning: An Introduction

Vision-language models (VLMs) have shown impressive language reasoning abilities when fine-tuned on specialized visual instruction-following data. However, these models face challenges in multi-step, goal-directed tasks in interactive environments. To tackle this, recent research proposes an algorithmic framework that fine-tunes VLMs using reinforcement learning (RL).

What's the Approach?

The research introduces a framework where VLMs, when given a task description, generate intermediate reasoning steps before deciding on a specific text-based action. These actions are then parsed into executable commands for interacting with the environment. RL is applied to fine-tune the VLMs based on the rewards received from these interactions.
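
To make this loop concrete, here is a minimal sketch of the interaction cycle, assuming a gym-style environment interface and a text-generation callable. The prompt format, the parse_action helper, the CountdownEnv stand-in, and the scripted generator are all illustrative assumptions for this sketch, not the paper's actual code; the paper's environments also provide image observations rather than the plain values used here.

```python
import re

def build_prompt(task_description, observation):
    # Ask for chain-of-thought reasoning, then a single committed action.
    return (
        f"Task: {task_description}\n"
        f"Observation: {observation}\n"
        'Think step by step, then finish with a line "action: <action>".'
    )

def parse_action(text, valid_actions):
    # Extract the final text action from the open-ended CoT output.
    match = re.search(r"action:\s*(\S+)", text, re.IGNORECASE)
    action = match.group(1) if match else None
    return action if action in valid_actions else None

def collect_episode(env, generate, task_description):
    """Roll out one episode, returning (prompt, output, reward) tuples for RL."""
    trajectory = []
    observation, done = env.reset(), False
    while not done:
        prompt = build_prompt(task_description, observation)
        output = generate(prompt)                      # CoT + final text action
        action = parse_action(output, env.valid_actions)
        observation, reward, done = env.step(action)   # unparsable actions could be penalized
        trajectory.append((prompt, output, reward))
    return trajectory

class CountdownEnv:
    """Trivial stand-in environment: reach 0 by repeatedly choosing 'down'."""
    valid_actions = {"down", "stay"}

    def reset(self):
        self.value = 3
        return self.value

    def step(self, action):
        if action == "down":
            self.value -= 1
        reward = 1.0 if self.value == 0 else 0.0
        return self.value, reward, self.value == 0

def scripted_generate(prompt):
    # Stand-in for the VLM: always "reasons" briefly, then picks 'down'.
    return "The value should move toward zero.\naction: down"

print(collect_episode(CountdownEnv(), scripted_generate, "Reach zero."))
```

The recorded (prompt, output, reward) tuples are what an RL algorithm would then consume to update the model.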

Key Components

  1. Chain-of-Thought (CoT) Reasoning: The VLM is prompted to lay out intermediate reasoning steps before committing to an action, which makes exploration toward the final text-based action more efficient.
  2. Open-ended Text Actions: The VLM generates text actions that are parsed into concrete actions for environment interaction.
  3. Reinforcement Learning: The entire VLM is fine-tuned using task rewards, improving its decision-making capabilities (a generic sketch of this kind of update follows the list).
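
RL fine-tuning of this kind is commonly done with a PPO-style clipped objective computed over the generated text. The snippet below is a generic sketch of that loss using PyTorch, not the paper's implementation: each generated output (CoT plus action tokens) is treated as one action whose summed log-probability is compared before and after the update, and the advantage estimates come from the task rewards.

```python
import torch

def ppo_clip_loss(new_logprobs, old_logprobs, advantages, clip_eps=0.2):
    # Clipped surrogate objective; each entry is the summed log-probability
    # of one generated output (CoT + action tokens) treated as a single action.
    ratio = torch.exp(new_logprobs - old_logprobs)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()

# Toy usage with made-up numbers: a "batch" of three generated outputs.
old = torch.tensor([-12.3, -9.8, -15.1])   # sequence log-probs at rollout time
new = torch.tensor([-12.0, -10.1, -14.7])  # sequence log-probs under current parameters
adv = torch.tensor([1.0, -0.5, 0.3])       # advantage estimates from task rewards
print(ppo_clip_loss(new, old, adv))
```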

Experimental Results

The framework was tested on a variety of tasks, spanning arithmetic and card-game tasks in a synthetic environment and a visual semantic reasoning task in an embodied setting. Here's a quick look at the tasks and results:

Arithmetic and Card-Game Tasks

  • NumberLine: Moving the current number to a target position on a synthetic number line (a toy sketch of this kind of environment follows the list).
  • EZPoints: Using numbers from two cards to compute a specified value.
  • Points24: A harder version of EZPoints requiring the use of four cards.
  • Blackjack: Winning a blackjack game using visual information.
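
As a concrete illustration of the simplest of these tasks, here is a toy, text-only NumberLine-style environment written as a gym-like sketch. The dictionary observation and the sparse reward scheme are our own assumptions for illustration; the paper's actual environment presents the numbers as an image and its reward design may differ.

```python
import random

class ToyNumberLine:
    """Move the current number toward a target with '+1' / '-1' actions."""
    valid_actions = ("+1", "-1")

    def __init__(self, low=0, high=10, max_steps=20):
        self.low, self.high, self.max_steps = low, high, max_steps

    def reset(self):
        self.target = random.randint(self.low, self.high)
        self.current = random.randint(self.low, self.high)
        while self.current == self.target:
            self.current = random.randint(self.low, self.high)
        self.steps = 0
        return {"current": self.current, "target": self.target}

    def step(self, action):
        self.steps += 1
        if action == "+1":
            self.current = min(self.current + 1, self.high)
        elif action == "-1":
            self.current = max(self.current - 1, self.low)
        done = self.current == self.target or self.steps >= self.max_steps
        reward = 1.0 if self.current == self.target else 0.0
        return {"current": self.current, "target": self.target}, reward, done

# Quick check with a hand-written policy that always moves toward the target.
env = ToyNumberLine()
obs, done, total = env.reset(), False, 0.0
while not done:
    action = "+1" if obs["current"] < obs["target"] else "-1"
    obs, reward, done = env.step(action)
    total += reward
print("episode return:", total)
```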

Visual Semantic Reasoning

  • ALFWorld: An embodied household environment that pairs visual observations with language instructions, used to test visual semantic understanding.

Performance Highlights

The proposed method significantly improved decision-making capabilities. The fine-tuned models, even at a modest 7-billion-parameter scale, outperformed commercial models such as GPT-4V and Gemini on most tasks. For instance:

  • NumberLine: Achieved a success rate of 89.4% (vs. 65.5% for GPT-4V)
  • Blackjack: Improved performance to 40.2% from 25.5% (GPT-4V)

The Role of CoT Reasoning

Experiments revealed that CoT reasoning played a critical role in improving model performance. Without it, performance dropped notably, underscoring its necessity. Moreover, the paper found that performance was best with moderate scaling factors on the CoT portion of the output, balancing the contribution of the reasoning steps against that of the final text-based action.
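
One natural way such a scaling factor can enter the objective is as a weight λ on the reasoning tokens when the sequence log-likelihood used by the policy-gradient update is assembled. The form below is a sketch in our own notation, an assumption about the general shape rather than a quotation of the paper's equation:

    log π_θ(v | o) = Σ_{t ∈ action tokens} log p_θ(v_t | o, v_<t) + λ · Σ_{t ∈ CoT tokens} log p_θ(v_t | o, v_<t)

Under this reading, λ = 0 would train only on the action tokens, while a very large λ would let the reasoning tokens dominate; the moderate values reported in the paper would sit between these extremes.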

Practical and Theoretical Implications

Practically, this research opens avenues for developing more intelligent, autonomous VLM agents capable of handling complex multi-step tasks in dynamic environments. Theoretically, it showcases the potential of integrating CoT reasoning with RL to enhance decision-making processes in VLMs.

Future Directions

The paper suggests two interesting future directions:

  1. Exploring Different Prompting Techniques: While CoT reasoning is beneficial, examining other prompting techniques could further enhance performance.
  2. Multi-task Training: Currently, the framework improves performance on individual tasks. Extending this to improve multiple tasks simultaneously could be a valuable future development.

In summary, the proposed framework combines the strengths of VLMs and RL to tackle the challenges of goal-directed multi-step tasks, demonstrating substantial improvements in decision-making capabilities. This blend of intermediate reasoning and reinforcement learning could indeed pave the way for more sophisticated and capable AI systems.

Authors

Yuexiang Zhai, Hao Bai, Zipeng Lin, Jiayi Pan, Shengbang Tong, Yifei Zhou, Alane Suhr, Saining Xie, Yann LeCun, Yi Ma, and Sergey Levine