
Language Models can Solve Computer Tasks (2303.17491v3)

Published 30 Mar 2023 in cs.CL, cs.AI, cs.HC, and cs.LG

Abstract: Agents capable of carrying out general tasks on a computer can improve efficiency and productivity by automating repetitive tasks and assisting in complex problem-solving. Ideally, such agents should be able to solve new computer tasks presented to them through natural language commands. However, previous approaches to this problem require large amounts of expert demonstrations and task-specific reward functions, both of which are impractical for new tasks. In this work, we show that a pre-trained LLM agent can execute computer tasks guided by natural language using a simple prompting scheme where the agent Recursively Criticizes and Improves its output (RCI). The RCI approach significantly outperforms existing LLM methods for automating computer tasks and surpasses supervised learning (SL) and reinforcement learning (RL) approaches on the MiniWoB++ benchmark. We compare multiple LLMs and find that RCI with the InstructGPT-3+RLHF LLM is state-of-the-art on MiniWoB++, using only a handful of demonstrations per task rather than tens of thousands, and without a task-specific reward function. Furthermore, we demonstrate RCI prompting's effectiveness in enhancing LLMs' reasoning abilities on a suite of natural language reasoning tasks, outperforming chain of thought (CoT) prompting with external feedback. We find that RCI combined with CoT performs better than either separately. Our code can be found here: https://github.com/posgnu/rci-agent.

Authors (3)
  1. Geunwoo Kim (1 paper)
  2. Pierre Baldi (89 papers)
  3. Stephen McAleer (41 papers)
Citations (264)

Summary

Insights on "Language Models can Solve Computer Tasks"

The paper "Language Models can Solve Computer Tasks" presents an approach to harnessing pre-trained LLMs for executing computer tasks through a prompting method called Recursive Criticism and Improvement (RCI), which substantially outperforms existing methods. The authors set out to address the limitations of previous approaches, which rely on vast amounts of expert demonstrations or task-specific reward functions, both of which are infeasible for novel tasks.

Key Contributions

The authors introduce the RCI prompting scheme, a simple yet effective technique wherein the LLM critiques and refines its prior output to enhance performance on designated tasks. This recursive mechanism enables the LLM to revise its decisions continuously until the task requirements are satisfied, resulting in improved task accuracy and robustness.
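This critique-then-improve loop can be sketched in a few lines. The sketch below is a minimal illustration of the control flow, not the authors' implementation: `call_llm` is a hypothetical stand-in for any text-completion API, and the prompt wording and the `toy_llm` used to exercise the loop are invented for demonstration.

```python
def rci_loop(task, call_llm, max_rounds=3):
    """Minimal RCI sketch: generate an answer, ask the model to
    critique it, then ask for an improved answer, repeating until
    the output stabilizes or a round limit is reached."""
    output = call_llm(f"Task: {task}\nAnswer:")
    for _ in range(max_rounds):
        critique = call_llm(
            f"Task: {task}\nAnswer: {output}\n"
            "Review your previous answer and find problems with it."
        )
        improved = call_llm(
            f"Task: {task}\nAnswer: {output}\nCritique: {critique}\n"
            "Based on the problems you found, improve your answer."
        )
        if improved == output:  # converged; stop early
            break
        output = improved
    return output


# A toy deterministic "LLM" (pure illustration) that improves its
# answer once and then stabilizes, just to exercise the loop.
def toy_llm(prompt):
    if "improve your answer" in prompt:
        return "improved answer"
    if "find problems" in prompt:
        return "the answer is too vague"
    return "draft answer"


print(rci_loop("summarize the page", toy_llm))
```

The key design point is that the critique and the improvement are separate calls, so the model first commits to identifying flaws before revising, rather than rewriting blindly.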

The paper demonstrates that RCI combined with InstructGPT-3 trained with Reinforcement Learning from Human Feedback (RLHF) achieves state-of-the-art results on the MiniWoB++ benchmark. Notably, it does so with only a handful of demonstrations per task, rather than the tens of thousands required by previous supervised and reinforcement learning models, and without any task-specific reward function.

Methodology and Results

The RCI method involves a decomposition of action selection into three grounding steps: task grounding, state grounding, and agent grounding. Task grounding involves generating a high-level plan based on the given task description. State grounding ensures that actions are feasible in the current state by relating high-level concepts to specific HTML page elements. Finally, agent grounding guarantees that actions can be executed correctly by the computer agent. The authors demonstrate that RCI prompting improves reasoning capabilities across a suite of natural language reasoning tasks, outperforming existing zero-shot and chain-of-thought prompting methods.
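The three grounding steps can be sketched as a pipeline of successive prompts. This is a hypothetical illustration of the decomposition described above, not the paper's code: `call_llm` again stands in for any completion API, and the prompt phrasings, the `clickxpath` command format, and `toy_grounding_llm` are assumptions made for the example.

```python
def choose_action(task, html_state, call_llm):
    """Sketch of the three-step grounding used to select one action."""
    # 1. Task grounding: draft a high-level plan from the task description.
    plan = call_llm(f"Task: {task}\nProduce a step-by-step plan.")
    # 2. State grounding: tie the next step to concrete elements
    #    visible in the current HTML state of the page.
    step = call_llm(
        f"Plan: {plan}\nCurrent page HTML: {html_state}\n"
        "Rewrite the next step so it refers to actual page elements."
    )
    # 3. Agent grounding: convert the grounded step into a single
    #    executable agent command.
    action = call_llm(
        f"Step: {step}\n"
        "Express this step as a single executable agent command."
    )
    return action


# Toy deterministic "LLM" (pure illustration) returning canned
# responses for each grounding stage.
def toy_grounding_llm(prompt):
    if "executable agent command" in prompt:
        return 'clickxpath //button[@id="submit"]'
    if "actual page elements" in prompt:
        return "click the Submit button"
    return "1. click submit"


print(choose_action("submit the form", '<button id="submit">', toy_grounding_llm))
```

Each stage narrows the output space: the plan is free-form, the grounded step must name real page elements, and the final command must be directly executable by the agent.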

Quantitatively, RCI prompting achieves substantial improvements over baseline LLM approaches. Across varied reasoning benchmarks, zero-shot and chain-of-thought prompting augmented with RCI consistently outperform their unaugmented counterparts, and the combination of RCI with CoT performs better than either technique alone.

Implications and Future Perspectives

The implications of this research are profound, both theoretically and practically. From a theoretical standpoint, the ability of RCI prompting to enhance reasoning in LLMs suggests new directions for improving general architectures for decision-making in AI. Practically, this technique could enhance productivity in environments where complex computer tasks prevail, diversifying the potential applications of LLMs far beyond current use cases.

Looking ahead, RCI-based agents can be expected to improve further as the underlying LLMs advance. The paper opens avenues for integrating multimodal foundation models, which combine text, images, audio, and video, thus broadening the scope and robustness of AI systems in real-world applications. Furthermore, fine-tuning LLMs specifically for computer task-solving, expanding action spaces, and enhancing reasoning abilities remain critical areas for continued exploration.

In conclusion, this work provides valuable insights and a promising approach to scaling LLM capabilities for novel and efficient task automation, setting a foundation for future advancements in AI applications across extensive domains.
