Introduction
The development of AI agents capable of human-like planning has been a longstanding goal in AI research. TravelPlanner is introduced as a benchmark for assessing language agents in complex, real-world planning scenarios, with travel planning as its focus. The benchmark evaluates agents against meticulously curated planning intents and measures their ability to draw on nearly four million data records through a suite of tools.
Related Work
Recent breakthroughs have placed LLMs at the center of progress on language agents. Models such as GPT-3.5, GPT-4, and Gemini have endowed agents with capabilities including memory, tool use, and strategic planning, leading to significant gains in general problem-solving. Studies demonstrate these agents' ability to combine long-term parametric memory with short-term working memory and to interact with external environments through API calls.
TravelPlanner: A Novel Planning Benchmark
TravelPlanner centers on travel planning, a multidimensional task that involves long-horizon decisions and numerous constraints such as budget limits and accommodation preferences. The benchmark challenges agents to construct multi-day itineraries that satisfy a combination of explicit constraints stated in the user query and implicit commonsense constraints. Its difficulty is compounded by environmental dynamics: agents must gather information through tools and adjust their plans according to the feedback they receive.
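To make the constraint structure concrete, the sketch below shows one way a multi-day itinerary and its checks could be represented. The schema, field names, and the specific budget and repeated-attraction rules are illustrative assumptions, not TravelPlanner's actual data format or evaluation code.

```python
from dataclasses import dataclass

@dataclass
class DayPlan:
    """One day of an itinerary (hypothetical schema for illustration)."""
    city: str
    attractions: list[str]
    accommodation: str
    meal_cost: float
    lodging_cost: float
    transport_cost: float

def check_plan(days: list[DayPlan], budget: float) -> list[str]:
    """Return a list of constraint violations found in a candidate plan."""
    violations = []

    # Explicit (hard) constraint: total spend must stay within the stated budget.
    total = sum(d.meal_cost + d.lodging_cost + d.transport_cost for d in days)
    if total > budget:
        violations.append(f"over budget: {total:.2f} > {budget:.2f}")

    # Implicit (commonsense) constraint: do not schedule the same attraction twice.
    seen = set()
    for d in days:
        for a in d.attractions:
            if a in seen:
                violations.append(f"repeated attraction: {a}")
            seen.add(a)

    return violations

# Example: a two-day plan that overshoots a 300-unit budget and repeats an attraction.
plan = [
    DayPlan("Myrtle Beach", ["Boardwalk"], "Sea Inn", 40, 120, 30),
    DayPlan("Myrtle Beach", ["Boardwalk"], "Sea Inn", 40, 120, 30),
]
print(check_plan(plan, budget=300))
# -> ['over budget: 380.00 > 300.00', 'repeated attraction: Boardwalk']
```

The point of the sketch is that each constraint is easy to check in isolation; the benchmark's difficulty comes from having to satisfy all of them jointly while gathering the underlying information through tools.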
Evaluation Findings
Empirical evaluations across models such as GPT-4 and a range of planning strategies reveal a substantial gap between current LLMs' capabilities and the demands of TravelPlanner. Even the strongest agents achieve a final success rate of merely 0.6%: they can satisfy some constraints in isolation but rarely synthesize a coherent plan that meets all of them at once. Common failure modes include misusing tools, falling into dead loops through repeated invalid actions, and producing hallucinated content when information is missing or ambiguous.
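For intuition about how those failure modes arise, here is a minimal sketch of a ReAct-style tool-calling loop with two guards the findings suggest are needed: a hard step limit against dead loops and explicit handling of unknown tools or malformed arguments. The tool names, the llm callable, and the control flow are hypothetical placeholders, not TravelPlanner's agent framework.

```python
MAX_STEPS = 30

# Hypothetical tools standing in for the benchmark's search APIs.
TOOLS = {
    "FlightSearch": lambda origin, dest, date: f"flights {origin}->{dest} on {date}",
    "AccommodationSearch": lambda city: f"hotels in {city}",
}

def run_agent(query: str, llm) -> str:
    history = [f"Task: {query}"]
    for step in range(MAX_STEPS):
        # The LLM proposes the next action, e.g. ("FlightSearch", {...}),
        # or ("Finish", {"plan": ...}) once it believes the plan is complete.
        action, args = llm(history)

        if action == "Finish":
            return args["plan"]

        if action not in TOOLS:
            # Hallucinated tool name: record the error instead of looping silently.
            history.append(f"Step {step}: unknown tool '{action}'")
            continue

        try:
            observation = TOOLS[action](**args)
        except TypeError as e:
            # Malformed arguments are a common tool-use error in practice.
            observation = f"invalid arguments for {action}: {e}"

        history.append(f"Step {step}: {action}({args}) -> {observation}")

    return "No plan produced within the step limit (possible dead loop)."

# Minimal scripted stand-in for an LLM, for demonstration only.
def scripted_llm(history):
    if len(history) == 1:
        return "FlightSearch", {"origin": "SFO", "dest": "MYR", "date": "2024-03-01"}
    return "Finish", {"plan": "Fly SFO->MYR and stay two nights."}

print(run_agent("Plan a 3-day trip to Myrtle Beach.", scripted_llm))
```

Capping the number of steps and feeding tool errors back into the history are simple mitigations for dead loops and argument errors; the evaluation results suggest that even with such scaffolding, jointly satisfying all constraints remains the hard part.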
Conclusion
TravelPlanner provides a rigorous test of language agents' capacity for context-aware planning under realistic, complex constraints. The results indicate that AI planning abilities are still at an early stage and underscore the considerable work needed to reach human-level competence on such tasks. Future research building on TravelPlanner will be instrumental in developing more sophisticated strategies that let agents handle many simultaneous constraints and long-horizon tasks with the finesse of human planners.