LogiPlan: A Structured Benchmark for Logical Planning and Relational Reasoning in LLMs (2506.10527v1)

Published 12 Jun 2025 in cs.AI and cs.PF

Abstract: We introduce LogiPlan, a novel benchmark designed to evaluate the capabilities of LLMs in logical planning and reasoning over complex relational structures. Logical relational reasoning is important for applications that may rely on LLMs to generate and query structured graphs of relations such as network infrastructure, knowledge bases, or business process schema. Our framework allows for dynamic variation of task complexity by controlling the number of objects, relations, and the minimum depth of relational chains, providing a fine-grained assessment of model performance across difficulty levels. LogiPlan encompasses three complementary tasks: (1) Plan Generation, where models must construct valid directed relational graphs meeting specified structural constraints; (2) Consistency Detection, testing models' ability to identify inconsistencies in relational structures; and (3) Comparison Question, evaluating models' capacity to determine the validity of queried relationships within a given graph. Additionally, we assess models' self-correction capabilities by prompting them to verify and refine their initial solutions. We evaluate state-of-the-art models including DeepSeek R1, Gemini 2.0 Pro, Gemini 2 Flash Thinking, GPT-4.5, GPT-4o, Llama 3.1 405B, O3-mini, O1, and Claude 3.7 Sonnet across these tasks, revealing significant performance gaps that correlate with model scale and architecture. Our analysis demonstrates that while recent reasoning-enhanced models show promising results on simpler instances, they struggle with more complex configurations requiring deeper logical planning.

Summary

  • The paper introduces LogiPlan, a benchmark assessing LLMs in plan generation, consistency detection, and relational reasoning over structured graphs.
  • It employs dynamic task complexity by varying object counts, relation depth, and structural constraints to measure model performance.
  • Experimental results reveal significant performance gaps between reasoning and instruction-based models, guiding future research directions.

LogiPlan: Benchmarking Logical Planning and Relational Reasoning in LLMs

The paper "LogiPlan: A Structured Benchmark for Logical Planning and Relational Reasoning in LLMs" (2506.10527) introduces LogiPlan, a new benchmark designed to evaluate the performance of LLMs in logical planning and reasoning over complex relational structures. The benchmark focuses on assessing the generation, consistency detection, and querying of structured graphs, which are crucial for applications involving network infrastructure, knowledge bases, and business process schemas. LogiPlan enables dynamic variation of task complexity by controlling the number of objects, relations, and the minimum depth of relational chains, thus providing a granular assessment of model performance across different difficulty levels.

LogiPlan Benchmark Design and Tasks

The LogiPlan benchmark comprises three distinct yet complementary tasks, each designed to probe different aspects of logical and relational reasoning: Plan Generation, Consistency Detection, and Comparison Question.

  • Plan Generation: This task requires models to construct valid directed relational graphs that meet specified structural constraints. Models are evaluated on their ability to generate consistent, accurate, and complete plans without human intervention: overall accuracy accounts for the consistency of the generated relations, the absence of redundant relations, and adherence to the specified number of objects and relations. The strongest reasoning models achieved accuracy scores of 97.9% and 97.2%, while another reasoning model scored 30.8%, still roughly twice the score of the best instruction-tuned model (see Figure 1).

    Figure 1: Plan Generation Relation Duplications in Model Output.

  • Consistency Detection: This task assesses a model's ability to identify inconsistencies, such as cycles or contradictions, within a given relational graph. The evaluation focuses on whether the model can detect the presence of inconsistencies and correctly identify and extract them. Detecting and extracting all simple cycles of a directed graph has algorithmic complexity O((V + E) × (C + 1)), where V is the number of vertices, E is the number of edges, and C is the total number of simple cycles in the graph. In this task, models like O3-mini and O1 significantly outperformed other models (see Figure 2).

    Figure 2: An example graph for the Consistency Detection task.

  • Comparison Question: In this task, models must interpret a pre-existing relational graph and make informed judgments about specific relations, categorizing them as "True", "False", or "Unknown". This task evaluates not only direct reasoning capabilities but also the model's decision-making process when handling incomplete information and edge cases. The reachability check between two nodes of a directed graph runs in O(V + E) time using BFS or DFS; a minimal sketch of this check and of the cycle check above appears after this list.
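To make the two graph checks concrete, the sketch below implements a plain DFS cycle check (Consistency Detection) and a BFS reachability check used to classify a queried relation as True, False, or Unknown (Comparison Question). It is only an illustrative sketch, not the paper's evaluation harness; the edge semantics (an edge a → b asserting the queried relation from a to b) and all function names are assumptions. Note also that merely detecting a cycle is O(V + E); extracting every simple cycle, as the task ultimately requires, is the costlier O((V + E) × (C + 1)) step.

```python
from collections import deque, defaultdict

def build_adjacency(edges):
    """Adjacency lists for a directed graph given as (source, target) pairs."""
    adj = defaultdict(list)
    for src, dst in edges:
        adj[src].append(dst)
    return adj

def has_cycle(edges):
    """Consistency Detection sketch: detect a directed cycle with DFS, O(V + E)."""
    adj = build_adjacency(edges)
    WHITE, GRAY, BLACK = 0, 1, 2
    color = defaultdict(int)  # unseen nodes default to WHITE

    def dfs(node):
        color[node] = GRAY
        for nxt in adj[node]:
            if color[nxt] == GRAY:               # back edge: cycle found
                return True
            if color[nxt] == WHITE and dfs(nxt):
                return True
        color[node] = BLACK
        return False

    nodes = {n for edge in edges for n in edge}
    return any(color[n] == WHITE and dfs(n) for n in nodes)

def reachable(edges, start, goal):
    """BFS reachability check between two nodes, O(V + E)."""
    adj = build_adjacency(edges)
    seen, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            return True
        for nxt in adj[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False

def answer_comparison(edges, a, b):
    """Comparison Question sketch: 'True' if the relation a -> b is entailed,
    'False' if the reverse relation is entailed, otherwise 'Unknown'."""
    if reachable(edges, a, b):
        return "True"
    if reachable(edges, b, a):
        return "False"
    return "Unknown"

edges = [("A", "B"), ("B", "C"), ("D", "C")]
print(has_cycle(edges))                    # False
print(answer_comparison(edges, "A", "C"))  # True
print(answer_comparison(edges, "A", "D"))  # Unknown
```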

The benchmark also evaluates the models' self-correction capabilities by prompting them to verify and refine their initial solutions. This is done by asking "Are you sure?" after the Consistency Detection and Comparison Question tasks to assess the impact of revisions on overall accuracy.
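The "Are you sure?" follow-up can be expressed as a simple two-turn loop around whatever model client is in use. The sketch below is an assumption about how such a wrapper might look, not the authors' harness; `call_model` is a placeholder for any chat-completion client, not a real API.

```python
def self_correction_round(call_model, task_prompt, followup="Are you sure?"):
    """Query a model, then prompt it to verify and possibly revise its answer.

    `call_model(messages)` is a placeholder: any function that accepts a list
    of {"role", "content"} messages and returns the model's reply as a string.
    """
    messages = [{"role": "user", "content": task_prompt}]
    first_answer = call_model(messages)

    messages += [
        {"role": "assistant", "content": first_answer},
        {"role": "user", "content": followup},
    ]
    revised_answer = call_model(messages)

    # Comparing the two answers shows whether re-evaluation helped,
    # hurt, or left the response unchanged.
    return first_answer, revised_answer
```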

Experimental Evaluation and Results

The LogiPlan benchmark was used to evaluate nine state-of-the-art models, including instruction-tuned and reasoning-capable LLMs: Llama 3.1 405B, Gemini 2.0 Pro, GPT-4o, GPT-4.5, O1, O3-mini, DeepSeek R1, Gemini 2 Flash Thinking, and Claude 3.7 Sonnet.

Figure 3: DeepSeek R1.

The evaluation revealed significant performance disparities that correlate with model scale and architecture. In the Plan Generation task, reasoning-based models demonstrated a superior ability to identify straightforward generation techniques, such as ordering objects logically (e.g., A > B > ... > Z). For the Consistency Detection and Comparison Question tasks, the performance gap persisted, particularly between O1 and O3-mini on one side and the remaining models on the other. The Consistency Detection task proved to be the most challenging, with even state-of-the-art reasoning models struggling as the size of the problem increased.

Figure 4: Claude 3.7 Sonnet Thinking.

The paper highlights that models like DeepSeek R1 and Gemini 2 Flash Thinking experienced a performance drop in the Consistency Detection task much earlier than other models as problem size increased. Self-correction capabilities also varied, with some models improving their performance upon re-evaluation, while others exhibited minimal change or a performance drop.

Implications and Future Directions

LogiPlan serves as a diagnostic tool and a catalyst for future research, pushing the boundaries of what is achievable in logical relational reasoning with LLMs. The benchmark and the code for data generation and evaluation are open-sourced to enable future research on LLM evaluation and reasoning. The findings suggest that while LLMs have made notable progress in handling structured relational reasoning, significant performance gaps remain, particularly as task complexity increases.

The research underscores the limitations of instruction-based models in generating logically consistent plans and highlights the robust, albeit varied, performance of dedicated reasoning models. Future research directions include developing more sophisticated architectures and self-correction strategies that can better tackle real-world, complex logical planning problems. Additionally, exploring methods to enhance the ability of LLMs to handle larger problem sizes and more intricate relational structures remains a key area for investigation.

Conclusion

The LogiPlan benchmark offers valuable insights into the logical planning and relational reasoning capabilities of LLMs. By providing a structured framework for evaluating these skills, LogiPlan enables researchers to identify the strengths and weaknesses of different models and guide the development of more advanced AI systems capable of tackling complex real-world problems. The benchmark's dynamic nature and open-source availability ensure its continued relevance and contribution to the field of LLM research.
