ZebraLogic: On the Scaling Limits of LLMs for Logical Reasoning

Published 3 Feb 2025 in cs.AI, cs.CL, and cs.LG | (2502.01100v2)

Abstract: We investigate the logical reasoning capabilities of LLMs and their scalability in complex non-monotonic reasoning. To this end, we introduce ZebraLogic, a comprehensive evaluation framework for assessing LLM reasoning performance on logic grid puzzles derived from constraint satisfaction problems (CSPs). ZebraLogic enables the generation of puzzles with controllable and quantifiable complexity, facilitating a systematic study of the scaling limits of models such as Llama, o1 models, and DeepSeek-R1. By encompassing a broad range of search space complexities and diverse logical constraints, ZebraLogic provides a structured environment to evaluate reasoning under increasing difficulty. Our results reveal a significant decline in accuracy as problem complexity grows -- a phenomenon we term the curse of complexity. This limitation persists even with larger models and increased inference-time computation, suggesting inherent constraints in current LLM reasoning capabilities. Additionally, we explore strategies to enhance logical reasoning, including Best-of-N sampling, backtracking mechanisms, and self-verification prompts. Our findings offer critical insights into the scalability of LLM reasoning, highlight fundamental limitations, and outline potential directions for improvement.

Abstract PDF Upgrade to Chat

Summary

The paper demonstrates a critical decline in LLM accuracy on logic puzzles as complexity increases, regardless of model size or additional compute.
It introduces ZebraLogic, a robust evaluation framework using CSP-derived grid puzzles to quantify search space size and Z3 conflict metrics.
The study reveals that scaling models or using advanced sampling techniques only partially overcomes the inherent limitations in logical reasoning tasks.

ZebraLogic: On the Scaling Limits of LLMs for Logical Reasoning

The paper "ZebraLogic: On the Scaling Limits of LLMs for Logical Reasoning" (2502.01100) examines the logical reasoning capabilities of LLMs. It introduces ZebraLogic, an evaluation framework for assessing the scaling limits of LLM reasoning performance on logic grid puzzles derived from constraint satisfaction problems (CSPs). The results indicate a significant decline in accuracy with increasing problem complexity, regardless of model size or inference-time computation.

Problem Formulation and Setup

The ZebraLogic framework constructs puzzles requiring the assignment of attributes to houses based on a set of clues, drawing on CSPs to provide a mathematically robust test environment. Each ZebraLogic puzzle consists of $N$ houses and $M$ attributes, with clues introducing constraints that the model must satisfy.

Figure 1: An example of ZebraLogic features with 3 houses and 3 attributes, illustrating the task for assigning attributes based on provided clues.

Complexity Metrics

Two metrics evaluate problem complexity: search space size and Z3 conflicts. Search space size represents the number of possible configurations, while Z3 conflicts measure logical complexity. The puzzles challenge LLMs by scaling these dimensions.

Main Findings

The findings underscore a "curse of complexity" in logical reasoning with LLMs; their accuracy degrades markedly as search space complexity increases. Simple problems remain solvable, but accuracy diminishes drastically for more intricate problems, indicating profound scalability issues.

Figure 2: Accuracy vs number of Z3 conflicts for various models, depicting scaling effects on reasoning performance.

The study demonstrates that neither increasing model size nor enhancing inference-time computation sufficiently addresses this scalability issue. For instance, Llama-3.1-405B outperforms smaller models in simpler search spaces, yet fails similarly as complexity rises.

Scaling Model Size and Test-Time Compute

While increasing LLM size shows potential improvements for smaller puzzles, it does not yield comparable gains in more complex domains. This suggests intrinsic limitations in scaling model sizes for significant complexity.

Figure 3: Comparison of model size scaling and test-time compute scaling under varied search space complexities.

Expanding test-time compute through methods like Best-of-N sampling and backtracking can offer partial performance gains. Although repeated sampling improves theoretical performance potential (e.g., GPT-4o with BoN sampling nears o1 performance), practical gains remain constrained, highlighting the inadequacy of current techniques.

Hidden Reasoning Tokens

The o1 models utilize extensive hidden chain-of-thought (CoT) tokens during inference. This approach, not visible to users, appears pivotal in o1's superior problem-solving abilities. However, this method's limits are exposed when complexity crest certain bounds, indicating a saturation point for current strategies.

Figure 4: The relationship between hidden CoT tokens and Z3 conflicts, suggesting scale with problem complexity.

Conclusion

The results from ZebraLogic contribute significant insights into the boundaries of LLM reasoning scalability. Current frameworks struggle with high-complexity reasoning tasks, and both traditional scaling and compute-intensive strategies fall short. Future research should explore innovative methodologies to enhance logical reasoning capabilities in LLMs, emphasizing robust, scalable non-monotonic reasoning mechanisms.