- The paper presents SCOPE, a two-stage planning framework that disentangles query-specific reasoning from generic solver generation.
- It achieves high accuracy and robustness, with success rates reaching 100% on meeting planning and significant gains over chain-of-thought methods.
- The framework reduces inference cost and latency while ensuring reusability and scalability of solver code across similar multi-constraint tasks.
Programming over Thinking: Efficient and Robust Multi-Constraint Planning
Motivation and Problem Setting
Multi-constraint sequential planning requires generating candidate solutions that satisfy multiple, sometimes conflicting constraints, as in real-world tasks such as travel itinerary generation and meeting scheduling. Traditional LLM-driven approaches, predominantly those based on text-based chain-of-thought or multi-agent reasoning, exhibit scaling bottlenecks and robustness failures. Long natural-language reasoning chains tend to accumulate errors and lose consistency as constraint structures grow complex or lengthy, while code- or solver-based strategies are typically query-specific, imposing inflexible, non-generalizable execution logic. The probabilistic nature of LLM outputs exacerbates these issues, hindering consistent constraint tracking and driving up inference cost as the solution space expands.
Framework: Scalable COde Planning Engine (SCOPE)
SCOPE introduces a two-stage paradigm that disentangles planning from execution, implemented as a multi-agent LLM workflow. The query-specific reasoning stage formalizes the problem: LLM agents extract a structured representation of combinations (candidate-generation parameters) and constraints (validation logic) from a single example query–solution pair. These structured representations, once optimized via multiple parameter-free refinement agents, define the generic solver abstraction for the problem domain.
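The paper does not prescribe a concrete data format for this stage; the following is a minimal sketch of what the formalized output might look like, with all names (`FormalizedProblem`, the toy meeting-planning fields) purely illustrative assumptions rather than the paper's API.

```python
# Illustrative sketch (not the paper's exact API) of the structured output of
# the query-specific reasoning stage: generation parameters ("combinations")
# plus validation predicates ("constraints").
from dataclasses import dataclass, field
from typing import Any, Callable

@dataclass
class FormalizedProblem:
    # Parameters that span the candidate space, e.g. cities and day budgets
    # for trip planning, or people and time slots for meeting planning.
    combinations: dict[str, Any] = field(default_factory=dict)
    # Each constraint is a predicate evaluated on one candidate plan.
    constraints: list[Callable[[dict], bool]] = field(default_factory=list)

# Hypothetical instantiation for a toy meeting-planning query.
problem = FormalizedProblem(
    combinations={"people": ["Ana", "Bo"], "slots": ["09:00", "10:00", "11:00"]},
    constraints=[
        lambda plan: plan["Ana"] != plan["Bo"],  # meetings must not overlap
        lambda plan: plan["Bo"] >= "10:00",      # Bo is unavailable before 10:00
    ],
)
```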
The second stage — generic solver generation — programmatically synthesizes reusable, deterministic solver functions:
- Combination Function: Exhaustively enumerates candidate plans using the formalized combination parameters, supporting permutation and assignment invariants dictated by the domain.
- Filter Function: Deterministically selects valid plans from candidates based solely on constraint satisfaction, independent of query-specific logic.
- Deliver Function: Formats structured solution outputs as domain-aligned natural language descriptions.
Critically, the solver code is unchanged across queries of the same domain; only the input parameters (the structured combinations and constraints produced by LLM inference) are adapted, as in the sketch below. Solver code refinement is performed autonomously by comparing generated outputs against ground-truth outputs, ensuring the code meets domain requirements without manual prompt engineering or heuristics.
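To make the combine/filter/deliver split concrete, here is a minimal sketch of such a generic solver for the toy meeting-planning domain used above; the function names and parameter format are assumptions for illustration, not the paper's released code.

```python
# Illustrative generic solver: the same three functions serve every query in
# the domain; only `combinations` and `constraints` change per query.
from itertools import product

def combine(combinations):
    """Exhaustively enumerate candidate plans from the generation parameters."""
    people, slots = combinations["people"], combinations["slots"]
    for assignment in product(slots, repeat=len(people)):
        yield dict(zip(people, assignment))  # one slot assigned to each person

def filter_valid(candidates, constraints):
    """Keep only the candidates that satisfy every constraint predicate."""
    return [plan for plan in candidates if all(check(plan) for check in constraints)]

def deliver(plans):
    """Render structured solutions as natural-language descriptions."""
    return ["; ".join(f"meet {p} at {t}" for p, t in plan.items()) for plan in plans]

# Per-query inputs (produced by the LLM stage); the solver code never changes.
combinations = {"people": ["Ana", "Bo"], "slots": ["09:00", "10:00", "11:00"]}
constraints = [lambda p: p["Ana"] != p["Bo"], lambda p: p["Bo"] >= "10:00"]
print(deliver(filter_valid(combine(combinations), constraints)))
# e.g. ['meet Ana at 09:00; meet Bo at 10:00', 'meet Ana at 09:00; meet Bo at 11:00', ...]
```

Because filtering is deterministic and exhaustive over the enumerated candidates, the LLM's probabilistic output affects only the parameters, not the constraint checking itself, which is the robustness argument the framework rests on.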
Experimental Evaluation
Benchmarks and Model Families
SCOPE was evaluated on TravelPlanner [Xie2024TravelPlanner] and Natural Plan [Zheng2024NaturalPlan], canonical multi-constraint planning environments with combinatorial complexity and closed constraint systems. Experiments spanned five proprietary LLMs (GPT-4o, GPT-o3, GPT-5, Gemini-1.5-Pro, Gemini-2.5-Pro) and compared SCOPE against reasoning baselines: direct prompting, Chain-of-Thought [Wei2022ChainOfThought], Tree-of-Thought [Yao2023ToT], EvoAgent [Yuan2025Evoagent], HyperTree Planning [gui2025HTP], and code-based Thought of Search [Liu2024Tos].
Numerical Results
SCOPE achieves strong empirical performance:
- TravelPlanner (GPT-4o): SCOPE succeeds on 93.1% of queries, a 61.6-point gain over CoT (31.5% success).
- Trip Planning (GPT-4o): SCOPE, at 87.1%, far exceeds ToS (12.5%) and CoT (3.9%).
- Meeting Planning (GPT-4o): SCOPE achieves 100% success, while ToS registers 59.8% and CoT 47.4%.
- Efficiency: SCOPE reduces inference cost by up to 1.4× and latency by 4.67× compared to leading baselines, especially as planning horizon or constraint count increases.
- Performance consistency: SCOPE's accuracy drops only minimally as combinatorial or constraint complexity increases, in contrast to baselines, which degrade rapidly.
On stronger models (GPT-5, Gemini-2.5-Pro), SCOPE matches or exceeds baseline performance while scaling significantly better in cost and latency; on weaker models, it substantially closes the gap to the state of the art. The analysis documents robustness gains over error-prone planning horizons and under long-horizon constraint aggregation.
Ablation and Error Analysis
Systematic ablation of SCOPE's components (problem formalization, optimization, refinement) causes severe performance drops, underscoring the necessity of each agentic stage for robust abstraction and generalization. Error analysis indicates that the principal failure modes are the Input Agent's misinterpretation of the query-to-parameter mapping (especially with smaller models) and overgeneralization from demonstrations; failures are rarely attributable to the solver itself.
Theoretical and Practical Implications
SCOPE demonstrates that disentangling natural-language reasoning from execution logic substantially mitigates the fundamental limitations of probabilistic LLM output in planning. The explicit abstraction of combinatorial generation and constraint satisfaction not only yields deterministic, reusable solver logic but also induces strong generalization to unseen queries within a domain. The approach is architecturally orthogonal to existing slow-thinking and multi-agent reasoning paradigms, circumventing the error propagation and scaling bottlenecks intrinsic to text-driven methods.
Practically, SCOPE enables efficient deployment of LLM-based agents in real-world settings requiring robust, cost-effective constraint satisfaction and planning — for example, itinerary generation, high-frequency scheduling, and resource allocation. The independence of solver code from query content supports modular domain adaptation and swift inference.
Theoretically, SCOPE offers a bridge between symbolic planning, combinatorial search, and LLM-based natural language understanding. It provides a pathway for integrating declarative representations and procedural code within LLM workflows, supporting future research into cross-domain code abstraction, interpretable AI planning, and hybrid symbolic–neural reasoning.
Future Directions
Open challenges remain. SCOPE’s solvers generalize only within a domain; domain transfer requires re-formalization and code regeneration. Further, the reliance on the coding competence of proprietary LLMs may limit transferability to open-source or specialized models. Promising future directions include meta-abstraction of solver code across domains, automated benchmarking of solution space and constraint specification, and downstream applications in real-time agentic coordination and multimodal planning.
Conclusion
The SCOPE framework establishes an efficient, robust paradigm for multi-constraint planning with LLMs by separating query-specific formalization from generic solver code execution. The empirical and theoretical analyses show clear superiority in accuracy, scalability, and efficiency, enabling practical deployment of LLM agents for complex planning tasks and inspiring future developments in programmatic AI reasoning (2601.09097).