R‑ConstraintBench: Evaluating LLM Feasibility
- R‑ConstraintBench is a benchmarking framework for RCPSP that rigorously tests LLM feasibility reasoning by incrementally layering constraints in a controlled DAG structure.
- It systematically introduces operational constraints—resource downtimes, temporal windows, and disjunctive rules—to simulate complex, real-world scheduling scenarios.
- Empirical evaluations reveal that while LLMs perform well on simple precedence cases, the interaction of multiple constraints leads to rapid drops in scheduling feasibility.
R‑ConstraintBench is a benchmarking framework developed to rigorously evaluate the feasibility reasoning capabilities of LLMs on Resource-Constrained Project Scheduling Problems (RCPSPs), an archetypal NP‑Complete scheduling class. Unlike traditional benchmarks that focus on solution optimization, R‑ConstraintBench asks whether a fully feasible schedule can be produced at all, especially under diverse and interacting operational constraints. The framework incrementally increases structural complexity by layering non‑redundant precedence constraints in Directed Acyclic Graphs (DAGs) and then systematically injects resource downtimes, temporal windows, and disjunctive constraints. A prominent domain‑grounded instantiation is presented via data center migration scheduling. Empirical results demonstrate that while top LLMs maintain near-perfect feasibility on precedence-only DAGs, their reliability collapses under interacting constraints, revealing that the primary bottleneck is constraint interaction rather than graph depth. Strong performance on synthetically generated ramps moreover fails to guarantee transfer to domain-specific scenarios, highlighting limited generalization.
1. Benchmarking Framework and Problem Class
R‑ConstraintBench targets the RCPSP feasibility class, which asks whether all constraints (resource, temporal, operational) can be simultaneously satisfied by some schedule. The decision problem is NP‑Complete and mirrors real-world planning demands in industries such as construction, logistics, manufacturing, and IT. The core benchmarking problem is to evaluate LLMs' capacity for feasibility reasoning under increasing constraint complexity, not merely the execution of topological sorts or simple precedence orderings.
The framework constructs instances as layered DAGs, partitioning tasks into groups (L₁, ..., Lₘ) and incrementally introducing non‑redundant cross-layer precedence constraints. Each instance generation step is tightly controlled to ensure acyclicity and manage the combinatorial growth of feasible schedules, permitting granular analysis of model robustness as constraint density grows.
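A minimal sketch of such a generator is given below, assuming uniform random sampling over cross-layer task pairs; the function and parameter names (generate_layered_dag, tasks_per_layer, and so on) are illustrative, not the benchmark's actual API.

```python
import random
from itertools import product

def generate_layered_dag(num_layers, tasks_per_layer, k, seed=0):
    """Sample a layered DAG with exactly k non-redundant precedence edges.

    Tasks are partitioned into layers L1..Lm and edges only point from an
    earlier layer to a later one, so acyclicity holds by construction. An
    edge (i, j) is skipped as redundant if j is already reachable from i.
    """
    rng = random.Random(seed)
    layers = [[a * tasks_per_layer + b for b in range(tasks_per_layer)]
              for a in range(num_layers)]
    reachable = {v: {v} for layer in layers for v in layer}
    candidates = [(i, j)
                  for la in range(num_layers)
                  for lb in range(la + 1, num_layers)
                  for i, j in product(layers[la], layers[lb])]
    rng.shuffle(candidates)
    edges = []
    for i, j in candidates:
        if len(edges) == k:
            break
        if j in reachable[i]:               # would be implied transitively
            continue
        edges.append((i, j))
        for reach in reachable.values():    # propagate reachability through i
            if i in reach:
                reach |= reachable[j]
    return layers, edges
```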
2. Methodological Details and Formal Constraint Definitions
R‑ConstraintBench follows a two-tiered instance generation protocol:
- Layered DAG Structure: At level k, k non‑redundant precedence edges are added, yielding a controlled progression in scheduling complexity. Formally, for tasks i ∈ T with durations pᵢ and start times sᵢ, each precedence (i → j) imposes the basic feasibility constraint sⱼ ≥ sᵢ + pᵢ.
- Multi-Axis Operational Constraints: Three operational constraint types are introduced after the DAG is defined:
- Resource Downtime: For resource r, there exist downtime intervals [a, b) during which the available capacity Rᵣ(t) = 0 for all t ∈ [a, b), enforcing absolute unavailability.
- Temporal Windows: Each task i is assigned a release time rᵢ and a deadline dᵢ, producing the constraints sᵢ ≥ rᵢ and sᵢ + pᵢ ≤ dᵢ.
- Disjunctive Constraints: Designated pairs of tasks (i, j) are forbidden from overlapping; either sᵢ + pᵢ ≤ sⱼ or sⱼ + pⱼ ≤ sᵢ must hold.
Resource feasibility is formalized as: for every resource r and time t, Σ_{i ∈ A(t)} u(i, r) ≤ Rᵣ(t), where A(t) = {i : sᵢ ≤ t < sᵢ + pᵢ} is the set of tasks active at t, u(i, r) is task i's demand for r, and Rᵣ(t) drops to zero inside downtime intervals.
Instances are systematically sampled so that each constraint type is present with a controlled probability (e.g., 75%), allowing analysis of failure patterns as constraint complexity escalates. A deterministic reference checker covering all four constraint families is sketched below.
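The sketch assumes a dictionary-based encoding (start, dur, prec, windows, demand, capacity, downtime, disjoint are our names, not the benchmark's schema):

```python
def check_feasibility(start, dur, prec, windows, demand, capacity, downtime, disjoint):
    """Verify a candidate schedule against all four constraint families.

    start[i]     : proposed start time of task i   dur[i]      : duration pᵢ
    prec         : pairs (i, j) meaning i → j      windows[i]  : (rᵢ, dᵢ)
    demand[i][r] : units of resource r task i uses capacity[r] : nominal capacity
    downtime[r]  : intervals [a, b) of unavailability
    disjoint     : pairs (i, j) that must not overlap
    """
    # Precedence: sⱼ ≥ sᵢ + pᵢ for every edge i → j.
    if any(start[j] < start[i] + dur[i] for i, j in prec):
        return False
    # Temporal windows: rᵢ ≤ sᵢ and sᵢ + pᵢ ≤ dᵢ.
    if any(start[i] < r or start[i] + dur[i] > d for i, (r, d) in windows.items()):
        return False
    # Disjunctive pairs: one task must finish before the other starts.
    if any(not (start[i] + dur[i] <= start[j] or start[j] + dur[j] <= start[i])
           for i, j in disjoint):
        return False
    # Resource feasibility: usage can only peak when a task starts or a
    # downtime window opens, so checking those event times suffices.
    events = set(start.values())
    events |= {a for intervals in downtime.values() for a, _ in intervals}
    for t in sorted(events):
        active = [i for i in start if start[i] <= t < start[i] + dur[i]]
        for r, cap in capacity.items():
            down = any(a <= t < b for a, b in downtime.get(r, []))
            usage = sum(demand[i].get(r, 0) for i in active)
            if usage > (0 if down else cap):
                return False
    return True
```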
3. Empirical Evaluation and Results
R‑ConstraintBench evaluation comprises two empirical phases:
- Phase I: DAG-only schedules (pure precedence). Models such as Grok‑4, GPT‑5, o3, and Gemini‑2.5‑Pro achieve near‑ceiling feasibility, consistent with the fact that precedence-only scheduling reduces to topological sorting, which is solvable in polynomial time.
- Phase II: Multi-Constraint Interaction (MCI). When resource downtimes, temporal windows, and disjunctive constraints are injected, LLMs uniformly exhibit precipitous drops in schedule feasibility. Analytical metrics include per-level feasibility, weighted area under the curve (WAUC), and breakpoint analysis (the point at which feasibility drops below an operational threshold, e.g., 70%); a sketch of both metrics follows this list. o3 and GPT‑5 maintain higher feasibility deeper into the complexity ramp, but all models degrade rapidly under combined constraint regimes.
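In rough terms, these metrics can be computed from per-level feasibility rates as follows. The linear level weighting here is our assumption; the benchmark's exact weighting scheme may differ.

```python
def breakpoint_level(feasibility_by_level, threshold=0.70):
    """Return the first complexity level whose feasibility rate falls below
    the operational threshold, or None if no level on the ramp does."""
    for level in sorted(feasibility_by_level):
        if feasibility_by_level[level] < threshold:
            return level
    return None

def weighted_auc(feasibility_by_level):
    """Weighted area under the feasibility-vs-complexity curve.

    Assumption: weights grow linearly with level so that reliability deep
    in the ramp dominates the score; the paper's exact weights may differ.
    """
    levels = sorted(feasibility_by_level)
    weights = {level: rank + 1 for rank, level in enumerate(levels)}
    total = sum(weights.values())
    return sum(weights[lv] * feasibility_by_level[lv] for lv in levels) / total

# Example: feasibility collapses between levels 3 and 4.
rates = {1: 1.0, 2: 0.95, 3: 0.80, 4: 0.40, 5: 0.10}
print(breakpoint_level(rates))        # -> 4
print(round(weighted_auc(rates), 3))  # -> 0.493
```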
A domain-grounded evaluation (Phase IIb) in a data center migration scenario—mapping scheduling layers to shutdown, unrack, transport, install, and test operations—further reveals that even high-performing models, notably GPT‑5, experience significant reliability loss under real-world operational constraints.
4. Domain-Grounded Scheduling: Data Center Migration Example
The data center migration setting exemplifies the translation of R‑ConstraintBench to operationally relevant workflows:
- Workflow Layers: Five phases (shutdown → unrack → transport → install → test) constitute the backbone.
- Resource Constraints: Specialized resources (IT_Team, DC_Crew, Network_Engineers, Forklift, Convoy) with varying capacities and exclusive use periods.
- Downtime: Planned maintenance and inspections introduce resource null intervals.
- Temporal Windows: Regulatory deadlines and earliest start times model cut‑over periods.
- Disjunctive Constraints: Critical equipment or personnel cannot be shared across overlapping tasks.
Feasibility reasoning in this scenario requires LLMs to combine domain and operational knowledge with formal RCPSP logic, and performance degrades rapidly as constraint density increases, even for models that excel on synthetic benchmarks. A concrete instance fragment is sketched below.
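To make the mapping concrete, a fragment of such a migration instance can be written in the same dictionary layout as the Section 2 checker; every number below is invented for illustration.

```python
# Illustrative fragment of a migration instance (times in hours; all values invented).
dur = {"shutdown_A": 2, "unrack_A": 3, "transport_A": 5,
       "install_A": 4, "test_A": 2, "unrack_B": 3}
demand = {
    "shutdown_A":  {"IT_Team": 1},
    "unrack_A":    {"DC_Crew": 2, "Forklift": 1},
    "transport_A": {"Convoy": 1},
    "install_A":   {"DC_Crew": 2, "Forklift": 1},
    "test_A":      {"Network_Engineers": 1},
    "unrack_B":    {"DC_Crew": 2, "Forklift": 1},
}
prec = [("shutdown_A", "unrack_A"), ("unrack_A", "transport_A"),
        ("transport_A", "install_A"), ("install_A", "test_A")]
capacity = {"IT_Team": 2, "DC_Crew": 4, "Forklift": 1,
            "Convoy": 1, "Network_Engineers": 2}
downtime = {"Forklift": [(10, 12)]}    # planned inspection window
windows = {"test_A": (0, 24)}          # regulatory cut-over deadline
disjoint = [("unrack_A", "unrack_B")]  # the single Forklift cannot serve both racks
```

Passing these dictionaries to check_feasibility together with a model-proposed start map yields a binary feasibility verdict per instance.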
5. Analysis of Constraint Interaction and Failure Modes
The principal empirical finding is that constraint interaction, not increased dependency depth, constitutes the key bottleneck for LLM scheduling reliability. Specific error analysis shows:
- Violation Patterns: Some models fail most frequently on precedence, others on temporal windows or disjunctive constraints, indicating specialization or gaps across constraint families (a tagging sketch follows this list).
- Global Consistency Requirements: Feasibility demands simultaneously satisfying all constraint types; LLMs that reason only locally (e.g., per layer or per resource) are insufficient for NP‑Complete feasibility.
- Domain-Generalization Failure: Transfer from synthetically clean instances to domain-grounded problems is nontrivial; operational entanglements introduce new, context-dependent failure modes.
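One way to surface such violation patterns is to tag each infeasible proposal with the first constraint family it breaks and histogram the tags per model. A minimal classifier in the Section 2 encoding (the category labels are ours) might look like:

```python
def classify_violation(start, dur, prec, windows, disjoint):
    """Tag a schedule with the first constraint family it violates (or None).

    Resource/downtime checks are omitted for brevity; a full classifier
    would also scan capacity usage as in the Section 2 checker.
    """
    for i, j in prec:
        if start[j] < start[i] + dur[i]:
            return "precedence"
    for i, (r, d) in windows.items():
        if start[i] < r or start[i] + dur[i] > d:
            return "temporal_window"
    for i, j in disjoint:
        if not (start[i] + dur[i] <= start[j] or start[j] + dur[j] <= start[i]):
            return "disjunctive"
    return None
```

Counting these tags over a model's failed instances yields the per-family violation histogram referred to above.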
A plausible implication is that improving LLM scheduling ability requires training or prompting protocols that specifically target heterogeneous, interacting constraint reasoning, not merely greater dependency depth or precedence handling.
6. Implications, Limitations, and Future Directions
The evidence from R‑ConstraintBench highlights major limitations in LLM feasibility reasoning:
- Generalization Gap: High synthetic performance does not ensure robust operational transfer.
- Constraint Mixing Vulnerability: Interacting constraints (especially resource downtime combined with disjunctive exclusivity) rapidly erode models' ability to produce feasible schedules.
- Specificity to Domain Context: Models must adapt to new resource, timing, and operational regimes beyond abstract graph reasoning.
The framework suggests several future research directions, including:
- Fine-tuning LLMs on heterogeneous scheduling domains,
- Systematic ablations to isolate prompt and instance sensitivities,
- Constructing benchmarks for other NP‑Complete classes (job-shop, vehicle routing),
- Exploring multi-step program-aided, chain-of-thought, or verification-assisted reasoning for improved feasibility rates (one possible verification loop is sketched below).
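As one concrete shape for the last direction, a verification-assisted loop pairs the model with the deterministic checker and feeds violations back as repair hints. The sketch below is hypothetical: llm_propose_schedule is a placeholder callable, not a real model API, and the loop reuses classify_violation from Section 5.

```python
def verify_and_repair(instance, llm_propose_schedule, max_rounds=3):
    """Generate, check, and repair: query the model for a schedule, verify it
    deterministically, and feed any violation back as a targeted hint.

    llm_propose_schedule(instance, feedback) is a hypothetical callable
    returning {task: start_time}; feedback carries the prior violation.
    """
    feedback = None
    for _ in range(max_rounds):
        schedule = llm_propose_schedule(instance, feedback)
        tag = classify_violation(schedule, instance["dur"], instance["prec"],
                                 instance["windows"], instance["disjoint"])
        if tag is None:
            return schedule              # feasible proposal found
        feedback = f"previous schedule violated a {tag} constraint"
    return None                          # round budget exhausted, still infeasible
```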
The R‑ConstraintBench approach thus establishes a rigorous foundation for exposing and remedying current LLM limitations in constraint-rich, large-scale scheduling, while providing a roadmap for future diagnostic and adaptive model development.