Agentic Task Synthesis Pipeline
- An agentic task synthesis pipeline is a framework that converts high-level task descriptions into executable workflows using iterative reasoning and coding agents.
- It integrates a ReAct control loop, persistent IPython kernel, and a modular tool suite to support dynamic error correction and stateful code execution.
- Empirical evaluations show 100% success on CP-Bench, demonstrating superior fault tolerance and efficiency compared to static, predetermined pipelines.
The agentic task synthesis pipeline is a paradigm for automating the end-to-end translation of high-level task descriptions into executable, tool-guided workflows via self-directed reasoning, code execution, and verification. In contrast to static, predetermined pipelines, agentic task synthesis leverages a general coding agent—often steered by a ReAct (Reason and Act) control loop and powerful prompt engineering—to achieve incremental, stateful development, dynamic error correction, and modular verification. The architecture exemplified by CP-Agent demonstrates that success in complex domains (e.g., constraint programming) depends not on hand-crafted agent logic or rigid workflows, but on the confluence of general-purpose code execution interfaces, contextual memory, and prompt-encoded domain expertise (Szeider, 10 Aug 2025).
1. Architectural Foundations
Agentic task synthesis is implemented in CP-Agent using a minimal yet robust architecture. Core components include:
- ReAct Loop Controller: Orchestrates interleaved “Think” and “Do” stages, cyclically engaging the agent in reasoning and tool invocation until convergence.
- Tool Suite: Exposes a concise set of primitives—read_file, write_file, list_files, delete_file, python_exec, todo_write—to support file I/O, code execution, and task management. These operations are strictly confined to the working directory and return structured signals to the loop controller.
- Persistent IPython Kernel: All code execution is stateful. The agent interacts with a long-lived IPython process via ZeroMQ, allowing variable retention and cumulative definitions across multiple tool calls.
- Prompt Hierarchy: Task synthesis is driven by three layers:
- System Prompt (~200 lines): General agent conventions, tool usage, error handling.
- Project Prompt (~700 lines): Domain-specific (CP) modeling templates, mandatory workflows, archetype catalog, verification checklists.
- Task Prompt: The raw natural-language problem description.
The singleton kernel may be wrapped in an isolated environment for package-specific needs (e.g., CPMpy), ensuring state persistence until requirements change or the session ends.
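The paper specifies the kernel only at this level (a long-lived IPython process reached over ZeroMQ). The sketch below shows one conventional way to realize such a stateful kernel using the standard jupyter_client library; the class and method names are illustrative, not the paper's implementation.

```python
from queue import Empty
from jupyter_client import KernelManager

class PersistentKernel:
    """Minimal sketch of a long-lived, stateful IPython kernel wrapper (assumed design)."""

    def __init__(self):
        self.km = KernelManager(kernel_name="python3")
        self.km.start_kernel()
        self.kc = self.km.client()
        self.kc.start_channels()
        self.kc.wait_for_ready()

    def execute(self, code: str) -> str:
        """Run code in the shared kernel; variables persist across calls."""
        self.kc.execute(code)
        output = []
        while True:
            try:
                msg = self.kc.get_iopub_msg(timeout=30)
            except Empty:
                break
            msg_type = msg["header"]["msg_type"]
            if msg_type == "stream":
                output.append(msg["content"]["text"])
            elif msg_type == "error":
                output.append("\n".join(msg["content"]["traceback"]))
            elif msg_type == "status" and msg["content"]["execution_state"] == "idle":
                break
        return "".join(output)

# State carries over between tool calls:
# k = PersistentKernel(); k.execute("x = 41"); k.execute("print(x + 1)")  -> "42"
```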
2. Agentic Reasoning and ReAct Loop
The ReAct loop instantiates the agentic workflow:
- Initialization: The agent loads the problem statement (read_file) into memory, primes the LLM context with system and project prompts, and begins iterative synthesis.
- Iteration: The loop cycles through "Think–Do–Observe":
  - Think (Reason): The LLM examines the current context, tool-call memory, partial code, and task progress. Reasoning traces yield decomposition plans, decision-variable selection, and hypotheses about modeling strategies (e.g., use of cp.AllDifferent).
  - Do (Act): The LLM parses its own output to decide on a tool call (python_exec for code, todo_write for task lists, write_file for final output).
  - Observe (Feedback): Results from tool calls (e.g., code output, exception stack traces) are injected back into the context, guiding subsequent reasoning steps (a minimal sketch of this observation format follows this list).
- Debugging Feedback: Exceptions (e.g., import errors, model failures) trigger explicit repair actions, such as command corrections or import amendments in new code executions.
- Termination: The agent emits a COMPLETE signal and writes the final, verified CPMpy script, halting the loop.
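As an illustration of the Observe and Debugging Feedback steps, the sketch below shows how a tool result, including the traceback from a failed python_exec call, might be serialized into an observation string for the next Think step. The function and field names are assumptions (consistent with the format_observation helper in the pseudocode of Section 5), not the paper's exact interface.

```python
def format_observation(tool_name: str, result: dict) -> str:
    """Serialize a tool result into text that is appended to the LLM context (assumed format)."""
    lines = [f"[observation] tool={tool_name} ok={result['ok']}"]
    if result.get("stdout"):
        lines.append("stdout:\n" + result["stdout"])
    if result.get("traceback"):
        # A failed execution is not terminal: the traceback becomes context
        # for the next Think step, which proposes a repair (e.g., a missing import).
        lines.append("traceback:\n" + result["traceback"])
    return "\n".join(lines) + "\n"

# Example: a failed python_exec call feeds its traceback back into the loop.
obs = format_observation("python_exec", {
    "ok": False,
    "stdout": "",
    "traceback": "NameError: name 'cp' is not defined",
})
```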
3. Project Prompt Specification and Verification
The project prompt encodes domain expertise and prescribes invariant rules for constraint modeling:
- Mandatory Workflow:
- Deconstruct: Parse all input data, extract parameters, infer constants.
- Model: Define decision variables using global constraints; incrementally construct the constraint logic from domain to structure to objectives.
- Solve & Verify: Execute the CPMpy model; independently validate outputs in pure Python (a minimal sketch follows this list); ensure JSON format conformity; verify all constraints and re-calculate objectives for optimization problems.
- Finalize: Output cleanup and JSON formatting per requirements.
- Verification: Completion requires passing a 12-item compliance checklist, including domain-specific modeling patterns and avoidance of illicit Python constructs (e.g., if expressions outside the CP context).
- Playbook Catalog: Maps keywords to canonical archetype patterns (e.g., TSP template for "visit every location", assignment model for "assign workers").
- Debug & Performance Tips: Recommends integer scaling for solver compatibility, conditional constructs via cp.Implies, and symmetry-breaking patterns.
- Appendices: Common modeling pitfalls and API reference.
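The independent validation required by the Solve & Verify step can be as simple as re-checking the constraints in plain Python before writing the final answer. The sketch below is illustrative only; the field names and schema are invented and do not reflect CP-Bench's actual JSON schemas.

```python
import json

def verify_solution(raw_json: str) -> bool:
    """Re-check a candidate AllDifferent-style solution outside the solver (illustrative)."""
    data = json.loads(raw_json)
    if set(data) != {"n", "assignment"}:              # JSON format conformity
        return False
    n, assignment = data["n"], data["assignment"]
    if len(assignment) != n:
        return False
    if any(not (1 <= v <= n) for v in assignment):    # domain check
        return False
    if len(set(assignment)) != n:                     # AllDifferent re-checked in pure Python
        return False
    return True

print(verify_solution('{"n": 4, "assignment": [2, 4, 1, 3]}'))  # True
```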
4. Interfaces and Execution Environment
The agent's environment is defined by strictly constrained IPC and execution primitives:
- File Operations: read_file(path: str), write_file(path: str, content: str), list_files(pattern: str="*"), delete_file(path: str)—all paths infer working-directory context (a minimal sketch of these primitives follows this list).
- Code Execution: python_exec(code: str) transmits code to a stateful IPython kernel, persisting context across calls.
- Task Management: todo_write(todos: List[Dict[id, content, status, priority]]) enforces a single in_progress task at a time, with contextual recall.
- Orchestrator: LangGraph manages the loop, memory, and tool dispatch; the LLM (Claude 4 Sonnet) is served via OpenRouter, with streaming and logging support.
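The following is a minimal sketch of the file and task primitives under the signatures listed above, with working-directory confinement and the single-in_progress invariant. The enforcement details (pathlib-based path resolution, the specific exceptions) are assumptions, not the paper's implementation; delete_file is omitted for brevity.

```python
from pathlib import Path
from typing import Dict, List

WORKDIR = Path("./workspace").resolve()

def _resolve(path: str) -> Path:
    """Confine every path to the working directory (assumed enforcement strategy)."""
    p = (WORKDIR / path).resolve()
    if not p.is_relative_to(WORKDIR):
        raise PermissionError(f"path escapes working directory: {path}")
    return p

def read_file(path: str) -> str:
    return _resolve(path).read_text()

def write_file(path: str, content: str) -> str:
    target = _resolve(path)
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(content)
    return f"wrote {len(content)} bytes to {path}"

def list_files(pattern: str = "*") -> List[str]:
    return sorted(str(p.relative_to(WORKDIR)) for p in WORKDIR.glob(pattern))

def todo_write(todos: List[Dict]) -> List[Dict]:
    """Enforce the single in_progress invariant described above."""
    if sum(t["status"] == "in_progress" for t in todos) > 1:
        raise ValueError("only one todo may be in_progress at a time")
    return todos
```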
5. Core Workflow and Modeling Snippets
Core pipeline pseudocode enforces incremental reasoning, tool invocation, and observation injection:
```python
kernel = None  # persistent IPython kernel
tools = {
    "read": read_file,
    "write": write_file,
    "exec": python_exec,
    "todo": todo_write,
}

context = load_system_prompt() + load_project_prompt()
problem_text = tools["read"]("task.md")
context += problem_text

done = False
while not done:
    lm_output = call_llm(context)
    if lm_output.calls_tool:
        tool_name, args = parse_tool_call(lm_output)
        result = tools[tool_name](**args)
        obs = format_observation(tool_name, result)
        context += lm_output.content + obs
    elif lm_output.signals_completion:
        tools["write"](lm_output.filename, lm_output.code)
        done = True
    else:
        context += lm_output.content
```
Modeling snippets are directly mapped to CPMpy code—for example, an AllDifferent constraint is encoded:
```python
from cpmpy import *

n = 8  # illustrative size; in practice n comes from the parsed problem data
x = intvar(1, n, shape=n, name="x")
model = Model(AllDifferent(x))
```
Summation constraints (e.g., for knapsack problems) employ:
```python
# x are Boolean selection variables; weight, capacity, n_items come from the instance data
total_weight = cp.sum([x[i] * weight[i] for i in range(n_items)])
model += total_weight <= capacity
```
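The two snippets above assume surrounding definitions. As a self-contained illustration of how such fragments compose into the mandated model–solve–verify flow, the following minimal knapsack model uses invented data; it is an example, not code from the paper.

```python
import cpmpy as cp

# Invented instance data for illustration only
weight = [3, 4, 2, 5]
value = [4, 5, 3, 8]
capacity = 9
n_items = len(weight)

# Boolean decision variables: x[i] == 1 iff item i is packed
x = cp.boolvar(shape=n_items, name="x")

model = cp.Model()
model += cp.sum([x[i] * weight[i] for i in range(n_items)]) <= capacity
model.maximize(cp.sum([x[i] * value[i] for i in range(n_items)]))

if model.solve():
    chosen = [i for i in range(n_items) if x[i].value()]
    # Independent re-check in pure Python, mirroring the Solve & Verify step
    assert sum(weight[i] for i in chosen) <= capacity
    print({"items": chosen, "total_value": sum(value[i] for i in chosen)})
```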
6. Empirical Evaluation: CP-Bench and Comparative Analysis
- Benchmark Setup: Applied to CP-Bench, comprising 101 problems (30 optimization, 71 satisfaction), sourced from CSPLib, CPMpy, and course exercises.
- All task descriptions (task.md) are paired with explicit JSON schemas.
- Each run records the final CPMpy script plus conversation logs (agentic-python-coder v1.0.0, Claude 4 Sonnet 20241022).
- Validation:
- Satisfaction: Models are checked against reference CPMpy solutions for constraint satisfaction.
- Optimization: Objective value is independently checked for optimality.
- Checklist enforcement ensures compliance.
- Results:
- 100% success rate: All 101 problems solved, with optimal objective values for the optimization instances; no failures observed.
- Tool usage statistics:
  - python_exec calls per problem: 4–23 (higher for scheduling).
  - todo_write: used in 59 problems, with task lists of 1–10 items.
  - read_file, write_file: exactly one call each in 99/101 cases (two exceptions with intermediates).
- Token usage per problem: ~180K input, ~6K output.
- Fault tolerance: All exception types were absorbed and repaired within the loop.
- Comparison:
- Fixed-pipeline methods plateau at ~70% accuracy (CPMpy subset).
- CP-Agent’s pure agentic approach achieves complete coverage and superior flexibility.
7. Significance and Generalization
The agentic task synthesis pipeline, as exemplified by CP-Agent, demonstrates the practical and theoretical advantages of general coding agents for structured modeling:
- Elimination of rigid architectures: Success is driven by prompt-encoded expertise and dynamic memory, not by embedding domain logic into architecture.
- Efficiency: Achieves full benchmark coverage with a minimal implementation (a few hundred lines of code).
- Debuggability and robustness: Exception handling and self-repair are integral, supporting dynamic convergence.
- Applicability: The paradigm generalizes to other domains where iterative code synthesis, domain-specific prompt design, and stateful execution are necessary for optimal performance.
This approach situates agentic synthesis pipelines as the canonical strategy for scalable, verifiable, and high-fidelity translation of natural-language tasks into executable programs in complex domains (Szeider, 10 Aug 2025).