Cook2LTL: Translating Recipes into LTL Plans
- Cook2LTL is a system that translates unstructured cooking recipes into temporally precise, robot-executable plans using Linear Temporal Logic.
- It combines a lightweight semantic parser, pretrained large language models, and dynamic caching to efficiently reduce high-level instructions into kitchen primitives.
- Empirical evaluations in simulation demonstrate that action caching significantly reduces API calls, latency, and cost while improving plan reliability.
Cook2LTL is a system for translating unstructured natural language cooking recipes into unambiguous, temporally precise robot-executable plans specified in Linear Temporal Logic (LTL). Developed to address the linguistic and temporal complexity of everyday recipes and the vast action space inherent to cooking, Cook2LTL combines a lightweight semantic parser, pretrained LLMs, a dynamic action decomposition library with caching, and formal LTL schema to generate grounded kitchen task plans (Mavrogiannis et al., 2023).
1. System Architecture and Workflow
The Cook2LTL architecture integrates four primary components to enable the end-to-end translation of recipe instructions into LTL:
- Cooking Domain Knowledge Base: Captures the set of kitchen-relevant primitive actions (e.g., pick, place, turn_on, wait) and the semantic schema for instruction parsing, comprising the categories Verb, What?, Where?, How?, Time, and Temperature.
- Semantic-Parsing Module: Processes each recipe step, segmenting imperative sentences into their semantic constituents according to the schema and emitting function-style action descriptors.
- LLM-Based Translator: For descriptors that are not already primitives, generates reductions into sequences of primitives using "prog-prompt"-style few-shot prompting. Also produces an initial LTL "skeleton" based on the parsed descriptors.
- Dynamic Action Library and Caching: Each new action decomposition is stored in a runtime cache, enabling lookups and bypassing redundant LLM queries on repeated high-level actions.
Algorithm 1 in the system formalizes the full action pipeline: preprocess the instruction; semantically parse to obtain descriptors; check for known primitive mappings (via the primitive action set or the runtime cache); reduce composite actions with the LLM and update the cache; finally, construct the full LTL expression over the sequence of reduced actions (Mavrogiannis et al., 2023).
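A minimal sketch of this pipeline, assuming hypothetical helpers `parse_step` and `llm_reduce` as stand-ins for the trained parser and the LLM call (names and primitive set are illustrative, not the paper's actual API):

```python
# Hedged sketch of the Cook2LTL pipeline (Algorithm 1).
# PRIMITIVES, parse_step, and llm_reduce are illustrative stand-ins.

PRIMITIVES = {"pick", "place", "slice", "turn_on", "wait_until"}

def parse_step(step: str) -> list[str]:
    """Stand-in semantic parser returning function-style descriptors."""
    # A real implementation would run the fine-tuned sequence tagger.
    return [step]

def llm_reduce(descriptor: str) -> list[str]:
    """Stand-in for the prog-prompt LLM call."""
    return [descriptor]

def cook2ltl(steps: list[str], cache: dict) -> list[str]:
    actions = []
    for step in steps:
        for d in parse_step(step):
            name = d.split("(")[0]
            if name in PRIMITIVES:      # already a kitchen primitive
                actions.append(d)
            elif d in cache:            # cached decomposition: skip the LLM
                actions.extend(cache[d])
            else:                       # novel composite action: query LLM
                reduction = llm_reduce(d)
                cache[d] = reduction    # store for future reuse
                actions.extend(reduction)
    return actions
```

The cache dictionary is threaded through every call, so a composite action pays the LLM cost only on its first appearance.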
2. Mapping Natural Language to Primitives
2.1 Annotation and Parsing Process
The approach utilizes a manually annotated corpus of 1,000 recipe steps from Recipe1M+, where spans are labeled according to the semantic schema. A sequence-tagger model (NER-style tagging) is fine-tuned to segment each instruction into function-style descriptors, e.g., chop(What? = "the onion", Time = "until translucent").
2.2 Action Reduction by LLM
If a semantic action descriptor belongs to the primitive set, it is directly mapped; otherwise, Cook2LTL constructs a "prog-prompt" for the LLM, supplying previous examples and the parser-extracted target action for decomposition. The LLM emits a procedural breakdown in the form of primitive API calls (e.g., pick, place, turn_on), which is then cached for future reuse.
A concrete example is the reduction of "chop the onion until translucent" into the primitives pick(onion), place(onion, chopping_board), slice(onion), wait_until(translucent) (Mavrogiannis et al., 2023).
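The prog-prompt construction can be illustrated as follows; the prompt wording, `FEW_SHOT` contents, and `build_prompt` helper are assumptions for illustration, not the paper's actual template:

```python
# Hedged sketch of assembling a "prog-prompt"-style few-shot prompt.
# The exact prompt format is an assumption, not the paper's template.

FEW_SHOT = [
    ("chop(onion)",
     ["pick(onion)", "place(onion, chopping_board)", "slice(onion)"]),
]

def build_prompt(target: str) -> str:
    lines = ["# Reduce high-level cooking actions to kitchen primitives."]
    for action, primitives in FEW_SHOT:
        lines.append(f"def {action}:")          # worked example header
        lines.extend(f"    {p}" for p in primitives)
    lines.append(f"def {target}:")              # the LLM completes this body
    return "\n".join(lines)
```

Presenting examples as code-like function bodies encourages the model to answer with a parseable sequence of primitive calls rather than free text.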
3. Linear Temporal Logic Formalization
Cook2LTL generates LTL specifications using the standard syntax

φ ::= π | ¬φ | φ ∧ φ | φ ∨ φ | G φ | F φ | X φ | φ U φ

where π ranges over atomic propositions corresponding to primitive action completion ("chop_onions", etc.), and the temporal operators denote behavioral constraints:
- G φ: Globally; φ holds at all time steps.
- F φ: Eventually; φ holds at some future step.
- X φ: Next; φ holds at the next immediate step.
- φ U ψ: Until; φ holds at every step until ψ holds.
Key formula patterns include safety properties (e.g., G ¬φ), ordering constraints (e.g., ¬ψ U φ), and multi-step sequencing using nested eventually operators of the form F(a ∧ F b) (Mavrogiannis et al., 2023).
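The nested-eventually sequencing pattern F(a₁ ∧ F(a₂ ∧ … F(aₙ))) can be sketched as a small formula builder; `sequence_ltl` is an illustrative helper (not the paper's code), and `&` stands in for ∧:

```python
# Minimal sketch of the nested-eventually sequencing pattern,
# F(a1 & F(a2 & ... F(an))), used to order reduced actions.
# The helper name and syntax are illustrative assumptions.

def sequence_ltl(actions: list[str]) -> str:
    """Encode ordered execution of actions with nested F (eventually)."""
    formula = f"F {actions[-1]}"            # innermost: last action
    for a in reversed(actions[:-1]):        # wrap earlier actions outward
        formula = f"F ({a} & {formula})"
    return formula
```

Building the formula inside-out keeps each earlier action conjoined with the eventuality of everything that follows it, which is exactly the ordering guarantee the pattern provides.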
4. Example Translations
Single-Step Example:
Instruction: “Chop the onion until translucent.”
- Semantic parse: Verb = "chop", What? = "the onion", Time = "until translucent"
- Action reduction: [pick(onion), place(onion, chopping_board), slice(onion), wait_until(translucent)]
- LTL formula: F(pick(onion) ∧ F(place(onion, chopping_board) ∧ F(slice(onion) ∧ F wait_until(translucent))))
Multi-Step Example:
Instruction: “Add oil, then cook for 5 min at medium heat.”
- Parses to add(What? = "oil") and cook(Time = "5 min", Temperature = "medium heat"); both reduce to primitives.
- LTL formula: F(add(oil) ∧ F cook(5_min, medium_heat))
These constructions directly encode instructional temporality and facilitate execution by downstream planners (Mavrogiannis et al., 2023).
5. Runtime Caching and Efficiency
The dynamic action reduction cache is core to Cook2LTL's efficiency. At each parsing step, the cache is checked for previously reduced high-level actions before invoking the LLM. This mechanism turns decomposition of already-encountered actions into a simple lookup, sharply reducing redundant API requests, cost, and conversion time.
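The caching behavior amounts to memoizing the reduction function; a toy sketch in which the function name, placeholder decomposition, and call counter are all illustrative:

```python
import functools

# Toy illustration of how caching collapses repeated decompositions
# into a single LLM call; the call counter is for demonstration only.

api_calls = 0

@functools.lru_cache(maxsize=None)
def reduce_action(descriptor: str) -> tuple[str, ...]:
    global api_calls
    api_calls += 1                      # stands in for one paid LLM request
    return (f"pick({descriptor})",)     # placeholder decomposition

for _ in range(3):
    reduce_action("chop(onion)")        # only the first call hits the "LLM"
```

Since recipes reuse a small vocabulary of high-level actions, the hit rate of such a cache grows quickly over a corpus.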
Empirical results on 50 held-out recipes from Recipe1M+ demonstrate:
| System | Executability | Time (min) | Cost (USD) | API Calls |
|---|---|---|---|---|
| AR* (no cache) | 91 % | 14.85 | \$0.19 | 275 |
| AR (primitives) | 92 % | 9.89 | \$0.16 | 231 |
| Cook2LTL (+ caching) | 94 % | 6.05 | \$0.11 | 134 |
In the AI2-THOR simulation environment, Cook2LTL yields significant reductions in LLM API calls (-51%), latency (-59%), and cost (-42%), relative to a baseline with no cache (Mavrogiannis et al., 2023).
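The quoted percentage reductions follow directly from the reported numbers; a quick arithmetic check (variable names are illustrative):

```python
# Sanity check of the reported reductions: 275 -> 134 API calls,
# 14.85 -> 6.05 min, $0.19 -> $0.11, for the no-cache vs cached systems.
baseline = {"calls": 275, "time_min": 14.85, "cost": 0.19}
cached   = {"calls": 134, "time_min": 6.05,  "cost": 0.11}

# Percentage drop relative to the baseline, rounded to whole percent.
drop = {k: round(100 * (1 - cached[k] / baseline[k])) for k in baseline}
```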
6. Experimental Evaluation in Simulated Environments
Evaluations are conducted in AI2-THOR, with the primitive set limited by the simulator's kitchen manipulation API. Four representative tasks were used (microwave potato, chop tomato, cut bread, refrigerate apple), with the following results:
| Task | SR (no cache) | LLM latency (no cache) | SR (+ cache) | LLM latency (+ cache) |
|---|---|---|---|---|
| Microwave the potato | 5.4 % | 27.3 s | 8 % | 3.3 s |
| Chop the tomato | 2.4 % | 16.0 s | 4 % | 1.6 s |
| Cut the bread | 9 % | 12.9 s | 8 % | 1.1 s |
| Refrigerate the apple | 7.6 % | 14.6 s | 8 % | 1.6 s |
Metrics included success rate (SR) and LLM-induced latency per episode, indicating improved performance under action decomposition caching. These improvements are attributed to the elimination of redundant LLM calls and streamlined action translation (Mavrogiannis et al., 2023).
7. Strengths, Limitations, and Future Directions
Strengths:
- Robust open-vocabulary parsing of free-form web recipes.
- Flexible decomposition into any user-defined primitive set.
- Rigorous, temporally rich LTL specifications readily consumable by planning modules.
- Substantial reduction in LLM-driven cost and latency via dynamic action caching.
Limitations and Directions:
- Current semantic parser is trained on 1,000 steps; larger scale or self-supervised augmentation is required to ensure robustness in diverse, real-world recipe corpora.
- AI2-THOR supports only rudimentary "toy" action primitives; extension to more realistic simulators and physical robot trials (e.g., with YCB objects) is identified as future work.
- No runtime verification of LLM-generated reductions is performed; incorporation of environment-based feedback or symbolic plan checking (cf. AutoTAMP) is suggested to strengthen reliability.
The system demonstrates the capacity of LLM-guided pipelines, in concert with formal methods and adaptive caching, to transform unstructured instructional text into high-fidelity, robot-executable temporal plans, offering a foundation for further advances in automated task understanding and planning in robotics (Mavrogiannis et al., 2023).