Task–Function–Code (TFC) Framework
- The Task–Function–Code (TFC) framework is a structured methodology that decomposes complex natural language tasks into subtasks, function mappings, and executable code.
- It employs a multi-stage process including task parsing, function mapping, code generation, execution, and self-repair to enhance reliability.
- TFC is applied in tool learning, geospatial reasoning, and data science benchmarks to ensure reproducibility and metric-driven evaluations.
The Task–Function–Code (TFC) framework formalizes the decomposition, execution, and evaluation of complex tasks, particularly in tool learning, geospatial reasoning, and data science agent benchmarks. It establishes an explicit three-stage process: translating natural language tasks into well-defined subtasks (Task), mapping these subtasks to function signatures or metric definitions (Function), and constructing executable code (Code) that implements or applies these functions. TFC underpins recent work on evaluating and improving LLM agents by promoting precise task decomposition, reproducible outputs, metric-driven assessment, systematic debugging, and code reuse (Ding et al., 17 Feb 2025, Zhang et al., 19 Feb 2025, Luo et al., 10 Sep 2025).
1. Formal Definitions and Notation
The TFC framework represents processes and evaluations as ordered triples, each capturing the key stages of intelligent agent operation:
- Task (T): Represents the semantic intent or subtask, derived from an initial natural language instruction. In data science, it may index categories such as data cleaning, feature engineering, visualization, or modeling (Zhang et al., 19 Feb 2025). In tool learning, it corresponds to a user task parsed from a query (Ding et al., 17 Feb 2025). In geospatial automation, it denotes parsed subtasks or operations (e.g., buffer generation, spatial filtering) (Luo et al., 10 Sep 2025).
- Function (F): Specifies either an API/function interface with argument schema (for tool invocation or code generation), or a metric/evaluation function that assigns correctness via programmatic rules (Zhang et al., 19 Feb 2025, Ding et al., 17 Feb 2025).
- Code (C): Concrete, runnable code artifact (e.g., a Python function or script) that enacts T via F, or computes evaluation metrics (Zhang et al., 19 Feb 2025, Ding et al., 17 Feb 2025, Luo et al., 10 Sep 2025).
In formal terms, the TFC pipeline can be denoted:
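- For tool learning: a task is mapped to a function interface, which is then realized as code ($T \rightarrow F \rightarrow C$), with a repository of reusable code available during generation (Ding et al., 17 Feb 2025).
- For benchmarks: each prompt is associated with a set of triples $\{(T_i, F_i, C_i)\}_{i=1}^{n}$, where $n$ is the number of evaluation points per prompt (Zhang et al., 19 Feb 2025).
- For multi-agent systems (e.g., GeoAI): the three components correspond to task parsing (T), function instantiation (F), and executed outputs (C) (Luo et al., 10 Sep 2025).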
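To make the triple notation concrete, the following is a minimal sketch of one (Task, Function, Code) triple in Python. The class name `TFCTriple`, its field names, the helper `run_triple`, and the convention that the code stores its answer in a variable called `result` are all assumptions made for illustration; none of them come from the cited systems.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict


@dataclass(frozen=True)
class TFCTriple:
    """One (Task, Function, Code) triple; all field names are illustrative."""
    task: str                        # T: subtask description in natural language
    function: Callable[[Any], bool]  # F: here, a metric that judges an output
    code: str                        # C: source of a runnable Python snippet


def run_triple(triple: TFCTriple) -> bool:
    """Execute C in a fresh namespace and score its `result` variable with F."""
    namespace: Dict[str, Any] = {}
    # Assumption: the code is trusted. A real system would sandbox this call.
    exec(triple.code, namespace)
    return triple.function(namespace.get("result"))


triple = TFCTriple(
    task="Count rows with a missing value",
    function=lambda out: out == 2,
    code="rows = [1, None, 3, None]\nresult = sum(r is None for r in rows)",
)
print(run_triple(triple))
```

Here F plays the role of an evaluation function, as in the benchmark setting; in a tool-learning setting the same slot would hold an interface description instead.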
2. Architectural Workflows and Algorithmic Design
All leading TFC instantiations implement a multi-stage loop involving decomposition, mapping, code synthesis/execution, and feedback:
| Stage | Description | Role in TFC |
|---|---|---|
| Task Parsing | Natural language prompt → subtask(s) via structured breakdown | Task (T) |
| Function Mapping | Subtask → function signature (API, schema, or metric) | Function (F) |
| Code Generation | Pseudocode → runnable script, using LLM codegen or retrieval | Code (C) |
| Execution | Run the code; capture output or error | Code (C), executed |
| Reflection | On error, revise pseudocode or code via LLM, retry (yielding robustness) | Feedback/Repair |
| Repository | Store successful (interface, implementation) pairs for future reuse | Efficiency |
In ToolCoder, for instance, natural language tasks are mapped to function scaffolds with type annotations and descriptive docstrings. These scaffolds are systematically decomposed into stepwise comments, serving as abstract plans. LLMs fill these scaffolds with code, either by generating new routines or by inlining previously successful implementations from the repository. The execution environment either yields the final output or a Python traceback, enabling error localization and iterative self-repair up to a fixed attempt budget. Each successful implementation is stored, reducing the number of redundant LLM inferences and improving overall efficiency (Ding et al., 17 Feb 2025). The GeoJSON Agents pipeline extends this approach to multi-agent coordination, contrasting explicit function calling (API invocation) with dynamic code assembly and sandboxed execution for flexible GIS workflows (Luo et al., 10 Sep 2025).
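The scaffold-then-fill idea with repository reuse can be sketched as follows. This is an illustrative approximation, not ToolCoder's implementation: `SCAFFOLD`, `stub_fill` (a stand-in for an LLM call), `implement`, and the use of a signature string as the repository key are all assumptions.

```python
from typing import Callable, Dict, Tuple

# Hypothetical scaffold: signature, docstring, and stepwise comments as the plan.
SCAFFOLD = '''def count_even(numbers: list) -> int:
    """Return how many integers in `numbers` are even."""
    # Step 1: keep only the even values
    # Step 2: count them
'''


def stub_fill(scaffold: str) -> str:
    """Stand-in for an LLM call that completes the scaffold body."""
    return scaffold + "    return sum(1 for n in numbers if n % 2 == 0)\n"


def implement(signature: str, scaffold: str, repo: Dict[str, str],
              fill: Callable[[str], str]) -> Tuple[str, bool]:
    """Return (source, reused), preferring a stored implementation."""
    if signature in repo:
        return repo[signature], True
    source = fill(scaffold)
    exec(source, {})          # defining the function raises on invalid code
    repo[signature] = source  # stored only after it loaded without error
    return source, False


repo: Dict[str, str] = {}
sig = "count_even(numbers: list) -> int"
first_source, first_reused = implement(sig, SCAFFOLD, repo, stub_fill)
_, second_reused = implement(sig, SCAFFOLD, repo, stub_fill)

namespace: dict = {}
exec(first_source, namespace)
print(namespace["count_even"]([1, 2, 4]), first_reused, second_reused)
```

The second call returns the stored source without invoking the filler again, which is the efficiency gain the repository is meant to provide.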
3. Task Decomposition and Planning Strategies
Systematic decomposition of high-level user tasks is a central tenet of TFC frameworks. Decomposition is operationalized as follows:
- Subtask Extraction: LLM interpreters (e.g., chain-of-thought or hierarchical planning modules) partition complex instructions into a sequence or tree of atomic operations. For example, a prompt "Find the number of movies directed by Sofia Coppola" is decomposed to: search for person, fetch movie credits, filter by job, count (Ding et al., 17 Feb 2025).
- Hierarchical Structuring: Extracted subtasks are formalized as code comments, function signatures, or evaluation points, ensuring transparent, reproducible planning.
- Multi-Agent Planning: Agents such as "Planner" (parsing and assignment) and "Worker" (execution) collaboratively ensure that decomposition and execution proceed in a closed feedback loop with rapid error detection and correction (Luo et al., 10 Sep 2025).
This decomposition underpins detailed program evaluation in benchmarks: each prompt is mapped to a set of TFC triples, each associated with a distinct evaluation function and code snippet (Zhang et al., 19 Feb 2025).
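The four-step decomposition of the movie-count example can be written as a commented function whose body follows the extracted subtasks one by one. The helper names (`search_person`, `fetch_movie_credits`, `filter_by_job`) and the in-memory records are hypothetical stand-ins for real API calls; the records are invented and carry no factual content.

```python
from typing import Dict, List

# Invented toy records standing in for a movie API; not real data.
PEOPLE: Dict[str, int] = {"Sofia Coppola": 1}
CREDITS: Dict[int, List[dict]] = {
    1: [
        {"title": "Film A", "job": "Director"},
        {"title": "Film B", "job": "Producer"},
        {"title": "Film C", "job": "Director"},
    ]
}


def search_person(name: str) -> int:
    """Look up a person identifier by name."""
    return PEOPLE[name]


def fetch_movie_credits(person_id: int) -> List[dict]:
    """Return all credits recorded for the person."""
    return CREDITS[person_id]


def filter_by_job(credits: List[dict], job: str) -> List[dict]:
    """Keep only the credits with the requested job."""
    return [c for c in credits if c["job"] == job]


def count_directed_movies(name: str) -> int:
    person_id = search_person(name)                # Subtask 1: search for person
    credits = fetch_movie_credits(person_id)       # Subtask 2: fetch movie credits
    directed = filter_by_job(credits, "Director")  # Subtask 3: filter by job
    return len(directed)                           # Subtask 4: count


print(count_directed_movies("Sofia Coppola"))
```

Keeping one line per subtask makes the plan auditable: each comment can be checked against the original instruction, and each helper can be tested separately.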
4. Code Generation, Execution, and Debugging
TFC frameworks mandate executable realization of planned operations. Two major paradigms are employed, complemented by a self-repair loop:
- Function Interface Invocation: For well-covered subtasks, the system invokes predefined APIs, offering high stability and lower variance (e.g., GeoJSON function calls) (Luo et al., 10 Sep 2025).
- Dynamic Code Synthesis: For open-ended tasks, LLMs generate complete scripts (e.g., producing GPS buffer or GeoPandas code for spatial analysis), executed within sandboxed, resource-controlled environments (Luo et al., 10 Sep 2025, Ding et al., 17 Feb 2025).
- Self-Repair Loop: On code failure, the TFC framework uses traceback-guided reflection—invoking the LLM to revise code based on captured errors. This process iterates up to a maximum attempt threshold, after which failure is reported (Ding et al., 17 Feb 2025, Luo et al., 10 Sep 2025).
A key mechanism is the repository (in ToolCoder), which aggregates successful (interface, implementation) pairs, facilitating rapid code reuse and reducing the computational cost for recurring subtasks (Ding et al., 17 Feb 2025).
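A traceback-guided self-repair loop with a fixed attempt budget might look like the sketch below. The names `run_with_repair` and `stub_revise` are hypothetical, and the reviser is a hard-coded stand-in for the LLM call that would normally read the traceback and propose a fix.

```python
import traceback
from typing import Any, Callable, Tuple


def run_with_repair(code: str, revise: Callable[[str, str], str],
                    max_attempts: int = 3) -> Tuple[Any, int]:
    """Execute `code`; on failure pass the traceback to `revise` and retry.

    Returns (result, attempts_used) or raises RuntimeError once the
    attempt budget is exhausted.
    """
    last_error = ""
    for attempt in range(1, max_attempts + 1):
        namespace: dict = {}
        try:
            # Assumption: trusted code. A real system would sandbox this call.
            exec(code, namespace)
            return namespace.get("result"), attempt
        except Exception:
            last_error = traceback.format_exc()
            if attempt < max_attempts:
                code = revise(code, last_error)
    raise RuntimeError(f"failed after {max_attempts} attempts:\n{last_error}")


def stub_revise(code: str, error: str) -> str:
    """Stand-in for an LLM reviser: patches one known failure pattern."""
    if "ZeroDivisionError" in error:
        return code.replace("/ 0", "/ 2")
    return code


result, attempts = run_with_repair("result = 10 / 0", stub_revise)
print(result, attempts)
```

Reporting the number of attempts alongside the result keeps the cost of repair visible, which matters when each revision is an LLM inference.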
5. Evaluation Methodologies and Metrics
TFC-based systems adopt multi-level and fine-grained evaluation, centered on explicit metrics:
- Success Rate: the fraction of tasks whose final output exactly matches the ground truth (Ding et al., 17 Feb 2025).
- Path Correctness: a measure of the correctness of the intermediate calls made on the way to the answer (Ding et al., 17 Feb 2025).
- Execution Reliability: the frequency of error-free runs (Ding et al., 17 Feb 2025).
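- Completion Rate (CR): For DataSciBench (Zhang et al., 19 Feb 2025), step scores are aggregated and normalized by the maximum attainable total:
$$\mathrm{CR} = \frac{\sum_{t=1}^{N} s_t}{N \cdot s_{\max}}$$
where $s_t$ is the score of step $t$ (missing, non-compliant, or compliant), $N$ is the number of steps, and $s_{\max}$ is the maximum step score.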
- Success Rate (SR): Proportion of prompt runs in which every TFC triple passes, averaged over repeated runs.
- Task/Step-Resolved Metrics: Each TFC triple is scored, enabling diagnosis of specific model or agent deficiencies (Zhang et al., 19 Feb 2025).
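The two benchmark metrics can be illustrated with a short sketch. It assumes that CR is the sum of step scores divided by the maximum attainable total, and that steps are scored 0 (missing), 1 (non-compliant), or 2 (compliant); both the numeric scale and the function names are assumptions made for illustration rather than the benchmark's reference implementation.

```python
from typing import List, Sequence


def completion_rate(step_scores: Sequence[int], max_score: int = 2) -> float:
    """Sum of step scores normalized by the maximum attainable total.

    Assumed scale: 0 = missing, 1 = non-compliant, 2 = compliant.
    """
    if not step_scores:
        raise ValueError("at least one step score is required")
    return sum(step_scores) / (len(step_scores) * max_score)


def success_rate(runs: List[List[bool]]) -> float:
    """Fraction of runs in which every per-triple check passed."""
    if not runs:
        raise ValueError("at least one run is required")
    return sum(all(checks) for checks in runs) / len(runs)


print(completion_rate([2, 1, 0, 2]))  # 5 of 8 attainable points
print(success_rate([[True, True], [True, False], [True, True], [True, True]]))
```

Because CR gives partial credit per step while SR is all-or-nothing per run, reporting both separates "mostly right" runs from fully correct ones.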
Empirical results indicate that the code-centric TFC approach in ToolCoder yields higher success, accuracy, and path correctness than the compared baselines; for example, it reaches 85% success versus 80% for CodeAct on RestBench-TMDB, and higher correctness on API-Bank (Ding et al., 17 Feb 2025). In GeoJSON Agents, explicit code generation achieves 97.14% accuracy versus 85.71% for function-calling agents (Luo et al., 10 Sep 2025).
6. Applications and System Instantiations
TFC has seen adoption in a range of LLM-agent systems:
- ToolCoder: Implements TFC for code-empowered tool learning, using code scaffolding, iterative refinement, reusable repositories, and error trace-based debugging for API-based reasoning. ToolCoder empirically improves execution reliability and task completion, validating the TFC paradigm (Ding et al., 17 Feb 2025).
- DataSciBench: Leverages TFC to formalize and automate the evaluation of LLM agents for data science, enforcing a fine-grained taxonomy of tasks, explicit evaluation functions, and reproducible code-driven traces over 222 “hard” prompts (Zhang et al., 19 Feb 2025).
- GeoJSON Agents: Instantiates TFC in a multi-agent GeoAI setting, partitioning responsibilities into “Planner” (task parsing/mapping) and “Worker” (execution), contrasting structured API invocation with dynamic script generation for geospatial automation (Luo et al., 10 Sep 2025).
These applications capitalize on TFC’s decomposition and metricization to improve agent reliability, transparency, and diagnosability across multi-step, multi-modality pipelines.
7. Best Practices, Limitations, and Outlook
Several operational best practices emerge from cross-domain TFC deployments (Luo et al., 10 Sep 2025):
- Maintain robust API/function libraries for stability in standardized subtasks.
- Provide a controlled sandbox for open-ended code execution and LLM-driven self-debugging.
- Ensure thorough task decomposition (e.g., chain-of-thought or tree-of-thought) to clarify vague NL requirements.
- Monitor resource usage (execution rounds, token context) to balance cost and latency.
- Consider hybrid switching: route covered subtasks to function libraries, delegate novel ones to code generation pathways.
- Log all subtasks, function mappings, code generations, and errors for reproducibility and systematic audit.
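Hybrid switching and audit logging, two of the practices listed above, can be combined in a small router. Everything here is a hypothetical sketch: `LIBRARY`, `stub_generate` (a stand-in for LLM code generation), `route`, and the convention that generated code reads `arg` and sets `result` are assumptions, not part of any cited system.

```python
from typing import Any, Callable, Dict, List

# Hypothetical library of predefined functions for well-covered subtasks.
LIBRARY: Dict[str, Callable[[float], float]] = {
    "square": lambda x: x * x,
}


def stub_generate(subtask: str) -> str:
    """Stand-in for LLM code generation; the code reads `arg`, sets `result`."""
    if subtask == "cube":
        return "result = arg * arg * arg"
    raise ValueError(f"no generator available for subtask {subtask!r}")


def route(subtask: str, arg: float, audit_log: List[Dict[str, Any]]) -> float:
    """Use the library when the subtask is covered, else generate and run code."""
    if subtask in LIBRARY:
        mode, output = "function_call", LIBRARY[subtask](arg)
    else:
        namespace: Dict[str, Any] = {"arg": arg}
        # Assumption: trusted code. A real system would sandbox this call.
        exec(stub_generate(subtask), namespace)
        mode, output = "code_generation", namespace["result"]
    # Record every decision so that runs can be reproduced and audited.
    audit_log.append({"subtask": subtask, "mode": mode, "output": output})
    return output


log: List[Dict[str, Any]] = []
print(route("square", 3.0, log), route("cube", 2.0, log))
print([entry["mode"] for entry in log])
```

Logging the chosen mode per subtask also shows how often the system falls back to code generation, which is a useful signal for deciding which functions to add to the library.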
A plausible implication is that widespread adoption of TFC may standardize evaluation practices for LLMs, offering reproducibility and finer-grained failure modes compared with ad hoc benchmarks. However, the effectiveness of TFC depends on reliable LLM planning, complete function coverage, and accurate error-handling routines, with open research directions in robust decomposition for ambiguous tasks and scaling to increasingly complex agent architectures (Ding et al., 17 Feb 2025, Zhang et al., 19 Feb 2025, Luo et al., 10 Sep 2025).