
TODSynth: Task-Oriented Data Synthesis Framework

Updated 25 December 2025
  • TODSynth is a framework that generates structured synthetic data by integrating explicit task hierarchies and control signals.
  • It employs modular pipelines such as state graphs, subtask chains, and partition trees to finely control data difficulty, semantic accuracy, and diversity.
  • Empirical benchmarks show that TODSynth reduces annotation costs while improving downstream performance across dialogue systems, generalist agents, and remote sensing applications.

The Task-Oriented Data Synthesis Framework (TODSynth) comprises a family of scalable methodologies for generating structured, high-quality, and task-compliant synthetic data tailored to task-oriented applications such as dialogue systems, generalist agents, and domain-specific model training. TODSynth explicitly incorporates the task structure and control signals into the generation loop, enabling fine-grained regulation of data difficulty, coverage, semantics, and downstream utility.

1. General Architecture and Core Design Patterns

TODSynth instantiates a series of formally grounded, highly modular multi-component pipelines. Each variant is characterized by the explicit encoding and exploitation of target task structure within the synthesis workflow. Typical components include:

  • Task/State Graphs or Chains: Task decomposition is governed by explicit state-transition or action graphs (as in SynTOD (Samarinas et al., 23 Apr 2024) and GraphTOD (Medjad et al., 21 Jan 2025)), by subtask chains (AgentSynth (Xie et al., 17 Jun 2025)), or by tree-based partitions (TreeSynth (Wang et al., 21 Mar 2025)).
  • Controllable Generative Orchestrators: LLMs, diffusion models, or equivalent sequence generators act under the conditioning of intermediate representations such as states, actions, personas, or masks (e.g., MM-DiT with tri-attention (Yang et al., 18 Dec 2025)).
  • Rollout/Execution/Logging Loops: For generalist agents, the system samples or executes action trajectories, logging state–action–response triples at each step.
  • Task–Trajectory Bundles: Data output is commonly represented as composite pairs $(\tau, D)$, where $\tau$ is a high-level task instruction and $D$ is a set of corresponding solution trajectories or dialogues.

Algorithmic control is achieved via rule-based traversal of the graph, tree, or chain, and through dynamic interaction among multiple trained or prompted agents that simulate the user side, the system side, or the environment. For complex domains such as remote sensing, mask- and text-conditioned generative backbones (e.g., MM-DiT) provide structured image synthesis (Yang et al., 18 Dec 2025).
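To make the graph-governed control concrete, the following minimal Python sketch samples a valid state–action walk from an explicit transition graph; the `TRANSITIONS` dictionary, the food-ordering domain, and the uniform sampling policy are illustrative assumptions rather than the SynTOD or GraphTOD implementation, and each sampled step would in practice condition an LLM that writes the corresponding dialogue turns.

```python
import random

# Hypothetical state-transition graph for a food-ordering dialogue domain.
# Keys are states; values are lists of (action, next_state) transitions.
TRANSITIONS = {
    "start":   [("greet", "browse")],
    "browse":  [("search_menu", "browse"), ("select_item", "cart")],
    "cart":    [("add_item", "browse"), ("checkout", "payment")],
    "payment": [("confirm_payment", "end")],
}

def sample_walk(graph, start="start", end="end", max_steps=12, seed=None):
    """Sample one valid state-action walk; each step would later condition
    the generator (e.g., an LLM) that realizes the corresponding turns."""
    rng = random.Random(seed)
    state, walk = start, []
    for _ in range(max_steps):
        if state == end or state not in graph:
            break
        action, next_state = rng.choice(graph[state])
        walk.append((state, action, next_state))
        state = next_state
    return walk

if __name__ == "__main__":
    for step in sample_walk(TRANSITIONS, seed=0):
        print(step)
```

Because the walk is constrained to the graph, every generated dialogue skeleton is valid by construction, which is the key property exploited by graph-guided synthesis.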

2. Formalism, Data Structures, and Task Composition

TODSynth frameworks formalize the synthesis problem as follows:

  • Task Definition: Each data sample is associated with a task $T = (\tau, D)$, where $\tau \in \text{Text}$ is a compositional, natural-language instruction (produced by a task summarizer) and $D$ is the set of demonstration trajectories satisfying $\tau$ (Xie et al., 17 Jun 2025).
  • Task Segmentation: In subtask-based frameworks, a long-horizon instruction is decomposed into $N$ subtasks $T_{\mathrm{sub},0}, \ldots, T_{\mathrm{sub},N-1}$. The composition function $\mathrm{Summarize}(T_{\mathrm{sub},0}, \ldots, T_{\mathrm{sub},d-1})$ yields progressively more challenging instructions as $d$ increases.
  • Trajectory Representation: A trajectory is a sequence $\xi = \{(s_t, a_t)\}_{t=1}^{L}$ of states (e.g., environment screenshots) and executed actions.
  • Graph-Based Dialogues: State-transition graphs $G = (S, A, T)$ define permissible states, actions, and transitions for dialogue systems, enabling sampling of valid and diverse task-oriented conversations (Samarinas et al., 23 Apr 2024, Medjad et al., 21 Jan 2025).
  • Partition Trees: In TreeSynth, the space of possible instructions is recursively partitioned into $M$ leaf subspaces $S_i$ using LLM-guided splitting, ensuring mutual exclusivity and joint coverage (Wang et al., 21 Mar 2025).

Difficulty control is achieved by varying the number of subtasks summarized ($d$), the depth or breadth of a walk or partition, or the complexity of masks/prompts. Unified data structures support flexible downstream use, with each synthetic instance precisely annotated by its compositional history and context.
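The task and trajectory structures, and the Summarize-based difficulty control, can be sketched as follows; the dataclass names and the string-joining `summarize` stub are assumptions standing in for the LLM-based task summarizer, not the AgentSynth API.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

State = str    # e.g., an identifier for an environment screenshot
Action = str   # e.g., an executed GUI or API action

@dataclass
class Trajectory:
    steps: List[Tuple[State, Action]] = field(default_factory=list)  # xi = {(s_t, a_t)}_{t=1}^{L}

@dataclass
class Task:
    instruction: str            # tau: compositional natural-language instruction
    demos: List[Trajectory]     # D: demonstration trajectories satisfying tau

def summarize(subtasks: List[str]) -> str:
    """Stub for the LLM-based task summarizer: fuse the first d subtasks into
    one composite instruction (a real system would prompt an LLM here)."""
    return ", then ".join(subtasks) + "."

def compose_task(subtasks: List[str], demos: List[Trajectory], d: int) -> Task:
    """Difficulty scales with d: summarizing more subtasks lengthens the horizon."""
    return Task(instruction=summarize(subtasks[:d]), demos=demos[:d])

# Example: d = 3 yields a harder composite task than d = 1.
chain = ["open the browser",
         "search for flight prices",
         "add the cheapest result to a spreadsheet"]
task = compose_task(chain, demos=[Trajectory() for _ in chain], d=3)
print(task.instruction)
```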

3. Algorithmic Workflow and Implementation Paradigms

The generative workflow in TODSynth follows structured, iterative procedures:

  • Initialization: Inputs include user-specified schema (persona, ontology, graph/tree structure, masks), optional real seed data, and LLM or diffusion model checkpoints.
  • Iterative Generation: At each step, components interact as per a predefined protocol:
    • Task Proposer/Follow-Up Agent generates or selects the next atomic instruction (Xie et al., 17 Jun 2025).
    • Execution/Simulation Agents enact the specified step, grounded to the environment, and log all transitions.
    • Verifier and Reviser modules correct failed executions, looping back as needed.
    • Summarization Agents combine sequences of subtasks into composite instructions, directly controlling task horizon and complexity.
  • Data Synthesis and Postprocessing: The output dataset consists of natural language instructions, full action/state trajectories, and associated context metadata. Mask-conditioned frameworks further generate (image, mask) pairs under triple-modality attention, followed by plug-and-play rectification via Control-Rectify Flow Matching (CRFM) to optimize semantic alignment (Yang et al., 18 Dec 2025).

Implementation is modular, supporting plug-and-play substitution of components, compatibility with multiple generative model families, and extension with custom postprocessing or filtering pipelines. Joint logging of intermediate outputs and meta-information ensures interpretability and reproducibility.
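A hedged sketch of such an iterative protocol is shown below; the `propose`/`execute`/`verify`/`summarize` callables and the retry policy are illustrative assumptions rather than any specific framework's interface.

```python
from typing import Callable, List, Tuple

LogEntry = Tuple[str, str]  # (state, action) recorded by the execution agent

def synthesize_episode(
    propose: Callable[[List[str]], str],        # task proposer / follow-up agent
    execute: Callable[[str], LogEntry],         # grounded execution + logging
    verify: Callable[[str, LogEntry], bool],    # verifier module
    summarize: Callable[[List[str]], str],      # summarization agent
    n_subtasks: int = 5,
    max_retries: int = 2,
):
    """One synthesis episode: propose atomic subtasks, execute them in the
    environment, retry failed steps, then summarize the accepted chain into a
    composite instruction paired with the logged trajectory."""
    subtasks: List[str] = []
    trajectory: List[LogEntry] = []
    for _ in range(n_subtasks):
        subtask = propose(subtasks)
        for _attempt in range(max_retries + 1):
            step = execute(subtask)
            if verify(subtask, step):
                break                    # verified: keep this logged step
        else:
            continue                     # all retries failed: drop the subtask
        subtasks.append(subtask)
        trajectory.append(step)
    composite_instruction = summarize(subtasks)  # controls horizon/difficulty
    return composite_instruction, trajectory
```

The verify-and-retry loop plays the role of the Verifier/Reviser modules, while the final summarization step produces the composite instruction that is bundled with the logged trajectory.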

4. Task Difficulty, Control Mechanisms, and Cost Models

A hallmark of TODSynth is fine-grained, explicit control over the difficulty and nature of synthesized data:

  • Difficulty Scaling by Composition: For chained subtasks, increasing $d$ (the number of summarized steps) creates a spectrum of tasks with measured difficulty escalation, as quantified by success rates of SOTA LLM agents (Xie et al., 17 Jun 2025).
  • Partition-Based Diversity Control: TreeSynth's recursive stratification into mutually exclusive subspaces allocates sampling resources evenly or proportionally, countering space collapse and achieving both diversity and coverage (Wang et al., 21 Mar 2025); a budget-allocation sketch follows this list.
  • Sampling Control via Task Feedback: CRFM introduces dynamically adjusted velocity fields in diffusion trajectories, integrating downstream segmentation loss gradients for early-step correction—improving relevance for semantic segmentation (Yang et al., 18 Dec 2025).
  • Prompt and Walk Policies: In dialogue-oriented settings, graph-guided multi-prompt designs enforce rare intent coverage, slot filling robustness, and lexically diverse outputs (Samarinas et al., 23 Apr 2024, Medjad et al., 21 Jan 2025).
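As a concrete illustration of partition-based diversity control, the following sketch allocates a fixed sample budget proportionally across leaf subspaces; the weighting scheme and the leaf descriptors are assumptions for illustration, not the TreeSynth algorithm itself.

```python
def allocate_budget(leaf_weights, total_samples):
    """Distribute a sample budget over M mutually exclusive leaf subspaces,
    proportionally to (estimated) subspace weight, assigning remainders to the
    largest fractional parts so the counts sum exactly to the budget."""
    total_w = sum(leaf_weights.values())
    raw = {k: total_samples * w / total_w for k, w in leaf_weights.items()}
    counts = {k: int(v) for k, v in raw.items()}
    leftover = total_samples - sum(counts.values())
    for k in sorted(raw, key=lambda k: raw[k] - counts[k], reverse=True)[:leftover]:
        counts[k] += 1
    return counts

# Example: four leaf subspaces produced by LLM-guided recursive partitioning.
leaves = {"algebra/word-problems": 3, "geometry/proofs": 1,
          "number-theory/modular": 2, "combinatorics/counting": 2}
print(allocate_budget(leaves, total_samples=100))
```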

Token and inference cost models are reported; e.g., AgentSynth achieves $0.60 per high-quality trajectory, amortized to $0.10 per task, orders of magnitude below human annotation (Xie et al., 17 Jun 2025). The cost is driven by LLM inference, trajectory length, and postprocessing overhead.

5. Empirical Benchmarks and Impact on Downstream Learning

TODSynth has been rigorously evaluated in diverse domains:

  • Task Success and Benchmark Difficulty: Agent performance degrades rapidly with increased composite task length: e.g., the GPT-4.1 success rate falls from 16% to 0% between difficulty levels 1 and 6, while human success persists at about 70% for the toughest tasks (Xie et al., 17 Jun 2025).
  • Dialogue Dataset Quality: Graph-guided and graph-based simulation frameworks consistently outperform naive prompt-based baselines, e.g., intent classification accuracy of 95.8% (graph-guided) vs. 76.2% (single-prompt) (Samarinas et al., 23 Apr 2024).
  • Remote Sensing Segmentation: MM-DiT with tri-attention and CRFM boosts mean IoU and mean accuracy over both real-only baselines and alternative synthetic generators, notably in few-shot and complex-scene settings (Yang et al., 18 Dec 2025).
  • Dataset Diversity and Coverage: TreeSynth demonstrates up to 17.6% absolute downstream improvement and a 45.2% reduction in average pairwise sample similarity compared to evol-instruct and temperature sampling, with near-linear scaling as the sample budget increases (Wang et al., 21 Mar 2025).

Unified evaluation protocols, including intent/slot metrics, downstream model performance, and coverage/diversity scores, are systematically applied.
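As one example of a coverage/diversity score, average pairwise similarity over sample embeddings can serve as a simple proxy; the toy embeddings below are placeholders for a real sentence encoder, and the metric is a generic illustration rather than the exact measure used in the cited papers.

```python
from itertools import combinations
import math

def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def avg_pairwise_similarity(embeddings):
    """Diversity proxy: lower average pairwise cosine similarity over sample
    embeddings indicates broader coverage of the instruction space."""
    pairs = list(combinations(embeddings, 2))
    if not pairs:
        return 0.0
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)

# Toy example with 3-dimensional stand-in embeddings; a real pipeline would
# embed each synthetic instruction with a sentence encoder.
samples = [[0.9, 0.1, 0.0], [0.1, 0.8, 0.1], [0.0, 0.2, 0.9]]
print(round(avg_pairwise_similarity(samples), 3))
```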

| Method | Main Application | Key Control Mechanism |
| --- | --- | --- |
| AgentSynth (Xie et al., 17 Jun 2025) | Generalist agent tasks | Subtask chaining + summarization |
| SynTOD (Samarinas et al., 23 Apr 2024) | Dialogue systems | State-transition graph walks |
| GraphTOD (Medjad et al., 21 Jan 2025) | Dialogue generation | JSON state graphs + dual agents |
| TreeSynth (Wang et al., 21 Mar 2025) | All task domains | Global attribute partitioning |
| TODSynth (Yang et al., 18 Dec 2025) | Semantic segmentation | Triple-attention MM-DiT + CRFM |

6. Exemplary Tasks, Realizations, and Domain Generality

TODSynth has produced extensive benchmarks across application domains:

  • Composite Generalist Agent Tasks (AgentSynth): Example chains require sequential web search, cross-application GUI operations, calendar event entry, and multi-level reasoning. Tasks are designed to expose agent limitations, quantify horizon-dependent difficulty, and integrate several application domains (web, OS, office, coding, research) (Xie et al., 17 Jun 2025).
  • Dialogues with Controllable Schema/Grounded Fusions: SynTOD, GraphTOD, and ODD-TOD combinatorial simulators yield rich, schema-driven dialogue datasets, supporting rare intent injection and cross-domain evaluation (Samarinas et al., 23 Apr 2024, Li et al., 2022, Medjad et al., 21 Jan 2025).
  • Remote Sensing Synthesis: The framework generates high-resolution aerial imagery and mask pairs under complex semantic control, outperforming prior art in complex-scene and domain adaptation settings (Yang et al., 18 Dec 2025).
  • Instruction Diversity: TreeSynth's coverage-driven approach drives exploration of underrepresented modes and yields robust, attribute-balanced training sets for code, mathematics, and reasoning tasks (Wang et al., 21 Mar 2025).

This cross-domain generality underscores the flexibility of the TODSynth paradigm—essential for transferability to emergent, data-scarce, or high-complexity workflows.

7. Limitations, Open Problems, and Extensibility

Although TODSynth delivers state-of-the-art synthesis quality and control, several limitations remain:

  • Manual Graph/Tree Construction: For domains with large or nontrivial schemas, encoding state graphs or partitioning heuristics can be a bottleneck. Automated structure discovery remains an open research direction (Samarinas et al., 23 Apr 2024, Wang et al., 21 Mar 2025).
  • LLM Cost and Bias: The quality and balance of generated data are functions of LLM design, LLM sampling noise, and inference cost; large-scale rollouts remain dependent on API efficiency and prompt engineering (Wang et al., 21 Mar 2025).
  • Real-World Generalization: Though synthetic data improves measured downstream performance, the gap to performance on genuinely unseen real data, or in highly dynamic environments, persists.
  • Extension to Continuous/Multimodal Spaces: Current partitioning and control mechanisms are better suited to discrete or finitely describable feature spaces; extensions to continuous-valued or richly multimodal attributes (e.g., videos) are under active development (Wang et al., 21 Mar 2025, Yang et al., 18 Dec 2025).

Extensibility is supported by modular architecture, documentation, and open-source release, enabling adaptation to new modalities, domains, and evaluation criteria.


TODSynth establishes a systematic paradigm for task-aware, controllable, and empirically validated data synthesis, advancing both practical dataset construction and the underlying science of task-aware generative modeling. Its range—from complex multi-agent computer-use logs and dialogue flows to remote sensing semantic segmentation—demonstrates broad applicability and operational rigor (Xie et al., 17 Jun 2025, Samarinas et al., 23 Apr 2024, Yang et al., 18 Dec 2025, Medjad et al., 21 Jan 2025, Wang et al., 21 Mar 2025).
