
Expert-Agent Pair Data Generation Pipeline

Updated 12 December 2025
  • The paper introduces a novel paired data generation pipeline that unifies expert demonstrations and agent rollouts to enhance training efficacy.
  • It details a unified JSON schema and balanced sampling strategy that mitigate data heterogeneity while ensuring reproducibility across environments.
  • The work provides principled metrics for assessing data quality and downstream agent performance, integrating imitation and reinforcement learning signals.

An Expert–Agent Pair Data Generation Pipeline is a modular framework for constructing high-fidelity datasets comprising interaction records between “expert” (human or oracle) actors and autonomous agents. These pipelines enable effective training, evaluation, and benchmarking of agentic systems, especially in multi-environment, multi-turn, or tool-use tasks. This paradigm standardizes data collection, curation, unification, and quality control, mitigating data heterogeneity and leakage, and enabling effective mixing of imitation learning and reinforcement learning signals. The following sections delineate the canonical architectural stages, schema formalism, sampling strategies, training workflows, and principled metrics that define the state-of-the-art in this domain, with emphasis on unified design patterns exemplified by AgentOhana and related frameworks (Zhang et al., 2024).

1. Paired Data Aggregation and Environment Instrumentation

Paired pipelines ingest two primary data streams: expert demonstrations and agent-generated rollouts.

  • Expert Demonstrations: Recorded multi-turn trajectories from either human operators (via GUI/CLI) or automated “oracle” scripts. Each step is annotated with metadata (timestamp, task ID, user ID, action, reward, etc.) and can be augmented with ground-truth scripts for environments such as web navigation or question answering.
  • Agent Rollouts: Produced by checkpointed policies \pi_t through environment interaction. Each agent step logs analogous fields, augmented with policy logits and exploration flags. Both streams can be collected from diverse environments via wrappers that guarantee the standardized output \{\text{observation}, \text{available\_actions}, \text{reward}, \text{done}, \text{info}\}.

This dual collection ensures coverage of both the expert and agent policy spaces and supports hybrid supervision objectives.
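The wrapper contract described above can be sketched as follows. The class and method names (including `ToyEnv` and `available_actions()`) are illustrative assumptions for this sketch, not part of the cited framework:

```python
class StandardizedEnvWrapper:
    """Wrap an environment so every step emits the unified record
    {observation, available_actions, reward, done, info}."""

    def __init__(self, env):
        self.env = env

    def reset(self):
        return {"observation": self.env.reset(),
                "available_actions": self.env.available_actions(),
                "reward": 0.0, "done": False, "info": {}}

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        return {"observation": obs,
                "available_actions": self.env.available_actions(),
                "reward": float(reward), "done": bool(done), "info": info}


class ToyEnv:
    """Minimal stand-in environment used only to exercise the wrapper."""
    def reset(self):
        self.t = 0
        return "start"

    def step(self, action):
        self.t += 1
        return f"state_{self.t}", 1.0, self.t >= 2, {}

    def available_actions(self):
        return ["left", "right"]


env = StandardizedEnvWrapper(ToyEnv())
env.reset()
record = env.step("left")
```

Because every environment emits the same five keys, downstream featurization and logging code never branches on the environment type.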

2. Unified Schema and Trajectory Representation

A universal JSON-based schema encodes trajectories as sequences of expert–agent paired turns. Each trajectory record includes environment context, random seed, detailed metadata, and arrays of step-wise records. Each step contains the following fields:

  • t: Discrete step index.
  • obs: Raw or processed observation.
  • avail_actions: Enumerated action set.
  • expert_action, agent_action: Action fields, null-masked for non-applicable roles.
  • reward, done: Task reward and episode-termination flag.

Tabulated, a step aligns as:

| traj_id | env | source | t | obs_t | a_t^E | a_t^A | r_t | done |
|---------|-----|--------|---|-------|-------|-------|-----|------|
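A single step and its enclosing trajectory might be represented as plain records along the following lines; all field values, and the environment name, are illustrative:

```python
# One step of a trajectory; expert-sourced steps null-mask the agent field.
step = {
    "t": 3,
    "obs": "You are on the checkout page.",
    "avail_actions": ["click[buy]", "click[back]"],
    "expert_action": "click[buy]",   # populated: this step comes from an expert
    "agent_action": None,            # null-masked: no agent role at this step
    "reward": 0.0,
    "done": False,
}

# Trajectory-level record carrying environment context and metadata.
trajectory = {
    "traj_id": "traj_0001",
    "env": "webshop",    # illustrative environment name
    "source": "expert",
    "seed": 42,
    "steps": [step],
}
```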

Normalizations (per-environment zero mean/unit variance for numeric features) are standard:

\mu_j = \frac{1}{N_E} \sum_{i:\,\text{env}=E} x_{ij}

\sigma_j = \sqrt{\frac{1}{N_E} \sum_{i:\,\text{env}=E} (x_{ij} - \mu_j)^2}

x'_{ij} = \frac{x_{ij} - \mu_j}{\sigma_j + \epsilon}

This schema enables environment-agnostic loading and batch-level featurization (Zhang et al., 2024).
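The per-environment normalization above amounts to a few lines of array code; this is a minimal sketch (function name and shapes are assumptions for illustration):

```python
import numpy as np

def normalize_per_env(x, eps=1e-8):
    """Zero-mean / unit-variance normalization of numeric features within a
    single environment's data matrix x of shape (N_E, d). Uses the population
    standard deviation, matching the 1/N_E factor in the formulas above."""
    mu = x.mean(axis=0)          # per-feature mean over this environment
    sigma = x.std(axis=0)        # per-feature std over this environment
    return (x - mu) / (sigma + eps)

# Three records from one environment, two numeric features each.
x = np.array([[1.0, 10.0], [3.0, 30.0], [5.0, 50.0]])
x_norm = normalize_per_env(x)
```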

3. Data Loader Construction and Balanced Sampling

To efficiently support mixed-supervision training, the pipeline employs an iterable DataLoader that probabilistically interleaves different data sources. Each source (expert or agent, per environment) is assigned a raw weight w_i, and the sampling probability is controlled by an exponent \alpha:

p_i = \frac{w_i^\alpha}{\sum_j w_j^\alpha}, \quad 0 \leq \alpha \leq 1

\alpha = 1 yields proportional sampling, while \alpha = 0 gives uniform mixing. Streaming from pre-shuffled iterators and dynamic reweighting ensures the desired expert–agent ratio (e.g., w_{\text{expert}} \approx w_{\text{agent}} for 1:1 mixing). Pseudocode is provided in the canonical implementation, enabling reproducibility (Zhang et al., 2024).
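The sampling rule can be sketched as follows; the function names and the iterator-based source representation are assumptions of this sketch, not the canonical implementation:

```python
import random

def mixing_probs(weights, alpha):
    """p_i proportional to w_i^alpha: alpha=1 -> proportional sampling,
    alpha=0 -> uniform mixing across sources."""
    scaled = [w ** alpha for w in weights]
    z = sum(scaled)
    return [s / z for s in scaled]

def interleave(sources, weights, alpha, n, rng=random.Random(0)):
    """Draw n items, picking a source index with probability p_i each time.
    Each source is an iterator over pre-shuffled records."""
    probs = mixing_probs(weights, alpha)
    out = []
    for _ in range(n):
        src = rng.choices(range(len(sources)), weights=probs, k=1)[0]
        out.append(next(sources[src]))
    return out

# Example: 1:1 expert-agent mixing from two toy sources.
expert_src = iter(["expert_step"] * 10)
agent_src = iter(["agent_step"] * 10)
batch = interleave([expert_src, agent_src], weights=[1.0, 1.0], alpha=1.0, n=5)
```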

4. Partitioning, Randomness Control, and Data Integrity

Data partitioning is performed at the trajectory level using a deterministic hash split over \text{trajectory\_id}, ensuring consistency and zero overlap between training, validation, and test sets across devices:

h = \operatorname{SHA256}(\text{trajectory\_id}) \bmod 100

Assignment to splits proceeds by thresholding h against n_{\text{train}} and n_{\text{train}} + n_{\text{val}}: buckets below n_{\text{train}} go to training, the next n_{\text{val}} buckets to validation, and the remainder to test.
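A minimal realization of this split, assuming an 80/10/10 bucket allocation (the defaults here are illustrative):

```python
import hashlib

def split_of(trajectory_id, n_train=80, n_val=10):
    """Deterministic trajectory-level split: SHA-256 of the id maps each
    trajectory to a bucket in [0, 100); bucket ranges define the splits."""
    digest = hashlib.sha256(trajectory_id.encode("utf-8")).hexdigest()
    h = int(digest, 16) % 100
    if h < n_train:
        return "train"
    if h < n_train + n_val:
        return "val"
    return "test"
```

Because the assignment depends only on the id, every device computes the same split without coordination, and a trajectory can never leak across sets.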

Randomness for data sampling is managed per-device: a global seed and device rank are hashed for a local seed, used as the RNG key:

\text{local\_seed} = \operatorname{hash}(\text{global\_seed}, \text{device\_rank})

This ensures deterministic data access and reproducible runs on distributed or asynchronous training setups.
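One way to realize this seed derivation is shown below; the choice of a cryptographic hash (rather than Python's builtin `hash`, which is not stable across processes for strings) is an implementation assumption made here for reproducibility:

```python
import hashlib
import random

def local_seed(global_seed, device_rank):
    """Derive a per-device RNG seed from the global seed and device rank."""
    payload = f"{global_seed}:{device_rank}".encode("utf-8")
    return int.from_bytes(hashlib.sha256(payload).digest()[:8], "big")

# Each device seeds its own RNG; rank 0 and rank 1 get distinct streams.
rng = random.Random(local_seed(1234, device_rank=0))
```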

5. Unified Training Loop: Imitation and Reinforcement Learning

The training pipeline simultaneously supports supervised imitation on expert data and reinforcement (rollout) loss on agent-collected data, with curriculum flexibility dictated by mixing coefficients \lambda_{\text{im}}, \lambda_{\text{rl}}:

\mathcal{L}_{\text{total}} = \lambda_{\text{im}} \mathcal{L}_{\text{im}} + \lambda_{\text{rl}} \mathcal{L}_{\text{rl}}

  • \mathcal{L}_{\text{im}}: Cross-entropy loss between model logits and expert actions.
  • \mathcal{L}_{\text{rl}}: Policy-gradient loss (e.g., REINFORCE) or a variant computed over agent rollouts.

A mini-batch mixer separates and featurizes each source for forward/backward passes, incrementally blending RL as training progresses (early epochs: \lambda_{\text{rl}} \approx 0; late epochs: \lambda_{\text{rl}} \to 1) (Zhang et al., 2024).
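The combined objective can be sketched numerically for single-step records; this is a simplified NumPy illustration (per-step losses, scalar returns, no baselines or automatic differentiation), not the framework's training loop:

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-probabilities from raw logits."""
    z = logits - logits.max()
    return z - np.log(np.exp(z).sum())

def imitation_loss(logits, expert_action):
    """L_im for one step: cross-entropy against the expert action index."""
    return -log_softmax(logits)[expert_action]

def rollout_loss(logits, agent_action, ret):
    """L_rl for one step: REINFORCE term -return * log pi(a | s)."""
    return -ret * log_softmax(logits)[agent_action]

def total_loss(expert_batch, agent_batch, lam_im, lam_rl):
    """L_total = lam_im * L_im + lam_rl * L_rl over one mixed mini-batch.
    expert_batch: (logits, expert_action) pairs;
    agent_batch:  (logits, agent_action, return) triples."""
    l_im = np.mean([imitation_loss(l, a) for l, a in expert_batch])
    l_rl = np.mean([rollout_loss(l, a, r) for l, a, r in agent_batch])
    return lam_im * l_im + lam_rl * l_rl
```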

6. Metrics for Data Quality and Agent Evaluation

Metrics cover both the statistical properties of the data and the end-to-end success of downstream policies.

A. Data Quality:

  • State Diversity: |\mathcal{S}_{\text{data}}| / |\mathcal{S}_{\text{all}}|, the fraction of unique states sampled.
  • Action Coverage: |\mathcal{A}_{\text{data}} \cap \mathcal{A}_E| / |\mathcal{A}_E|, for environment E.
  • Expert–Agent Distribution Shift: squared Hellinger distance H^2 between P_E(a \mid s) and P_A(a \mid s):

    H^2 = \frac{1}{2} \sum_a \left( \sqrt{P_E(a \mid s)} - \sqrt{P_A(a \mid s)} \right)^2

    Averaged over a held-out state set.
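The shift metric is straightforward to compute from per-state action distributions; a minimal sketch (function names assumed for illustration, rows are P(a | s) over a shared action set):

```python
import numpy as np

def hellinger_sq(p_e, p_a):
    """Squared Hellinger distance between two action distributions at one
    state; 0 for identical distributions, 1 for disjoint supports."""
    return 0.5 * np.sum((np.sqrt(p_e) - np.sqrt(p_a)) ** 2)

def mean_distribution_shift(expert_dists, agent_dists):
    """Average H^2 over a held-out set of states."""
    return float(np.mean([hellinger_sq(pe, pa)
                          for pe, pa in zip(expert_dists, agent_dists)]))
```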

B. Downstream Agent Performance:

  • Success Rate (SR): (\#\,\text{episodes with } r_{\text{final}} \geq \text{threshold}) / (\#\,\text{episodes})
  • Average Return (\bar{R}): \mathbb{E}\left[\sum_t r_t\right]
  • Imitation Gap (IG): \mathbb{E}_s\left[\mathbb{1}(a_{\text{agent}} \neq a_{\text{expert}})\right]
  • Sample Efficiency: Number of environment steps or updates required to reach a target return \bar{R}^*.

Logging these metrics over curriculum stages enables principled tuning of data mixing, curriculum scheduling, and model capacity, supporting systematic ablation and optimization (Zhang et al., 2024).

7. Practical Trade-Offs and Limitations

The pipeline’s modularity supports expansion to heterogeneous environments and multi-task scenarios. Early training heavily weights imitation to bootstrap agent competence in sparse-reward or long-horizon settings; gradual integration of reinforcement rollouts enables self-improvement. However, the approach presumes access to expert or oracle trajectories of sufficient coverage and may require domain-specific feature normalization. Distributional shift between expert and agent policies can introduce covariate shift if not carefully monitored via state-action coverage and divergence metrics. Extensions may instantiate domain-specific fields in the schema, or implement additional logging, filtering, and validation at the collection and sampling stages to align data properties with the intended deployment regime.


This overview provides a formal and technical account of the Expert–Agent Pair Data Generation Pipeline as realized in unified agentic frameworks, with canonical workflow, schema, sampling, training, and evaluation procedures (Zhang et al., 2024).
