Expert-Agent Pair Data Generation Pipeline
- The paper introduces a novel paired data generation pipeline that unifies expert demonstrations and agent rollouts to enhance training efficacy.
- It details a unified JSON schema and balanced sampling strategy that mitigate data heterogeneity while ensuring reproducibility across environments.
- The work provides principled metrics for assessing data quality and downstream agent performance, integrating imitation and reinforcement learning signals.
An Expert–Agent Pair Data Generation Pipeline is a modular framework for constructing high-fidelity datasets comprising interaction records between “expert” (human or oracle) actors and autonomous agents. These pipelines enable effective training, evaluation, and benchmarking of agentic systems, especially in multi-environment, multi-turn, or tool-use tasks. This paradigm standardizes data collection, curation, unification, and quality control, mitigating data heterogeneity and leakage, and enabling effective mixing of imitation learning and reinforcement learning signals. The following sections delineate the canonical architectural stages, schema formalism, sampling strategies, training workflows, and principled metrics that define the state-of-the-art in this domain, with emphasis on unified design patterns exemplified by AgentOhana and related frameworks (Zhang et al., 2024).
1. Paired Data Aggregation and Environment Instrumentation
Paired pipelines ingest two primary data streams: expert demonstrations and agent-generated rollouts.
- Expert Demonstrations: Recorded multi-turn trajectories from either human operators (via GUI/CLI) or automated “oracle” scripts. Each step is annotated with metadata (timestamp, task ID, user ID, action, reward, etc.), and can be augmented with ground-truth scripts for environments such as Web navigation or question answering.
- Agent Rollouts: Produced by checkpointed policies through environment interaction. Each agent step logs analogous fields, augmented by policy logits and exploration flags. Both streams can be collected from diverse environments via wrappers that guarantee standardized output.
This dual collection ensures coverage of both the expert and agent policy spaces and supports hybrid supervision objectives.
2. Unified Schema and Trajectory Representation
A universal JSON-based schema encodes trajectories as sequences of expert–agent paired turns. Each trajectory record includes environment context, random seed, detailed metadata, and arrays of step-wise records. Each step contains the following fields:
- `t`: Discrete step index.
- `obsₜ`: Raw or processed observation.
- `action_space`: Enumerated action set of the environment.
- `aₜE`, `aₜA`: Expert and agent actions; null-masked for non-applicable roles.
- `rₜ`, `done`: Task reward and episode termination flag.
Tabulated, a step aligns as:
| traj_id | env | source | t | obsₜ | aₜE | aₜA | rₜ | done |
|---------|-----|--------|---|------|-----|-----|----|------|
Per-environment normalization of numeric features to zero mean and unit variance is standard: x̃ = (x − μ_env) / σ_env.
This schema enables environment-agnostic loading and batch-level featurization (Zhang et al., 2024).
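The unified schema can be sketched as a plain JSON-serializable record. The following is a minimal illustration, not the canonical AgentOhana schema: concrete field names (`a_expert`, `a_agent`, `metadata` keys) and example values are assumptions chosen to mirror the table above.

```python
import json

# Illustrative trajectory record following the step layout in the table above.
# Field names beyond traj_id/env/t/r/done are assumptions, not the canonical schema.
trajectory = {
    "traj_id": "webnav-000042",
    "env": "webnav",
    "seed": 1234,
    "metadata": {"task_id": "search-product", "collector": "oracle-script"},
    "steps": [
        {
            "t": 0,
            "obs": "<html>...</html>",
            "a_expert": "CLICK(search_box)",
            "a_agent": None,  # null-masked: this trajectory came from the expert stream
            "r": 0.0,
            "done": False,
        },
        {
            "t": 1,
            "obs": "<html>results...</html>",
            "a_expert": "TYPE('red shoes')",
            "a_agent": None,
            "r": 1.0,
            "done": True,
        },
    ],
}

# Round-trip through JSON: the record is environment-agnostic and serializable,
# with None mapping to JSON null for masked role fields.
restored = json.loads(json.dumps(trajectory))
```

Because every environment emits the same step layout, a single loader can featurize batches without per-environment parsing logic.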
3. Data Loader Construction and Balanced Sampling
To efficiently support mixed-supervision training, the pipeline employs an iterable DataLoader that probabilistically interleaves different data sources. Each source i (expert or agent, per environment) is assigned a raw weight wᵢ, and the sampling probability is controlled by an exponent α:

pᵢ = wᵢ^α / Σⱼ wⱼ^α

α = 1 yields proportional sampling, while α = 0 gives uniform mixing. Streaming from pre-shuffled iterators and dynamic reweighting maintain the desired expert–agent ratio (e.g., equal weights for 1:1 mixing). Pseudocode is provided in the canonical implementation, enabling reproducibility (Zhang et al., 2024).
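The interleaving scheme above can be sketched as a small generator. Function and argument names here are illustrative, and `itertools.cycle` stands in for a streaming pre-shuffled iterator:

```python
import random
from itertools import cycle

def mixed_stream(sources, weights, alpha=1.0, seed=0):
    """Probabilistically interleave data sources.

    Sampling probability for source i is proportional to weights[i] ** alpha:
    alpha = 1 gives proportional sampling, alpha = 0 gives uniform mixing.
    A sketch of the scheme described above; names are illustrative.
    """
    rng = random.Random(seed)
    probs = [w ** alpha for w in weights]
    total = sum(probs)
    probs = [p / total for p in probs]
    iters = [cycle(s) for s in sources]  # stand-in for streaming shuffled iterators
    while True:
        i = rng.choices(range(len(iters)), weights=probs, k=1)[0]
        yield next(iters[i])

# Usage: equal weights with alpha = 1 approximate 1:1 expert-agent mixing.
expert = [("expert", k) for k in range(3)]
agent = [("agent", k) for k in range(3)]
stream = mixed_stream([expert, agent], weights=[1.0, 1.0], alpha=1.0, seed=42)
batch = [next(stream) for _ in range(8)]
```

Setting `alpha=0` flattens even highly skewed weights into uniform source mixing, which is useful when one environment dominates the corpus.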
4. Partitioning, Randomness Control, and Data Integrity
Data partitioning is performed at the trajectory level using a deterministic hash split over `traj_id`, ensuring consistency and zero overlap between training, validation, and test sets across devices:

u(traj_id) = H(traj_id) mod M / M ∈ [0, 1)

Assignment to splits proceeds via thresholding u against split boundaries θ_train < θ_val.
Randomness for data sampling is managed per device: a global seed and the device rank are hashed to produce a local seed, used as the RNG key:

local_seed = H(global_seed, rank)
This ensures deterministic data access and reproducible runs on distributed or asynchronous training setups.
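A minimal sketch of both mechanisms, assuming SHA-256 as the hash H and illustrative split boundaries of 0.8 and 0.9 (the source does not fix either choice):

```python
import hashlib

def split_of(traj_id, boundaries=(0.8, 0.9)):
    """Deterministic trajectory-level split via a hash of traj_id.

    Maps traj_id to u in [0, 1) and thresholds it against split boundaries.
    Boundary values here (80/10/10) are illustrative, not canonical.
    """
    digest = hashlib.sha256(traj_id.encode("utf-8")).digest()
    u = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    if u < boundaries[0]:
        return "train"
    if u < boundaries[1]:
        return "val"
    return "test"

def local_seed(global_seed, rank):
    """Derive a per-device RNG seed by hashing the global seed with device rank."""
    digest = hashlib.sha256(f"{global_seed}-{rank}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")

# The same traj_id lands in the same split on every device, with no
# cross-split overlap; each rank gets a distinct but reproducible seed.
```

Because the split depends only on `traj_id`, adding new trajectories never reshuffles existing ones between splits.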
5. Unified Training Loop: Imitation and Reinforcement Learning
The training pipeline simultaneously supports supervised imitation on expert data and reinforcement (rollout) loss on agent-collected data, with curriculum flexibility dictated by mixing coefficients λ_IL, λ_RL:

L = λ_IL · L_IL + λ_RL · L_RL

- L_IL: Cross-entropy loss between model logits and expert actions.
- L_RL: Policy gradient (e.g., REINFORCE) or a variant computed over agent rollouts.
A mini-batch mixer separates and featurizes each source for forward/backward passes, incrementally blending RL as training progresses (early epochs: λ_RL ≈ 0; late epochs: λ_RL increased) (Zhang et al., 2024).
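The loss mixing can be illustrated with a dependency-free sketch. Batch layouts and function names are assumptions: each expert item carries (logits, action), each agent item (logits, action, return), and the RL term uses a plain REINFORCE surrogate:

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def mixed_loss(expert_batch, agent_batch, lambda_il, lambda_rl):
    """Blend imitation and RL losses: L = lambda_il * L_IL + lambda_rl * L_RL.

    A sketch under assumed batch layouts, not the canonical implementation.
    """
    # L_IL: cross-entropy between model logits and expert actions.
    l_il = 0.0
    for logits, a in expert_batch:
        l_il += -math.log(softmax(logits)[a])
    l_il /= max(len(expert_batch), 1)

    # L_RL: REINFORCE surrogate, -log pi(a|s) * return, over agent rollouts.
    l_rl = 0.0
    for logits, a, ret in agent_batch:
        l_rl += -math.log(softmax(logits)[a]) * ret
    l_rl /= max(len(agent_batch), 1)

    return lambda_il * l_il + lambda_rl * l_rl

# Early epochs: lambda_rl near 0 (pure imitation); later epochs raise it.
expert = [([2.0, 0.0], 0)]
agent = [([0.5, 1.5], 1, 1.0)]
early = mixed_loss(expert, agent, lambda_il=1.0, lambda_rl=0.0)
late = mixed_loss(expert, agent, lambda_il=0.5, lambda_rl=0.5)
```

With `lambda_rl=0` the agent batch contributes nothing, matching the early-epoch imitation-only regime described above.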
6. Metrics for Data Quality and Agent Evaluation
Metrics cover both the statistical properties of the data and the end-to-end success of downstream policies.
A. Data Quality:
- State Diversity: D_state = |unique states| / |sampled states|, the fraction of unique states sampled.
- Action Coverage: C(e) = |actions observed in e| / |Aₑ|, for environment e with action set Aₑ.
- Expert–Agent Distribution Shift: Hellinger distance between π_E(·|s) and π_A(·|s),

H(π_E, π_A) = (1/√2) · ‖√π_E − √π_A‖₂,

averaged over a held-out state set.
B. Downstream Agent Performance:
- Success Rate (SR): fraction of evaluation episodes completing the task, SR = N_success / N_episodes.
- Average Return (R̄): mean cumulative reward per episode, R̄ = (1/N) Σᵢ Σₜ rₜ⁽ⁱ⁾.
- Imitation Gap (IG): IG = SR_expert − SR_agent.
- Sample Efficiency: number of environment steps or gradient updates needed to reach a target success rate.
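The distribution-shift metric above is straightforward to compute for discrete action spaces. A minimal sketch, assuming the policies are callables returning action-probability vectors (function names are illustrative):

```python
import math

def hellinger(p, q):
    """Hellinger distance between two discrete distributions.

    H(p, q) = (1/sqrt(2)) * || sqrt(p) - sqrt(q) ||_2, bounded in [0, 1].
    """
    s = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return math.sqrt(s) / math.sqrt(2)

def expert_agent_shift(states, pi_expert, pi_agent):
    """Average Hellinger distance between expert and agent policies
    over a held-out state set (policies return probability vectors)."""
    total = sum(hellinger(pi_expert(s), pi_agent(s)) for s in states)
    return total / len(states)

# Identical policies give zero shift; disjoint supports give the maximum, 1.
zero = hellinger([0.5, 0.5], [0.5, 0.5])
maximal = hellinger([1.0, 0.0], [0.0, 1.0])
```

Unlike KL divergence, the Hellinger distance stays finite when one policy assigns zero probability to an action the other takes, which makes it robust for comparing deterministic experts against stochastic agents.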
Logging these metrics over curriculum stages enables principled tuning of data mixing, curriculum scheduling, and model capacity, supporting systematic ablation and optimization (Zhang et al., 2024).
7. Practical Trade-Offs and Limitations
The pipeline’s modularity supports expansion to heterogeneous environments and multi-task scenarios. Early training heavily weights imitation to bootstrap agent competence in sparse-reward or long-horizon settings; gradual integration of reinforcement rollouts enables self-improvement. However, the approach presumes access to expert or oracle trajectories of sufficient coverage and may require domain-specific feature normalization. Distributional shift between expert and agent policies can introduce covariate shift if not carefully monitored via state-action coverage and divergence metrics. Extensions may instantiate domain-specific fields in the schema, or implement additional logging, filtering, and validation at the collection and sampling stages to align data properties with the intended deployment regime.
This overview provides a formal and technical account of the Expert–Agent Pair Data Generation Pipeline as realized in unified agentic frameworks, with canonical workflow, schema, sampling, training, and evaluation procedures (Zhang et al., 2024).