Agent Trajectory Data Protocol

Updated 4 July 2026

ATDP is a standardized representation that captures RL-step-grade agent data by recording full decision contexts at each step.
It preserves detailed event fields (observable status, hidden internal status, actions, outcomes, reward signals, and metadata) for precise credit assignment.
The protocol unifies heterogeneous agent logs into versioned, replayable trajectories, supporting continuous, governed, and online learning improvements.

Agent Trajectory Data Protocol (ATDP) denotes a standardized, vendor-neutral, RL-step-grade trajectory data protocol for agent systems. Its stated purpose is to turn deployed agent interactions into typed, auditable, replayable, credit-assignable event sequences that can be used for online reinforcement learning and broader self-evolution. In the formulation introduced in "Next-Generation Agentic Reinforcement Learning Systems Enable Self-Evolving Agents," ATDP is positioned as a learning-oriented representation layer that preserves the agent decision process at step granularity, in contrast to ordinary agent logs that are primarily used for observability and debugging (Yan et al., 1 Jul 2026).

1. Definition and motivation

ATDP is introduced to address a specific systems deficiency in enterprise agent deployment: current agentic RL systems and surrounding observability stacks are described as inadequate because they lack a standardized trajectory protocol capable of carrying RL learning signals at step granularity across heterogeneous agent paradigms. The motivating claim is not that better RL algorithms are absent, but that the deployed data substrate is not learning-ready (Yan et al., 1 Jul 2026).

The contrast with conventional logging is central. Existing agent logs typically record prompts, completions, tool calls, latency, errors, and token usage. These records are useful for debugging, but they omit properties that ATDP treats as necessary for online RL: step-level decision context, hidden or harness state relevant to the decision, outcomes of actions, delayed rewards or critiques, provenance and versioning, governance metadata, replayability, and credit assignment structure. ATDP therefore solves a representational problem: converting heterogeneous agent execution traces into a substrate that is structured, replayable, auditable, governed, and suitable for reward assignment and policy improvement (Yan et al., 1 Jul 2026).

A common misconception is to treat ATDP as a more verbose logging schema. The protocol is defined more narrowly and more technically than that. Its role is to preserve the full lifecycle of the agent decision process in a form that can support online learning, delayed feedback attachment, and governed replay, rather than merely post hoc inspection.

2. Event semantics and formal model

The prototype formalization represents an agent trajectory as a sequence of step events:

$\bm{\tau} = (\mathbf{e}_1, \mathbf{e}_2, \ldots, \mathbf{e}_T)$

Each step event is defined as:

$\mathbf{e}_t = \langle \mathbf{o}_t, \mathbf{h}_t, \mathbf{a}_t, \mathbf{y}_t, r_t, \mathbf{m}_t \rangle$

The fields are intended to capture both the externally visible interaction and the learning-relevant execution envelope.

Field	Semantics	Examples
$\mathbf{o}_t$	observable status	tool outputs, retrieval snippets, user message, environment state
$\mathbf{h}_t$	hidden internal status	followed plan, scratchpad, confidence, reasoning summary
$\mathbf{a}_t$	chosen action	typed tool call, generated message, code edit, memory update
$\mathbf{y}_t$	action outcome	tool return, user accept/edit/retry/delete, exit code
$r_t$	reward signal	binary outcome, scalar score, natural-language critique, implicit signal
$\mathbf{m}_t$	metadata	latency, tokens, cost, tenant, session, harness fingerprint, model id

The paper notes that this schema reduces to a standard partially observable Markov decision process abstraction when $\mathbf{h}_t = \emptyset$ and the metadata $\mathbf{m}_t$ is dropped. ATDP is nevertheless defined as richer than that reduction because it includes LLM-specific artifacts such as reasoning traces, retrieval snippets, tool schemas, human corrections, natural-language critiques, and rejected actions. This enrichment is what makes the representation “RL-step-grade” rather than merely sequence-logged (Yan et al., 1 Jul 2026).
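The step-event tuple and its reduction can be illustrated with a minimal Python sketch. The class name `StepEvent`, the function `reduce_to_pomdp_view`, and the choice of plain dictionaries for each field are assumptions made for illustration; the source does not prescribe a concrete encoding. Keeping $\mathbf{y}_t$ in the reduced tuple is likewise an assumption of this sketch.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass(frozen=True)
class StepEvent:
    """One step event e_t = <o_t, h_t, a_t, y_t, r_t, m_t> (hypothetical encoding)."""
    o: dict[str, Any]             # observable status
    h: Optional[dict[str, Any]]   # hidden internal status; None models the empty set
    a: dict[str, Any]             # chosen action
    y: dict[str, Any]             # action outcome
    r: Any                        # reward signal (binary, scalar, critique, implicit)
    m: dict[str, Any] = field(default_factory=dict)  # metadata


def reduce_to_pomdp_view(trajectory: list[StepEvent]) -> list[tuple]:
    """Drop h_t and m_t from every event, keeping (o_t, a_t, y_t, r_t)."""
    return [(e.o, e.a, e.y, e.r) for e in trajectory]


# A one-step trajectory with illustrative field contents.
tau = [
    StepEvent(
        o={"user_message": "list open tickets"},
        h={"plan": "query tracker"},
        a={"tool": "tracker.search", "args": {"state": "open"}},
        y={"exit_code": 0},
        r=1.0,
        m={"model_id": "example-model"},
    )
]
reduced = reduce_to_pomdp_view(tau)
```

The reduced view discards exactly the parts that the text identifies as the additions over the POMDP abstraction, which is why it is lossy with respect to credit assignment and replay.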

The formalization also clarifies the temporal semantics of reward. Reward is not restricted to an immediate scalar emitted at action time. It may be binary, scalar, natural-language critique, or implicit signal extracted from an outcome, and it may be updated or augmented later while preserving the original causal record.
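One way to read "updated or augmented later while preserving the original causal record" is as an append-only ledger of reward signals keyed by step index. The sketch below assumes that reading; `RewardSignal`, `RewardLedger`, and the `kind` labels are hypothetical names, not part of the published protocol.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass(frozen=True)
class RewardSignal:
    kind: str           # assumed labels: "binary", "scalar", "critique", "implicit"
    value: Any
    source: str         # e.g. "unit_tests", "user_edit", "human_annotator"
    observed_at: float  # arrival time in seconds, which may be long after the action


@dataclass
class RewardLedger:
    """Append-only store: signals are added per step and never overwritten."""
    _signals: dict[int, list[RewardSignal]] = field(default_factory=dict)

    def attach(self, step: int, signal: RewardSignal) -> None:
        self._signals.setdefault(step, []).append(signal)

    def signals_for(self, step: int) -> tuple[RewardSignal, ...]:
        return tuple(self._signals.get(step, ()))

    def numeric_total(self, step: int) -> float:
        """Sum only the binary and scalar signals; critiques stay as text."""
        return sum(
            float(s.value)
            for s in self.signals_for(step)
            if s.kind in ("binary", "scalar")
        )


ledger = RewardLedger()
ledger.attach(3, RewardSignal("binary", True, "tool_exit_code", observed_at=10.0))
# A critique that arrives later is appended; the earlier signal is left intact.
ledger.attach(3, RewardSignal("critique", "query was too broad", "human_annotator", observed_at=900.0))
```

Because earlier entries are never mutated, a training job can reconstruct what was known at any point in time by filtering on `observed_at`.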

3. Design principles

The protocol is organized around six explicit design principles.

Decision-relevant bounded revelation: ATDP should record enough to improve behavior, but not require exposing every internal token or chain-of-thought fragment.
Unification across frameworks and tasks: ATDP is intended to unify heterogeneous agent logs and task-specific RL datasets into a common event record.
Credit assignability: The trajectory must support questions such as which observation mattered, which retrieval result mattered, which tool call mattered, and which guardrail decision mattered.
Late-bound learning signals: Reward and critiques may arrive later than the action, so reward fields should be updateable or augmentable later while preserving the original causal record.
Versioned replayability: Each event must be attributable to the exact execution envelope, including harness schema, tool version, retrieval index snapshot, guardrail configuration, and policy LLM version or checkpoint.
Governed observability: Governance and compliance fields should be present from the start, including redaction status, data classification labels, tenant identifiers, retention policy, consent or legal basis, human-review status, and training eligibility (Yan et al., 1 Jul 2026).

These principles give ATDP its particular semantics. Credit assignment is not treated abstractly; for tool calls, the protocol is expected to store tool version, argument schema, permission scope, latency, error class, return object, and whether the result was later trusted, ignored, corrected, or contradicted. Late-bound rewards are likewise motivated by concrete sources of feedback such as user corrections in later turns, failing tests, human annotations, and slower remote evaluators. Governed observability introduces split visibility: redacted traces for production debugging and sealed or policy-approved fields for training jobs (Yan et al., 1 Jul 2026).
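The split-visibility idea can be sketched as two projections of the same stored event: a redacted view for production debugging and a gated full view for training jobs. The names `Governance`, `debug_view`, and `training_view`, the `"[REDACTED]"` placeholder, and the tenant-match rule are illustrative assumptions; the source lists the governance fields but does not specify enforcement logic.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass(frozen=True)
class Governance:
    classification: str            # data classification label
    tenant_id: str
    training_eligible: bool
    sealed_fields: frozenset[str]  # fields hidden from debugging views


def debug_view(event: dict[str, Any], gov: Governance) -> dict[str, Any]:
    """Redacted projection for production debugging."""
    return {
        key: ("[REDACTED]" if key in gov.sealed_fields else value)
        for key, value in event.items()
    }


def training_view(
    event: dict[str, Any], gov: Governance, job_tenant: str
) -> Optional[dict[str, Any]]:
    """Full projection, released only to an eligible job of the same tenant."""
    if not gov.training_eligible or gov.tenant_id != job_tenant:
        return None
    return dict(event)


event = {"tool": "crm.lookup", "args": {"email": "user@example.com"}, "error_class": None}
gov = Governance("confidential", "tenant-a", True, frozenset({"args"}))
```

Here `debug_view(event, gov)` hides the tool arguments, while `training_view(event, gov, "tenant-a")` returns them and the same call for any other tenant returns `None`.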

4. Position in next-generation agentic RL stacks

ATDP is the first pillar in a three-pillar architecture for self-evolving agent deployment. The three pillars are ATDP, which specifies what should be represented; a data proxy, which specifies how it is captured in production; and an evolution control plane, which specifies when and how the captured trajectories are used for improvement. ATDP is therefore the semantic and structural foundation on which the other two layers depend (Yan et al., 1 Jul 2026).

The conceptual flow is described in six stages:

Agent interacts with environment, tools, or users.
A data proxy intercepts execution at stable boundaries.
Intercepted events are emitted as ATDP trajectories.
Trajectories are redacted, governed, persisted, and annotated.
Trajectories are replayable and training-ready.
The evolution control plane consumes ATDP windows to decide updates.
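The interception and emission stages in the flow above can be illustrated by wrapping a tool function so that every call emits a record, whether it succeeds or fails. This is a minimal sketch under the assumption that a tool call is one of the "stable boundaries"; the function `proxied`, the record layout, and the list used as a sink are hypothetical and do not reflect the AReaL2.0 implementation.

```python
import time
from typing import Any, Callable


def proxied(
    tool_name: str,
    tool_version: str,
    fn: Callable[..., Any],
    sink: list[dict[str, Any]],
) -> Callable[..., Any]:
    """Wrap fn so that each call appends an action/outcome record to sink."""

    def wrapper(**kwargs: Any) -> Any:
        start = time.perf_counter()
        record: dict[str, Any] = {
            "action": {"tool": tool_name, "args": kwargs},
            "meta": {"tool_version": tool_version},
        }
        try:
            result = fn(**kwargs)
            record["outcome"] = {"ok": True, "return": result}
            return result
        except Exception as exc:
            # Record the error class, then let the agent see the failure unchanged.
            record["outcome"] = {"ok": False, "error_class": type(exc).__name__}
            raise
        finally:
            record["meta"]["latency_s"] = time.perf_counter() - start
            sink.append(record)

    return wrapper


def add(x: int, y: int) -> int:
    return x + y


events: list[dict[str, Any]] = []
add_tool = proxied("math.add", "1.0.0", add, events)
total = add_tool(x=2, y=3)
```

The wrapper leaves the tool's return value and exceptions untouched, so the surrounding agent code does not need to change when capture is switched on.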

The control plane observes a window of ATDP trajectories,

$\bm{\mathcal{D}}_t = \{\bm{\tau}_i\}_{i=t-W}^{t},$

and uses trajectory statistics to trigger interventions such as memory updates, skill patches, harness edits, tool-schema changes, policy updates, rollback, or no-op. This makes ATDP the substrate for governed intervention rather than a passive archival format (Yan et al., 1 Jul 2026).
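A toy version of this windowed decision is sketched below. It assumes each trajectory has already been summarized as a success flag, that the only statistic is the windowed success rate, and that two fixed thresholds select among three of the listed interventions; `TrajectorySummary`, `ControlWindow`, and the threshold values are hypothetical. Since the index runs from $t-W$ to $t$ inclusive, the buffer holds $W+1$ trajectories.

```python
from collections import deque
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TrajectorySummary:
    trajectory_id: str
    success: bool


class ControlWindow:
    """Sliding window over the most recent W + 1 trajectory summaries."""

    def __init__(self, W: int) -> None:
        self._buf: deque[TrajectorySummary] = deque(maxlen=W + 1)

    def observe(self, summary: TrajectorySummary) -> None:
        self._buf.append(summary)  # the oldest entry is evicted automatically

    def success_rate(self) -> Optional[float]:
        if not self._buf:
            return None
        return sum(s.success for s in self._buf) / len(self._buf)

    def decide(self, rollback_below: float = 0.5, update_below: float = 0.8) -> str:
        rate = self.success_rate()
        if rate is None:
            return "no-op"
        if rate < rollback_below:
            return "rollback"
        if rate < update_below:
            return "policy_update"
        return "no-op"


window = ControlWindow(W=3)
for i, ok in enumerate([True, True, False, False]):
    window.observe(TrajectorySummary(f"traj-{i}", ok))
decision = window.decide()
```

A real control plane would presumably use richer statistics and a larger action set, but the structure is the same: a bounded window of trajectories goes in and a single governed intervention, possibly a no-op, comes out.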

AReaL2.0 is presented as a prototype instantiation of one branch of this architecture, specifically the policy-update branch. It reorganizes existing RL infrastructure into an online RL service architecture with four components: Gateway, Router, Data Proxy, and Agent-Compute Worker. In the Hermes case study, an inference backend such as SGLang is replaced by an AReaL2.0-managed agent-compute worker via the gateway, while the surrounding agent service remains mostly unchanged; AReaL2.0 intercepts the interaction stream, records trajectories, and connects them to the online RL loop (Yan et al., 1 Jul 2026).

5. Relation to adjacent protocol efforts

ATDP is explicitly distinguished from ordinary observability and logging formats. Those formats are characterized as debugging- and reliability-oriented, and they are said to often omit causal decision context, the versioned execution envelope, reward and critique information, replay boundaries, and governance fields. On this account, trajectories captured without ATDP remain incomplete, non-reproducible, unsafe to train on, and weak for credit assignment (Yan et al., 1 Jul 2026).

The closest named neighboring effort is Agent Data Protocol (ADP). ADP is described as a lightweight “interlingua” for unifying heterogeneous LLM-agent datasets across tool use, browsing, coding, software engineering, and general workflows, and it operationalizes this role through a trajectory schema built from alternating actions and observations (Song et al., 28 Oct 2025). ATDP addresses a different setting and carries more per-step structure: it targets deployed self-evolving agents and preserves step-level causal context, delayed reward signals, action outcomes, harness and tool versions, governance metadata, replay boundaries, and learning eligibility (Yan et al., 1 Jul 2026). In short, ADP standardizes datasets for downstream training pipelines, whereas ATDP standardizes deployed experience for enterprise online learning.

A second nearby line of work concerns multi-agent communication with internal trajectory data. "Augmenting Multi-Agent Communication with State Delta Trajectory" transmits natural-language tokens together with token-wise hidden-state transition trajectories via State Delta Encoding (SDE), with the aim of preserving reasoning information that natural language alone may lose (Tang et al., 24 Jun 2025). This suggests a conceptual adjacency rather than an identity: SDE addresses communication between agents during inference, while ATDP addresses the representation of agent experience for learning, replay, and governed evolution. A plausible implication is that mechanisms such as token-aligned state deltas exemplify the kind of hidden internal status or reasoning summary that an RL-grade trajectory protocol may eventually need to represent, but the two proposals operate at different layers of the stack.

6. Scope, constraints, and research significance

ATDP is presented as vendor-neutral and framework-agnostic, and the associated data-proxy discussion states that it must interoperate with LangChain, LangGraph, CrewAI, OpenAI Agents SDK, Claude Agent SDK, MCP-connected tools, and custom orchestration code. The scope is therefore heterogeneous by design: the protocol is intended to act as a common trajectory layer across different agent stacks rather than to encode the assumptions of a single framework (Yan et al., 1 Jul 2026).

Its constraints are equally explicit. ATDP is not defined as unrestricted exposure of internal reasoning. The principle of decision-relevant bounded revelation rejects the requirement to record every internal token or chain-of-thought fragment. Nor is AReaL2.0 described as a complete implementation. The prototype is said to still lack several capabilities: complete ATDP support with step-level decision context and governance metadata; comprehensive capture of tool, retrieval, memory, file, browser, and human-feedback events; replay and counterfactual evaluation support; tenant-aware privacy and training-eligibility enforcement; and automatic multi-surface evolution (Yan et al., 1 Jul 2026).

The research significance of ATDP lies in the shift it enables in the paper’s broader argument: from static deployment plus manual retraining to continuous, governed, trajectory-driven learning. That shift depends on three conditions being simultaneously satisfied: a standard trajectory protocol, a governed capture layer, and an automatic intervention layer. Within that triad, ATDP is the protocol that makes the first condition possible and supplies the consistent data representation consumed by the other two (Yan et al., 1 Jul 2026).