GRADE: Graph Representation of LLM Agent Dependency and Execution

Published 22 Jun 2026 in cs.LG | (2606.22741v1)

Abstract: Can one graph represent every kind of LLM agent's run? A trace records what each step did, but never what it relied on: the state it read and the results it reused. GRADE recovers that missing layer: it models any run as one graph over its step nodes with two edge layers. Execution edges (what ran in what order) are read from the trace for free. Dependency edges (what each step relied on) are rarely logged, so each is graded by how it is known: observed, declared, or inferred. One representation, and each layer earns its place. Across six corpora of LLM agents spanning tool use, coding, and the web, the dependency layer can predict failure where run size is weak and, under leave-one-corpus-out transfer, stays above chance on every held-out class while run size fails. Meanwhile, the execution layer localizes the faulting step in a failed multi-agent run. This work also provides a more in-depth analysis of why generic graph neural networks may misread the dependency layer, unlike our feature-based alternative. The same graph representation opens further uses, carrying from failure diagnosis in a single run to efficiency and robustness optimization at scale.


Authors (1)

Yue Zhao

Summary

The paper introduces a dual-layer graph model that explicitly captures execution and graded dependency relations, enabling precise detection of coordination and reliance failures.
It demonstrates that dependency features raise failure-prediction ROC-AUC by up to +0.142 (on SWE-Gym) over a run-size baseline in corpora where run size is a weak predictor.
The execution layer localizes the faulting step in failed multi-agent runs, while the graded dependency layer supports failure prediction that transfers across diverse LLM agentic workflows.

GRADE: Unified Graph Representation for LLM Agent Dependency and Execution

Motivation and Problem Formulation

Complex workflows automated with LLM agents increasingly fail because a step relies on outdated or incorrect state, and traditional execution traces do not document that reliance. Prevailing representations focus on stepwise execution and omit dependency information: what state each step relied upon, and how that dependency was acquired or inferred. The GRADE framework addresses this gap by modeling agentic workflows as directed, typed, temporal multigraphs that explicitly distinguish between execution edges (observable from traces) and dependency edges (graded as observed, declared, or inferred). This dual-layer representation subsumes existing orchestration, computation, and data-lineage graphs as projections, providing a common substrate for systematic analysis of both coordination and reliance failures.

Formal Representation and Attachment Grading

GRADE represents a workflow run using four node types (agent, decision, tool call, and external resource) and two edge layers:

Execution Layer: Directly recoverable from traces, capturing stepwise progression, agent emission, tool execution, and control handoff.
Dependency Layer: Encodes reliance relations, graded per edge as observed (explicit state accesses), declared (logged via instrumentation), or inferred (postulated via assumptions such as full-history). The edge source grade is tracked per-edge, not per-run, allowing for heterogeneous provenance within a single graph.

This attachment grade is critical: it records the epistemic source of each reliance, so that analysis can separate genuine structural dependencies from a mere projection of execution order or volume.
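The two-layer structure described above can be sketched as a small data model. This is a minimal illustration rather than the paper's implementation: the class names, the `kind` labels on execution edges, and the edge direction for dependencies (from the relying step to what it relies on) are all assumptions made for the example.

```python
from collections import Counter
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List


class NodeType(Enum):
    AGENT = "agent"
    DECISION = "decision"
    TOOL_CALL = "tool_call"
    EXTERNAL_RESOURCE = "external_resource"


class Grade(Enum):
    OBSERVED = "observed"
    DECLARED = "declared"
    INFERRED = "inferred"


@dataclass(frozen=True)
class ExecEdge:
    src: str
    dst: str
    kind: str  # hypothetical label, e.g. "emit", "invoke", "handoff"
    t: int     # position in the trace (temporal order)


@dataclass(frozen=True)
class DepEdge:
    src: str      # the node that relies (assumed direction)
    dst: str      # the node it relies on
    grade: Grade  # tracked per edge, not per run


@dataclass
class RunGraph:
    nodes: Dict[str, NodeType] = field(default_factory=dict)
    exec_edges: List[ExecEdge] = field(default_factory=list)
    dep_edges: List[DepEdge] = field(default_factory=list)

    def add_node(self, node_id: str, node_type: NodeType) -> None:
        self.nodes[node_id] = node_type

    def _check(self, *ids: str) -> None:
        for node_id in ids:
            if node_id not in self.nodes:
                raise KeyError(f"unknown node: {node_id}")

    def add_exec(self, src: str, dst: str, kind: str, t: int) -> None:
        self._check(src, dst)
        self.exec_edges.append(ExecEdge(src, dst, kind, t))

    def add_dep(self, src: str, dst: str, grade: Grade) -> None:
        self._check(src, dst)
        self.dep_edges.append(DepEdge(src, dst, grade))

    def grade_counts(self) -> Counter:
        """Count dependency edges by attachment grade."""
        return Counter(edge.grade for edge in self.dep_edges)


# A toy run with heterogeneous provenance inside one graph.
g = RunGraph()
g.add_node("planner", NodeType.AGENT)
g.add_node("choose_tool", NodeType.DECISION)
g.add_node("search", NodeType.TOOL_CALL)
g.add_node("index", NodeType.EXTERNAL_RESOURCE)
g.add_exec("planner", "choose_tool", "emit", 0)
g.add_exec("choose_tool", "search", "invoke", 1)
g.add_dep("search", "index", Grade.OBSERVED)
g.add_dep("choose_tool", "planner", Grade.DECLARED)
g.add_dep("search", "planner", Grade.INFERRED)
print(g.grade_counts())
```

Keeping the grade on the edge rather than on the run is what lets a single graph mix logged, declared, and assumed reliance, so downstream features can be computed per grade.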

Failure Modes: Coordination and Reliance

Agentic failures manifest in two primary modes:

Coordination Failures: Emergent from erroneous progression or control handoff; fully represented in the execution layer.
Reliance Failures: Arise when steps act on stale or incorrect state; only visible in the dependency layer. Reliance failures are systematically unlogged in traces, but GRADE facilitates their explicit representation.

GRADE enables detection, diagnosis, and attribution of these failure types within a unified formalism, offering analytic leverage unavailable in prior representations.

Empirical Validation: Marginal-Lift and Transfer Analyses

The central claim is evaluated on six diverse corpora (spanning tool use, coding, and web navigation). The marginal lift of dependency features is measured as the gain in cross-validated ROC-AUC over a run-size baseline:

Within-Corpus Failure Prediction: Dependency shape features significantly enhance failure prediction where run size is an ineffective predictor, yielding lifts up to +0.142 ROC-AUC (SWE-Gym). Where run size dominates, dependency features are redundant but not harmful.
Cross-Corpus Transfer: In leave-one-corpus-out settings, dependency features consistently transfer better than run size, staying above chance on all held-out domains. Run size, in contrast, systematically inverts (scores below chance) on two corpora, which demonstrates its lack of portability.
Structural Analysis: The marginal lift is attributed to chain depth, hub concentration, and reliance structure, not simply revisit density. Dependency-only models match run-size predictors on size-dominated domains, confirming redundancy rather than vacuity.
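The marginal-lift measurement above can be illustrated with a small worked example. The sketch below computes ROC-AUC as the probability that a failed run scores higher than a successful one, then takes the difference between a model with dependency features and a run-size baseline. The labels and scores are made-up toy values, not results from the paper, and the cross-validation step is omitted: the scores are assumed to be out-of-fold predictions.

```python
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """ROC-AUC via pairwise comparison; ties count as half a win."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def marginal_lift(labels, baseline_scores, augmented_scores) -> float:
    """AUC gain of the augmented model over the baseline."""
    return roc_auc(labels, augmented_scores) - roc_auc(labels, baseline_scores)


# Toy data: 1 = failed run, 0 = successful run.
labels = [1, 1, 0, 0, 1, 0]
run_size_scores = [0.4, 0.6, 0.5, 0.3, 0.2, 0.7]    # hypothetical baseline
dependency_scores = [0.8, 0.7, 0.4, 0.3, 0.6, 0.5]  # hypothetical augmented model

baseline_auc = roc_auc(labels, run_size_scores)
augmented_auc = roc_auc(labels, dependency_scores)
lift = marginal_lift(labels, run_size_scores, dependency_scores)
print(f"baseline={baseline_auc:.3f} augmented={augmented_auc:.3f} lift={lift:+.3f}")
```

In this toy case the baseline AUC is below 0.5, which is what an inverted predictor looks like: it ranks successful runs above failed ones more often than not.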

Degenerate Regime and Generic Graph Network Limitations

GRADE identifies a "degenerate regime" in which inferred dependency layers (under full-history assumptions) collapse to deterministic functions of run size, eliminating real structural signal. The saturation ratio $p = \frac{|\mathrm{ED}|}{\binom{n}{2}}$, the number of dependency edges relative to the number of node pairs, operationalizes this collapse. Empirical comparison with generic graph neural networks (GIN, R-GCN, HGT) reveals:

Within-Corpus Fit: Relation-aware networks can fit structural data in-distribution.
Poor Transferability: Under transfer, source-aware feature models outperform all tested GNNs. Relation-aware networks fall below chance on held-out corpora, and the signal attributable to run size inverts or collapses.
Structural Mismatch: Standard message-passing architectures lack inductive bias for epistemic attachment grading; they cannot differentiate between observed, declared, and inferred dependencies, leading to failure in portable prediction.

This analysis distinguishes GRADE’s epistemic sensitivity from molecular and dynamic graph learning, where edge grades are scalar weights and do not exhibit collapse properties tied to node count.
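The saturation ratio $p$ defined earlier in this section can be computed directly. The sketch below assumes that $n$ counts step nodes, that dependency edges are counted as distinct unordered pairs, and that the full-history assumption means every step relies on all earlier steps; these readings are interpretations for illustration, not definitions confirmed by the source.

```python
from math import comb
from typing import Iterable, List, Tuple

Edge = Tuple[int, int]  # (relying step, step relied on)


def saturation_ratio(num_nodes: int, dep_edges: Iterable[Edge]) -> float:
    """p = |ED| / C(n, 2), counting each unordered node pair once."""
    if num_nodes < 2:
        return 0.0
    pairs = {frozenset(edge) for edge in dep_edges if edge[0] != edge[1]}
    return len(pairs) / comb(num_nodes, 2)


def full_history_edges(num_nodes: int) -> List[Edge]:
    """Inferred layer under the assumed full-history rule:
    step i relies on every earlier step j < i."""
    return [(i, j) for i in range(num_nodes) for j in range(i)]


def chain_edges(num_nodes: int) -> List[Edge]:
    """A sparse layer where each step relies only on its predecessor."""
    return [(i, i - 1) for i in range(1, num_nodes)]


n = 5
p_full = saturation_ratio(n, full_history_edges(n))
p_chain = saturation_ratio(n, chain_edges(n))
print(p_full, p_chain)
```

Under the full-history rule the edge count is always $\binom{n}{2}$, so $p$ equals 1 for every run and the layer carries no information beyond $n$. A sparse layer such as the chain gives $p = 2/n$, which leaves room for structure to vary between runs of the same size.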

Execution Layer: Fault Localization

GRADE's execution layer supports precise fault localization. On the Who&When multi-agent benchmark, agent centrality and handoff position rank faulting steps above position priors and random baselines, demonstrating that execution topology alone can localize coordination faults. The dependency layer's potential for reliance fault localization awaits richer corpora with both dependency and fault labels.
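One way to picture ranking by agent centrality and handoff position is the sketch below. The scoring rule is hypothetical and is not the paper's method: it treats a handoff as any consecutive pair of steps run by different agents, uses handoff degree as a stand-in for agent centrality, and adds an arbitrary bonus to steps that immediately follow a handoff.

```python
from collections import Counter
from typing import List, Sequence


def rank_steps(step_agents: Sequence[str], handoff_bonus: float = 0.5) -> List[int]:
    """Rank step indices from most to least suspicious (illustrative rule)."""
    # A handoff is a consecutive pair of steps executed by different agents.
    handoffs = [(a, b) for a, b in zip(step_agents, step_agents[1:]) if a != b]

    # Handoff degree per agent, used here as a simple centrality proxy.
    degree = Counter()
    for a, b in handoffs:
        degree[a] += 1
        degree[b] += 1
    max_degree = max(degree.values(), default=0) or 1

    scored = []
    for i, agent in enumerate(step_agents):
        centrality = degree[agent] / max_degree
        follows_handoff = i > 0 and step_agents[i - 1] != agent
        score = centrality + (handoff_bonus if follows_handoff else 0.0)
        scored.append((score, i))

    # Highest score first; earlier steps win ties.
    return [i for score, i in sorted(scored, key=lambda item: (-item[0], item[1]))]


# Toy trace: the agent that executed each step, in order.
trace = ["planner", "coder", "coder", "tester", "coder", "planner"]
ranking = rank_steps(trace)
print(ranking)  # → [1, 4, 2, 3, 5, 0]
```

The sketch uses only execution order and agent identity, which matches the point that execution topology alone can carry localization signal. A real evaluation would compare such a ranking against position priors and random baselines on labeled faults.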

Implications and Future Directions

The graded dependency layer is instrumental for structural analysis, transfer learning, and reliable causal prediction. Practical implications include:

Instrumentation: Lightweight instrumentation to log read/write events elevates dependency edges from inferred to observed, unlocking more robust signals.
Efficiency Optimization: The dependency graph supports pruning of off-path work and identification of dead steps, conditional on observed edge attachment.
Repair & Replanning: Structural features diagnose fragile reliance and inform task-level reconfiguration.
Fleet-Scale Anomaly Detection: GRADE can feed anomaly detectors for reliability monitoring, but unsupervised anomaly detection remains open: current approaches often collapse to run length and so fall into the degenerate regime.
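The efficiency item above, pruning off-path work and identifying dead steps, can be sketched as backward reachability over the dependency layer. The example assumes that a dead step is one the final step does not rely on, directly or transitively, and that the inferred layer follows a full-history rule. Both are illustrative readings, and the step names are invented.

```python
from typing import Dict, Iterable, List, Sequence, Set, Tuple

Edge = Tuple[str, str]  # (relying step, step relied on)


def live_steps(dep_edges: Iterable[Edge], goal: str) -> Set[str]:
    """Steps the goal relies on, directly or transitively (goal included)."""
    relies_on: Dict[str, Set[str]] = {}
    for src, dst in dep_edges:
        relies_on.setdefault(src, set()).add(dst)
    live, stack = {goal}, [goal]
    while stack:
        step = stack.pop()
        for dep in relies_on.get(step, ()):
            if dep not in live:
                live.add(dep)
                stack.append(dep)
    return live


def dead_steps(all_steps: Sequence[str], dep_edges: Iterable[Edge], goal: str) -> List[str]:
    """Steps that never feed into the goal, in original order."""
    live = live_steps(dep_edges, goal)
    return [step for step in all_steps if step not in live]


steps = ["s1", "s2", "s3", "s4", "s5"]

# Observed edges, e.g. recovered from logged reads and writes.
observed = [("s5", "s4"), ("s4", "s2"), ("s2", "s1"), ("s3", "s1")]

# Inferred edges under the assumed full-history rule.
inferred = [(steps[i], steps[j]) for i in range(len(steps)) for j in range(i)]

dead_observed = dead_steps(steps, observed, goal="s5")
dead_inferred = dead_steps(steps, inferred, goal="s5")
print(dead_observed, dead_inferred)
```

With observed edges, `s3` is identified as off-path because nothing downstream reads its result. With the inferred layer every step appears live, which is why pruning is conditional on observed attachment.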

The research agenda includes deployment in larger, more diverse agentic environments (embodied agents, richer web tasks), development of declared-grade corpora with fine-grained step attribution, and exploration of architectures that explicitly encode attachment grading.

Conclusion

GRADE establishes a unified, epistemically graded graph formalism for LLM agentic workflows, concretely subsuming prior orchestration and data-lineage models. By distinguishing execution and dependency layers—and grading their epistemic sources—it exposes both coordination and reliance failures and enables robust, transferable failure prediction beyond naïve run-size indicators. GRADE’s analytic substrate demonstrates superior portability and interpretability relative to generic graph neural network approaches, opening practical avenues for workflow optimization, fault localization, and reliability enhancement. Continued advances depend on richer instrumented corpora and architectures attuned to epistemic source grading.
