Usable Agent Discovery for Decentralized AI Systems

Published 25 Apr 2026 in cs.MA, cs.AI, and cs.DC | (2604.23080v1)

Abstract: Large-scale agentic systems run on distributed infrastructures where many software agents share physical hosts and are discovered via peer-to-peer mechanisms. Discovery must handle node-level churn from failures and host departures and agent-level churn from demand-driven activation, deactivation, and state changes. Their interaction reshapes classic trade-offs between structured and unstructured overlays. We study decentralized agent discovery under this two-level churn, assuming nodes host multiple agents, overlays are structured or gossip-based, and agents switch between warm and cold states. Using Kademlia as a structured and Cyclon+Vicinity as a gossip baseline, we compare stable, node-churn-only, agent-cooling-only, and combined regimes to see when routing efficiency, resilience, and service readiness align or favor different designs. Structured overlays are more robust and efficient in stable and node-churn regimes, while gossip-based overlays remain competitive and can be faster when readiness dominates.


Authors (4)

Summary

The paper introduces a model that separates node-level churn from agent readiness in order to evaluate decentralized agent discovery.
It evaluates structured overlays (Kademlia) versus gossip-based protocols (Cyclon+Vicinity), highlighting trade-offs in maintenance cost and service latency.
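Empirical results indicate that the better overlay depends on the churn profile and the agent warmness policy: Kademlia leads in stable and node-churn regimes, while Cyclon+Vicinity remains competitive and can be faster when agent readiness dominates.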


Problem Framework and Motivation

Distributed agentic infrastructures increasingly rely on decentralized discovery mechanisms, which must resolve not only network-level reachability but also agent-level usability as agent states change. Discovery means identifying an execution target able to deliver an advertised skill. However, agent lifecycle policies (e.g., cooling after inactivity and cold starts) interact with overlay structure and node-level churn (failures/joins), which complicates the classic trade-offs between structured (e.g., DHT-based) and unstructured (gossip-based) overlays. The paper separates node-level churn from agent-level readiness: each agent transitions among warm, cold, and off states, and the paper examines how these dynamics affect the effectiveness and cost-quality trade-offs of decentralized skill-based discovery protocols.
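The warm/cold/off lifecycle can be pictured as a small state machine. The sketch below is only an illustration: the summary does not specify the transition rule, so deriving the state from idle time, and the `cool_after` and `off_after` thresholds, are assumptions made here.

```python
from dataclasses import dataclass
from enum import Enum


class AgentState(Enum):
    WARM = "warm"   # loaded and able to answer immediately
    COLD = "cold"   # reachable, but must pay a startup delay first
    OFF = "off"     # not reachable at all


@dataclass
class Agent:
    skill: str
    cool_after: float      # hypothetical: idle seconds before warm -> cold
    off_after: float       # hypothetical: idle seconds before cold -> off
    last_used: float = 0.0

    def state_at(self, now: float) -> AgentState:
        """Derive the state purely from idle time (an assumed policy)."""
        idle = now - self.last_used
        if idle < self.cool_after:
            return AgentState.WARM
        if idle < self.off_after:
            return AgentState.COLD
        return AgentState.OFF

    def touch(self, now: float) -> None:
        """Record a served request, which resets the idle clock."""
        self.last_used = now


agent = Agent(skill="translate", cool_after=30.0, off_after=300.0)
states = [agent.state_at(t) for t in (10.0, 60.0, 600.0)]
```

With these made-up thresholds the agent is warm at 10 s of idleness, cold at 60 s, and off at 600 s; calling `touch` while the agent is still reachable makes it warm again.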

Model and Metrics

The system models agents as co-residents on physical nodes within a time-varying overlay graph, either structured (Kademlia) or gossip-based (Cyclon+Vicinity). Each agent’s runtime state governs its responsiveness: cold agents incur startup latencies, while off agents are unreachable even though stale directory records may still point to them. Discovery queries target skill capabilities rather than specific agents, and a successful route depends on both infrastructure robustness and agent readiness.
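For readers unfamiliar with the structured baseline, Kademlia places keys and nodes in one identifier space and measures closeness by bitwise XOR. The sketch below shows that metric applied to a skill-keyed lookup. How the paper maps skills to keys is not stated in this summary, so hashing the skill name with SHA-1, the 32-bit key width, and the helper names `key_for` and `closest_nodes` are all assumptions for illustration.

```python
import hashlib


def key_for(name: str, bits: int = 32) -> int:
    """Hash a skill or node name into a fixed-width integer key (assumed scheme)."""
    digest = hashlib.sha1(name.encode("utf-8")).digest()  # 160-bit digest
    return int.from_bytes(digest, "big") >> (160 - bits)


def xor_distance(a: int, b: int) -> int:
    """Kademlia's distance between two keys is their bitwise XOR."""
    return a ^ b


def closest_nodes(skill: str, node_ids: list[int], k: int = 2) -> list[int]:
    """Return the k node IDs closest to the skill key under XOR distance."""
    target = key_for(skill)
    return sorted(node_ids, key=lambda n: xor_distance(n, target))[:k]


# Eight hypothetical nodes; the two closest to the skill key would hold its record.
nodes = [key_for(f"node-{i}") for i in range(8)]
holders = closest_nodes("translate", nodes)
```

A real Kademlia lookup reaches these nodes iteratively through routing-table buckets rather than by sorting a global list; the global sort here only makes the distance metric visible.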

The paper defines end-to-end latency as the aggregate of discovery, routing, and potential startup delays:

$L(q,a) = L_{\text{disc}}(q) + L_{\text{route}}(q,a) + L_{\text{start}}(a)$

Success rate $S(q,a)$ quantifies whether a discovered agent is reachable. Crucially, the "useful availability" metric $U_\Delta$ captures whether service is delivered within a meaningful deadline, operationalized as the expected fraction of requests served within latency budget $\Delta$.
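A minimal sketch of how these three quantities could be computed from a request log follows. The `Outcome` record, the sample latencies, and the choice to count unreachable requests as missing every budget are assumptions made for illustration, not the paper's measurement code.

```python
from dataclasses import dataclass


@dataclass
class Outcome:
    reachable: bool   # S(q, a): was the discovered agent reachable?
    l_disc: float     # discovery latency, seconds
    l_route: float    # routing latency, seconds
    l_start: float    # startup latency, seconds (0 for a warm agent)

    @property
    def latency(self) -> float:
        """End-to-end latency L(q, a) as the sum of the three components."""
        return self.l_disc + self.l_route + self.l_start


def success_rate(outcomes: list[Outcome]) -> float:
    """Fraction of requests whose discovered agent was reachable."""
    return sum(o.reachable for o in outcomes) / len(outcomes)


def useful_availability(outcomes: list[Outcome], budget: float) -> float:
    """Fraction of requests that were reachable AND finished within the budget."""
    served = sum(o.reachable and o.latency <= budget for o in outcomes)
    return served / len(outcomes)


# Hypothetical log: one warm hit, one cold start, one stale record, one slow lookup.
outcomes = [
    Outcome(True, 0.05, 0.02, 0.0),
    Outcome(True, 0.05, 0.02, 2.0),
    Outcome(False, 0.05, 0.0, 0.0),
    Outcome(True, 0.10, 0.05, 0.0),
]
```

On this made-up log the success rate is 0.75, yet `useful_availability(outcomes, 0.5)` is only 0.5 because the cold start misses the half-second budget; raising the budget to 3 seconds brings the two metrics together at 0.75.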

Experimental Regimes and Design

Simulations vary five principal factors: overlay type, node-level churn (with controlled session/downtime cycles), agent cooling (with variable warm/cold/suspended periods), cold-start cost, and workload structure. Four core regimes are evaluated:

Stable Baseline (E1): Minimal churn, most agents remain warm.
Node Churn Only (E2): Progressive node failures/recoveries, agents retained warm.
Agent Cooling Only (E3): Stable nodes, aggressive cooling of agents after inactivity.
Combined Churn (E4): Node churn and agent cooling co-occur, yielding nonlinear interactions.

Secondary workload sensitivity analyses (E5) explore effects of semantic sparsity and agent specialization.
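The four core regimes form a two-by-two grid over node churn and agent cooling, crossed with the two overlays. A sketch of that grid as configuration is shown here; the boolean flags are a simplification (E1 has minimal rather than zero churn), and the `Regime` type and `experiment_grid` helper are hypothetical names, not taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Regime:
    code: str
    name: str
    node_churn: bool      # simplified on/off flag for node-level churn
    agent_cooling: bool   # simplified on/off flag for aggressive cooling


REGIMES = [
    Regime("E1", "Stable Baseline", node_churn=False, agent_cooling=False),
    Regime("E2", "Node Churn Only", node_churn=True, agent_cooling=False),
    Regime("E3", "Agent Cooling Only", node_churn=False, agent_cooling=True),
    Regime("E4", "Combined Churn", node_churn=True, agent_cooling=True),
]

OVERLAYS = ["Kademlia", "Cyclon+Vicinity"]


def experiment_grid() -> list[tuple[str, Regime]]:
    """Every (overlay, regime) pair that a full comparison would run."""
    return [(overlay, regime) for regime in REGIMES for overlay in OVERLAYS]


grid = experiment_grid()
```

This yields eight overlay-regime cells; the E5 sensitivity runs would add workload parameters on top of the same grid.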

Empirical Regime Mapping and Key Findings

Structured overlays (Kademlia) demonstrate robust lookup and higher success rates under stable and node-churn regimes. However, their maintenance overhead is substantially higher than that of gossip-based overlays, which cost less, degrade gracefully under churn, and are often preferable when service readiness (i.e., avoidance of cold starts) dominates end-to-end usability.

Figure 1: Latency/overhead trade-off across the main operating regimes, highlighting divergent cost-quality profiles for structured versus unstructured overlays.

Cyclon+Vicinity retains competitive latency and overhead profiles in moderate-to-aggressive churn settings, and often surpasses Kademlia in regimes where agent cooling is pronounced. The key separation is not purely in success rate but in how maintenance burden is managed: structured overlays pay for index republishing and freshness, while unstructured overlays rely on demand-driven gossip traffic.

Figure 2: Maintenance vs request-plane cost illustrates that structured overlays require substantially more background maintenance.

Success rates deteriorate as instability intensifies, but the trade-off becomes regime-dependent: structured overlays are more robust in maintaining discovery success, yet fail to convert this into usable service when agent warmness policies and cold-start costs dominate.

Figure 3: Success rate across regimes, illustrating robustness advantage under node churn for structured overlays.

The useful availability metric $U_\Delta$ exposes when routing robustness translates into timely service and when agent readiness is the ultimate bottleneck. Kademlia is dominant for tight latency budgets in stable and node-churn regimes; under aggressive cooling, Cyclon+Vicinity yields superior $U_\Delta$ as cold-start delays swamp routing precision.
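A toy calculation can show why readiness may outweigh routing speed. The model below is an assumption made for illustration: every request is either warm (paying only the routing latency) or cold (paying routing plus a fixed cold-start delay). The two profiles and all numbers are invented and do not correspond to measured Kademlia or Cyclon+Vicinity results.

```python
def u_delta(route_latency: float, warm_fraction: float,
            cold_start: float, budget: float) -> float:
    """Useful availability under a two-point latency model (assumed, not the paper's)."""
    warm_ok = route_latency <= budget                # warm requests meet the budget?
    cold_ok = route_latency + cold_start <= budget   # cold requests meet the budget?
    return warm_fraction * warm_ok + (1.0 - warm_fraction) * cold_ok


COLD_START = 2.0  # hypothetical cold-start delay, seconds

# Profile A: fast routing, but only half of the requests land on warm agents.
# Profile B: slower routing, but 80% of the requests land on warm agents.
profile_a = {"route_latency": 0.05, "warm_fraction": 0.5}
profile_b = {"route_latency": 0.20, "warm_fraction": 0.8}

curve = {
    budget: (
        u_delta(cold_start=COLD_START, budget=budget, **profile_a),
        u_delta(cold_start=COLD_START, budget=budget, **profile_b),
    )
    for budget in (0.1, 1.0, 3.0)
}
```

In this toy setting the fast router wins at the 0.1 s budget, the profile with more warm agents wins at 1.0 s because each curve plateaus at its warm fraction, and both reach 1.0 once the budget exceeds the cold-start delay.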

Figure 4: Useful availability $U_\Delta$ vs budget $\Delta$, illustrating that agent readiness can dominate routing superiority.

Impact of Semantic Sparsity and Specialization

Semantic density effects, evaluated via agent skill specialization, are pronounced only in the extreme scenario where each agent exposes a unique skill. Otherwise, structured and unstructured overlays converge in performance; generalist agents shrink the gap in p95 latency and messaging overhead.

Figure 5: p95 latency and overhead from specialists to generalists, showing diminishing semantic effects as agents become less specialized.

Targeted Probes: Maintenance and Staleness Effects

Fine-grained maintenance probes reveal that increasing maintenance cost does not yield proportional improvements in usable discovery. Aggressive gossip or index republishing drastically raises background traffic with marginal gains in latency or success.

Figure 6: Targeted maintenance probe with 95% confidence intervals, demonstrating steep maintenance-cost increases with limited quality improvement.

Staleness diagnostics show that failures in combined churn are primarily attributable to stale routing (overlay-level decay), not merely to loss of host-belief freshness. Structured overlays recover when host beliefs are stressed, but collapse under routing staleness despite increased maintenance traffic; unstructured overlays maintain partial discovery but do not achieve a decisive quality advantage.

Figure 7: Targeted staleness diagnostics with 95% confidence intervals, attributing breakdowns predominantly to stale routing rather than stale host beliefs.

Implications and Future Directions

The study rejects the existence of a universally superior overlay for agent discovery. Structured overlays are preferred where robustness and stringent latency requirements are paramount; gossip overlays are favored in cost-sensitive, readiness-dominated environments. Overlay selection must consider churn profiles and SLOs, not asymptotic lookup costs alone. The distinction between success and useful availability is foundational for practical orchestration.

Theoretical implications extend to hybrid designs that leverage structured overlays in stable periods and unstructured overlays for recovery during overlay decay. Practical directions involve request-aware lifecycle management (aligning agent warmness with overlay health and anticipated demand), which would make $U_\Delta$ not just a diagnostic metric but a dynamic control target.

Conclusion

The separation of node-level churn and agent-level readiness defines a new evaluative landscape for decentralized agent discovery, in which the two overlay types occupy distinct cost-quality frontiers. Structured overlays deliver higher success and robustness but at elevated maintenance cost; unstructured overlays incur lower maintenance overhead and can deliver lower latency when readiness constraints dominate. Useful availability $U_\Delta$ is a better indicator of real deployment quality than success rate alone. Future research should focus on adaptive, hybrid overlay mechanisms and agent lifecycle policy optimization that tightly integrate routing integrity and dynamic readiness for scalable, usable decentralized AI infrastructures.