OLAF: Programmable Data Plane Acceleration for Asynchronous Distributed Reinforcement Learning (2507.05876v1)

Published 8 Jul 2025 in cs.NI and cs.AR

Abstract: Asynchronous Distributed Reinforcement Learning (DRL) can suffer from degraded convergence when model updates become stale, often as a result of network congestion and packet loss during large-scale training. This work introduces a network data-plane acceleration architecture that mitigates such staleness by enabling inline processing of DRL model updates as they traverse the accelerator engine. To this end, we design and prototype a novel queueing mechanism that opportunistically combines compatible updates sharing a network element, reducing redundant traffic while preserving update utility. Complementing this, we provide a lightweight transmission control mechanism at the worker nodes that is guided by feedback from the in-network accelerator. To assess model utility at line rate, we introduce the Age-of-Model (AoM) metric as a proxy for staleness and verify global fairness and responsiveness properties using formal verification. Our evaluations demonstrate that this architecture significantly reduces update staleness and congestion, ultimately improving the convergence rate in asynchronous DRL workloads.
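
The abstract describes two ideas that can be illustrated at a high level: an Age-of-Model (AoM) staleness proxy and a queue that opportunistically combines compatible updates sharing a network element. The Python sketch below is not taken from the paper and does not model the in-network, line-rate implementation; it is a minimal host-level illustration under our own assumptions. The names Update, aom, combine_compatible, and the max_aom threshold are hypothetical, and "compatible" is assumed here to mean updates computed against the same base model version.

    from dataclasses import dataclass
    from typing import Dict, List
    import numpy as np

    @dataclass
    class Update:
        worker_id: int
        base_version: int       # global model version the gradient was computed against
        gradient: np.ndarray

    def aom(update: Update, current_version: int) -> int:
        # Age-of-Model: how many global model versions behind this update is.
        return current_version - update.base_version

    def combine_compatible(queue: List[Update], current_version: int,
                           max_aom: int = 4) -> List[Update]:
        # Drop overly stale updates, then merge updates sharing a base version
        # into a single aggregate, reducing redundant traffic through the
        # shared network element while preserving update utility.
        fresh = [u for u in queue if aom(u, current_version) <= max_aom]
        by_version: Dict[int, List[Update]] = {}
        for u in fresh:
            by_version.setdefault(u.base_version, []).append(u)
        combined = []
        for version, group in by_version.items():
            merged = np.mean([u.gradient for u in group], axis=0)
            combined.append(Update(worker_id=-1, base_version=version, gradient=merged))
        return combined

In the paper's architecture this logic would run inline in a programmable data plane rather than on a host, and the worker-side transmission control would pace sends based on feedback from the accelerator; the sketch above only conveys the combining-and-staleness intuition.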
