
Pathways: Asynchronous Distributed Dataflow

Updated 15 December 2025
  • Pathways is an asynchronous distributed dataflow system that orchestrates large-scale, heterogeneous computations using sharded dataflow graphs and decentralized scheduling.
  • It supports both regular SPMD and irregular MPMD workloads, optimizing resource utilization across diverse hardware accelerators like TPUs and GPUs.
  • The architecture separates control and data planes, leveraging high-bandwidth interconnects and masterless coordination to efficiently manage data movement and latency.

Pathways refers to a class of asynchronous distributed dataflow systems designed for the orchestration of large-scale heterogeneous computations, especially in machine learning and data-intensive workflows. The method generalizes from classic SPMD to more irregular, fine-grained MPMD workloads and supports both synchronous and asynchronous patterns at scale by leveraging sharded dataflow graphs, decentralized scheduling, and fine-grained control/data separation. Recent systems such as Pathways (Barham et al., 2022), DFlow (Shi et al., 2023), and AIR (Venugopal et al., 2020) exemplify different architectural approaches and performance tradeoffs in this field.

1. Architectural Overview and Design Goals

Asynchronous distributed dataflow systems are constructed around the intent to manage and schedule parallel computations efficiently across thousands of hardware accelerators (TPUs, GPUs), networked nodes, or containers. Key architectural goals include:

  • Workload Flexibility: Seamless support for both regular SPMD (single-program, multiple-data) execution and irregular MPMD (multiple-program, multiple-data) models, e.g., transformer pipelines, Mixture-of-Experts, and sparsely routed models.
  • Scalability: Capacity to operate over thousands of devices, multiple islands/pods, and heterogeneous accelerators.
  • Virtualization and Multi-tenancy: Abstract physical topology via virtual devices/slices, enabling dynamic assignment and transparent hardware sharing.
  • Single-controller or Masterless Operation: Some systems centralize resource allocation and scheduling (Pathways), while others eliminate master nodes and operate peer-to-peer (AIR).
  • Efficient Data Movement: Coordination of bulk data via high-bandwidth interconnect (ICI, NVLink) and finely-batched control traffic over commodity networks (RDMA, DCN).

Pathways (Barham et al., 2022) features a high-level architecture composed of client interfaces, a single resource manager (controller), a sharded dataflow engine (Plaque), centralized schedulers per accelerator island, per-host executors, and inter-island/high-speed interconnect layers.

DFlow (Shi et al., 2023) introduces separated global and local scheduling, tightly coupled with a distributed object store for efficient, decentralized data exchange, suited for serverless workflows.

AIR (Venugopal et al., 2020) implements purely masterless, peer-to-peer orchestration with all nodes running identical code, using asynchronous iterative routing for dynamic sharding and message passing.

2. Dataflow Graph Modeling and Sharding

The core abstraction is the sharded dataflow graph $G = (V, E)$, a directed bipartite model where:

  • Operators $O = \{o_1, o_2, \dots\}$ represent compiled functions or execution units, each possibly sharded across many devices (shard $i$ of operator $o_j$ is written $o_{j,i}$).
  • Futures $F = \{f_1, f_2, \dots\}$ denote tensor data tokens, each produced by a single operator.
  • Edges $E \subset (O \times F) \cup (F \times O)$ describe production and consumption relationships.

An operator $o$ is ready to run at time $t$ iff $\forall f \in \mathrm{Pred}(o)$, $f$ is available. Each data token/future encapsulates device placement and shard-routing metadata. Sparse and fine-grained sharding is supported: producers can selectively send outputs to subsets of downstream consumer shards, and readiness is tracked via receipt progress, enabling advanced patterns such as dynamic MoE routing.
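
As a concrete illustration, here is a minimal Python sketch of this graph abstraction (the names Future, Operator, and is_ready are illustrative, not Pathways APIs):

from dataclasses import dataclass, field

@dataclass
class Future:
    """A tensor data token: produced by exactly one operator shard."""
    name: str
    device: str | None = None    # device placement metadata
    available: bool = False      # flipped once the producer writes the shard

@dataclass
class Operator:
    """A compiled function, possibly sharded across many devices."""
    name: str
    inputs: list[Future] = field(default_factory=list)    # Pred(o)
    outputs: list[Future] = field(default_factory=list)

def is_ready(op: Operator) -> bool:
    # o is ready at time t iff every f in Pred(o) is available
    return all(f.available for f in op.inputs)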

AIR (Venugopal et al., 2020) formalizes dynamic sharding using deterministic hashes and window-based aggregation keys across the MPI rank space ($j = h(x.\mathrm{key}) \bmod P$), generalizing to stateless and stateful operators.
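
The routing rule itself is one line; a sketch using SHA-1 as a stand-in for AIR's deterministic hash $h$ (the function name shard_for is hypothetical):

import hashlib

def shard_for(key: str, P: int) -> int:
    # AIR-style dynamic sharding: j = h(key) mod P over P MPI ranks
    h = int.from_bytes(hashlib.sha1(key.encode()).digest()[:8], "big")
    return h % P

shard_for("user-42", P=224)   # the same key always routes to the same rank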

3. Control and Data Plane Separation

Pathways (Barham et al., 2022) cleanly separates control-plane (scheduling, resource allocation, buffer management) from data-plane (tensor movement, collective operations):

  • Control-plane traffic traverses the DCN and is managed via a distributed engine (Plaque) that tracks operator dependencies as futures and coordinates scheduling actions.
  • Data-plane traffic moves tensor payloads directly over high-bandwidth links (ICI for TPUs; NVLink for GPUs), bypassing host involvement wherever possible.

Asynchrony permits speculative preparation and host-side pipelining of function dispatch, overlapping the control path with accelerator execution time and hiding $O(\mu s \rightarrow ms)$ control latencies.
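
The latency-hiding effect can be mimicked with ordinary Python futures (a toy model of host-side pipelining, not the Plaque engine; timings are invented):

import time
from concurrent.futures import ThreadPoolExecutor

def dispatch(step):
    time.sleep(0.001)    # ~1 ms of control-plane work (scheduling, buffer setup)
    return step

def execute_on_accelerator(step):
    time.sleep(0.035)    # ~35 ms compiled function; long enough to hide dispatch

with ThreadPoolExecutor(max_workers=1) as host:
    pending = host.submit(dispatch, 0)
    for step in range(1, 5):
        next_pending = host.submit(dispatch, step)   # prepare step k+1 ...
        execute_on_accelerator(pending.result())     # ... while step k executes
        pending = next_pending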

DFlow (Shi et al., 2023) applies similar principles in serverless compute: dataflow-based scheduling triggers container invocation ahead of dependency resolution and leverages in-memory chunked storage (DStore) to block/wake functions precisely as dependent data becomes available.
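
A minimal block/wake sketch (threading.Condition standing in for DStore's wake-up machinery; the ChunkStore class is illustrative):

import threading

class ChunkStore:
    """Toy in-memory chunk store: consumers block until a chunk is published."""
    def __init__(self):
        self._chunks = {}
        self._cv = threading.Condition()

    def put(self, key, payload):
        with self._cv:
            self._chunks[key] = payload
            self._cv.notify_all()    # wake any function waiting on this data

    def get(self, key):
        with self._cv:
            self._cv.wait_for(lambda: key in self._chunks)   # block until available
            return self._chunks[key]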

AIR (Venugopal et al., 2020) eschews any centralized barrier or master node, using listener, processor, and sender threads per channel, strictly synchronized via local pthread mutexes and condition variables, with all interprocess coordination handled through tagged MPI messaging.

4. Scheduling, Gang-Scheduling, and Resource Allocation

Pathways utilizes a centralized gang-scheduler per accelerator island; a Python rendering of its dispatch loop (reconstructed from the paper's pseudocode, with illustrative names) follows:

from collections import deque

def gang_scheduler_loop(events, allocate_slices, dispatch):
    """Per-island gang scheduler: launches every shard of a ready function together."""
    ready_queue = deque()                    # initialize ready_queue = ∅
    while True:
        event = events.get()                 # wait for new_request or shard_done
        for r in event.new_scheduling_requests:
            ready_queue.append(r)            # insert according to policy (FIFO here)
        if ready_queue:                      # ready_queue ≠ ∅
            r = ready_queue.popleft()
            accelerator_ids = allocate_slices(r.shape, r.size)
            for s in r.shards:               # gang-dispatch all shards of the function
                dispatch(shard_id=s.id,
                         accel_id=accelerator_ids[s.id],
                         input_futures=r.input_handles[s.id])

All shards of a function are launched synchronously, enforcing robust barrier semantics per batch, and ensuring deadlock-free collectives. Data locality is optimized by allocating contiguous mesh slices.

DFlow (Shi et al., 2023) partitions DAG workflows and assigns each worker its entry points and downstream successors, removing the need for centralized queueing. Local schedulers invoke containers as soon as they are statically reachable, overlapping execution stages and blocking only on payload readiness.
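
A sketch of that partitioning step, assuming the workflow DAG is given as an adjacency map (the round-robin worker assignment is purely illustrative; DFlow's actual placement policy may differ):

def partition_dag(dag: dict[str, list[str]], n_workers: int):
    """Give each worker its entry points and the successors of its nodes."""
    assignment = {node: i % n_workers for i, node in enumerate(dag)}
    indegree = {node: 0 for node in dag}
    for succs in dag.values():
        for s in succs:
            indegree[s] += 1
    plans = {w: {"entry_points": [], "successors": {}} for w in range(n_workers)}
    for node, succs in dag.items():
        w = assignment[node]
        plans[w]["successors"][node] = succs
        if indegree[node] == 0:          # no upstream dependency: an entry point
            plans[w]["entry_points"].append(node)
    return plans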

AIR (Venugopal et al., 2020) achieves peer-to-peer scheduling by mapping each event dynamically among ranks and pipelines processing via MPI channels and local thread concurrency.

5. Data Movement, Buffering, and Interconnect Coordination

Efficient data movement is critical. Pathways exploits an in-process sharded object store tracking HBM/DRAM buffers, with on-device RDMA "send" primitives routing outputs directly into successor shards’ buffers. Batching minimizes control-message overhead; back-pressure stalls host dispatch when buffers fill.

Bandwidth-bound throughput is modeled as $\mathrm{throughput} \approx \min\{B_{link}, R_{compute}\}$.
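
A toy model of both rules, using a bounded queue so that host dispatch stalls when device buffers fill (the buffer count and rates are invented for illustration):

import queue

device_buffers = queue.Queue(maxsize=8)   # finite HBM staging slots

def host_dispatch(tensor):
    device_buffers.put(tensor)    # blocks when full: back-pressure stalls the host

def device_consume():
    return device_buffers.get()   # frees a slot, unblocking the host

# Bandwidth-bound throughput: effective rate is min(link bandwidth, compute rate)
B_link, R_compute = 100e9, 60e9            # bytes/s, illustrative
throughput = min(B_link, R_compute)        # compute-bound here: 60 GB/s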

DFlow's DStore (Shi et al., 2023) offers metadata-only directory services distributed among master/workers. It enables receiver-driven decentralized coordination, chunked payload retrieval, and automatic block/wake-up policies to maximize link utilization ($\sim 0.9\,B_{max}$ achieved in tests). Out-of-order invocation, chunked data transfers, and multi-location metadata further promote network efficiency and minimize tail latency.

AIR (Venugopal et al., 2020) binds data transfer directly to MPI thread concurrency, using channel tags for message routing, and achieves message locality via deterministic sharding.

6. Performance Characterization and Case Studies

Measured throughput and latency in production scenarios:

  • Pathways (Barham et al., 2022): SPMD at 2048 TPUs achieves ~100% accelerator utilization for compiled functions ≥35 ms; T5-Base (270M) at 32 TPUs yields 618k tokens/s; large-scale pipeline overhead is ~1–2%, and sharding across two 512-TPU islands retains 97% of single-island throughput.
  • DFlow (Shi et al., 2023): Serverless benchmarks show up to 60% tail-latency improvement over control-flow systems (average reductions: CFlow 52%, FaaSFlow 28%, FaaSFlowRedis 20%, KNIX 36%), 2–4x bandwidth improvement, and a 5.6x reduction in cold-start time compared to CFlow.
  • AIR (Venugopal et al., 2020): Streaming analytics (SWA, YSB, windowed join) scale to 224 cores, delivering up to 12–15x throughput and multi-fold latency reduction versus Spark and Flink, sustaining 36.6 GB/s and sub-second window latency at tens of millions of events/sec.

7. System Comparisons, Limitations, and Future Directions

Comparison summary:

System   | Scheduling Model                   | Data Movement            | Notable Findings
Pathways | Single controller, gang-scheduling | RDMA/ICI on accelerator  | Matches JAX SPMD utilization; efficient pipeline/island sharding
DFlow    | Distributed scheduler              | In-memory chunked DStore | Superior tail latency, bandwidth, and cold start vs. control-flow
AIR      | Masterless (peer-to-peer)          | MPI/pthreads/direct      | Outperforms Spark/Flink by up to 15×; barrierless, highly scalable

Limitations and avenues for extension include dynamic shape and control flow support, fine-grained preemption/migration, extended resource policies (e.g., SLAs, tail isolation), heterogeneous accelerator integration, and robust fault-tolerance (checkpointing/log-replay). For DFlow and AIR, the lack of built-in fault recovery currently necessitates full rerun on failure; sharding metadata catalogues, hierarchical directories, and lineage-based recovery are recognized future steps.

A plausible implication is that further integration of dynamic routing primitives, tenant-aware scheduling, compiler-informed sparsification, and more advanced object store semantics could push utilization and flexibility even closer to theoretical maxima, particularly for emerging ML architectures and distributed workflows.
