Pathways: Asynchronous Distributed Dataflow for ML (2203.12533v1)
Abstract: We present the design of a new large scale orchestration layer for accelerators. Our system, Pathways, is explicitly designed to enable exploration of new systems and ML research ideas, while retaining state of the art performance for current models. Pathways uses a sharded dataflow graph of asynchronous operators that consume and produce futures, and efficiently gang-schedules heterogeneous parallel computations on thousands of accelerators while coordinating data transfers over their dedicated interconnects. Pathways makes use of a novel asynchronous distributed dataflow design that lets the control plane execute in parallel despite dependencies in the data plane. This design, with careful engineering, allows Pathways to adopt a single-controller model that makes it easier to express complex new parallelism patterns. We demonstrate that Pathways can achieve performance parity (~100% accelerator utilization) with state-of-the-art systems when running SPMD computations over 2048 TPUs, while also delivering throughput comparable to the SPMD case for Transformer models that are pipelined across 16 stages, or sharded across two islands of accelerators connected over a data center network.
Explain it Like I'm 14
Overview
This paper introduces Pathways, a new computer system made by Google that helps train and run very large machine learning models across thousands of special chips (like TPUs and GPUs). Think of Pathways as a smart “conductor” for a giant orchestra of computers: it decides what each chip should do, when to do it, and how to move data between them so everything runs smoothly and fast.
The big idea is to keep flexibility for new kinds of ML models while still getting top speed for today’s models. Pathways can handle complex setups, like splitting a big model into stages (pipelining), or spreading work across different groups of chips in different places, without slowing down.
Key Questions the Paper Tries to Answer
Here’s what the authors wanted to figure out, in simple terms:
- How can we run huge ML models on thousands of chips efficiently, even when the models don’t fit the “everyone does the same thing at the same time” style?
- Can we make one central controller that keeps everything organized, but still be as fast as systems where every computer runs its own copy of the program?
- How do we move data between chips quickly and in smart ways, especially when they’re in different “islands” or buildings?
- Can we share chips between different users and programs without wasting time or memory?
- Will this work for real models, like Transformers used in language tasks, and still match state-of-the-art performance?
How Pathways Works (Methods and Approach)
To make this accessible, think of Pathways like a city’s traffic system:
- The “Resource Manager” is like a city planner. It hands out “virtual streets” (virtual devices) that map onto real roads (physical chips). This helps place work in the best spots for fast communication.
- The “Scheduler” is like a traffic light system. It makes sure groups of cars (computations) start in the right order and at the same time when needed (“gang scheduling”), so nobody blocks the intersection.
Here are the main ideas explained with everyday language:
- Single Controller: Instead of running the same program separately on every computer (“multi-controller”), Pathways has one “boss program” that sends instructions to all chips. This makes it easier to handle complicated models and share resources.
- Sharded Dataflow Graph: Imagine breaking a big task into smaller pieces (“shards”), like slicing a pizza. Each piece is a node in a graph, and arrows show how data flows from one piece to another. Pathways keeps this graph compact by treating whole groups of shards as single nodes when possible.
- Futures: A “future” is just a promise of a result that will arrive later. Using futures lets Pathways start preparing the next steps before current steps are fully done, keeping chips busy.
- Parallel Asynchronous Dispatch: Instead of waiting for each step to finish before setting up the next one, Pathways prepares many steps at once in parallel. Think of setting the table for all dinner courses early, so you don't pause between courses (the short JAX sketch after this list shows the same idea in code).
- Gang Scheduling: When multiple chips must start together (like a band playing a song in sync), Pathways schedules them to avoid deadlocks or waits.
- Smart Data Movement: Pathways automatically sends data where it’s needed—sometimes over fast chip-to-chip links (like ICI on TPUs), sometimes over the data center network (DCN). It can “scatter” and “gather” data as needed, like distributing and collecting papers in a classroom.
- Integration with JAX and TensorFlow: Pathways can run existing ML code with minimal changes, and it can turn JAX functions into compiled chunks that run efficiently on TPUs.
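To make the "futures" and "parallel asynchronous dispatch" ideas concrete, here is a minimal sketch in plain JAX (not Pathways itself; `step` is just an illustrative name). A jit-compiled call returns immediately with a future-like array handle, and further steps can be enqueued against that handle before its value actually exists:

```python
# Minimal JAX sketch of the "futures" idea (illustrative, not Pathways code).
import jax
import jax.numpy as jnp

@jax.jit
def step(x):
    # One compiled "step": a matrix multiply followed by a nonlinearity.
    return jnp.tanh(x @ x)

x = jnp.ones((1024, 1024))

# Each call returns right away with a future-like array handle; the next
# step is dispatched against that handle while the previous one may still
# be running on the accelerator.
h = x
for _ in range(10):
    h = step(h)

result = h.block_until_ready()   # only here does Python wait for real values
print(result.shape)
```

Pathways applies the same principle at the whole-cluster level: the controller keeps handing out work against promised results, so the accelerators never sit idle waiting for the control plane.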
The authors test Pathways using:
- Micro-benchmarks: Tiny programs that measure overheads, like timing how fast a simple “add and reduce” step runs (a sketch of this kind of benchmark follows this list).
- Real models: Big Transformer models (including T5 variants and a 3-billion-parameter LLM), trained on up to 2,048 TPU cores.
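Below is a hedged sketch of what such a micro-benchmark might look like in JAX (the paper does not publish its harness; `add_and_reduce` and the array sizes are assumptions). The point is to compile one tiny step, then time many dispatches so the per-step overhead becomes visible:

```python
# Assumed micro-benchmark sketch, not the paper's actual harness.
import time
import jax
import jax.numpy as jnp

@jax.jit
def add_and_reduce(x, y):
    # Elementwise add, then sum everything down to a scalar.
    return jnp.sum(x + y)

x = jnp.ones((256, 256))
y = jnp.ones((256, 256))

add_and_reduce(x, y).block_until_ready()   # warm-up: compile once, outside the timed loop

n_steps = 100
start = time.perf_counter()
for _ in range(n_steps):
    out = add_and_reduce(x, y)              # each call only enqueues work
out.block_until_ready()                     # wait once, after the whole batch is enqueued
elapsed = time.perf_counter() - start
print(f"average time per step: {elapsed / n_steps * 1e6:.1f} us")
```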
Main Findings and Why They Matter
- Near 100% Accelerator Utilization: Pathways can keep thousands of chips almost fully busy, matching the speed of top systems like JAX when training standard “SPMD” models (where all chips run the same code on different data; a small code sketch after this list shows the SPMD pattern).
- Works Across Multiple TPU Pods: It’s the first system designed to run programs spanning multiple TPU groups (“pods”), meaning it scales beyond a single cluster.
- Pipelining Performance: For Transformer models split into up to 16 stages, Pathways achieves throughput close to the standard SPMD method, even when the stages are spread across different islands of chips connected over a data center network.
- Parallel Dispatch Helps: When computations are short, organizing and preparing the next steps in parallel prevents the system from stalling. This keeps performance high.
- Efficient Multi-Tenancy: Pathways can share chips among different users and programs, scheduling fairly (e.g., giving each program a proportional share), with little or no switching overhead.
- Flexible Yet Fast: Unlike older single-controller systems that slowed down when models got complex, Pathways retains the flexibility for future ML designs without sacrificing performance.
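For readers unfamiliar with SPMD, here is a minimal JAX sketch of the pattern (not from the paper; `spmd_step` is an illustrative name). Every device runs the same compiled function on its own slice of the data, and the partial results are combined with a collective operation over the interconnect:

```python
# Minimal SPMD sketch in plain JAX (illustrative only).
import functools
import jax
import jax.numpy as jnp

n_dev = jax.local_device_count()   # e.g. 8 TPU cores, or 1 on a plain CPU

@functools.partial(jax.pmap, axis_name="i")
def spmd_step(x):
    # Every device sums its own shard, then all devices combine the
    # partial sums with a collective all-reduce.
    return jax.lax.psum(jnp.sum(x), axis_name="i")

# One leading-axis entry per device: same program, different data.
data = jnp.arange(n_dev * 4, dtype=jnp.float32).reshape(n_dev, 4)
print(spmd_step(data))             # the identical global sum, replicated on every device
```

Pathways matches this well-trodden SPMD case while also supporting setups that plain SPMD cannot express, such as pipelines whose stages run different programs.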
These results are important because they show you don’t have to choose between flexibility and speed. Pathways can handle modern and future ML models—like those with pipelines, sparse computations, or mixtures of experts—without collapsing under the coordination and data movement challenges.
Implications and Impact
- Faster Progress on Large Models: Researchers can try out new ideas (like fine-grained control flow, mixtures of experts, or cross-task sharing) without rewriting everything to fit a rigid style.
- Better Use of Expensive Hardware: Pathways makes it easier to share accelerators and run multiple tasks together efficiently, which reduces waste and cost.
- Scales to Future Needs: As ML models grow and become more complex, and as chip clusters get bigger and more varied, Pathways’s design will still work.
- Easier Experimentation: Because Pathways integrates with familiar tools like JAX and TensorFlow, teams can adopt it without large code changes and still get high performance.
In short, Pathways is like upgrading from many separate band leaders to a single expert conductor who can handle many styles of music, coordinate musicians across multiple stages, and keep everything playing in sync—at full speed—today and as the music gets more complex in the future.